JunRuiLee commented on PR #7968:
URL: https://github.com/apache/paimon/pull/7968#issuecomment-4541896390
> Thanks for the patch. I reviewed the latest version and ran the focused
tests:
>
> ```
> PYTHONPATH=. python -m unittest \
> pypaimon.tests.test_first_row_merge_function \
> pypaimon.tests.test_first_row_e2e \
> pypaimon.tests.test_partial_update_e2e
> ```
>
> They pass, but I think there is still a correctness gap before enabling
`first-row` generally in pypaimon.
>
> The new `FirstRowMergeFunction` is only used by `MergeFileSplitRead`, i.e.
when a PK split is not `raw_convertible`. However, a single-file PK split is
still raw-convertible, so `TableRead` dispatches it through `RawFileSplitRead`
and bypasses the merge function completely. This means duplicate keys inside
one data file are returned as duplicate rows instead of applying first-row
semantics.
>
> A minimal reproduction on this PR is:
>
> ```python
> schema = Schema.from_pyarrow_schema(
> pa.schema([pa.field('id', pa.int64(), nullable=False), ('a',
pa.string())]),
> primary_keys=['id'],
> options={'bucket': '1', 'merge-engine': 'first-row'},
> )
> # one commit / one data file
> w.write_arrow(pa.Table.from_pylist([
> {'id': 1, 'a': 'first'},
> {'id': 1, 'a': 'second'},
> ], schema=pa_schema))
> ```
>
> The planned split is `raw_convertible=True`, and reading returns both rows:
>
> ```
> [{'id': 1, 'a': 'first'}, {'id': 1, 'a': 'second'}]
> ```
>
> For Java writers this is normally reduced before writing the file, but
pypaimon can create such a file today because the Python PK writer only sorts
by PK/sequence and does not merge duplicate keys before writing. So enabling
`first-row` in the read support check can expose incorrect results for data
produced by pypaimon itself.
>
> I think we should either make pypaimon writes reduce duplicate PKs for
`first-row`/`deduplicate`, or ensure the read path does not mark such splits
raw-convertible when the data file may contain duplicate keys. At minimum, this
case needs a regression test before considering the engine supported.
Thanks for pointing this out.
I mentioned this issue already in the PR description as an out-of-scope
writer-side invariant issue: pypaimon can currently write duplicate PKs into a
single data file, while `raw_convertible` reads assume each PK file has already
been folded according to the merge engine.
So my intended scope for this PR is only:
1. implement `FirstRowMergeFunction`;
2. allow `first-row` in the Python merge-read path when
`MergeFileSplitRead` is used;
3. keep the single-file duplicate-PK case tracked by #7759.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]