JunRuiLee commented on PR #7968:
URL: https://github.com/apache/paimon/pull/7968#issuecomment-4541896390

   > Thanks for the patch. I reviewed the latest version and ran the focused 
tests:
   > 
   > ```
   > PYTHONPATH=. python -m unittest \
   >   pypaimon.tests.test_first_row_merge_function \
   >   pypaimon.tests.test_first_row_e2e \
   >   pypaimon.tests.test_partial_update_e2e
   > ```
   > 
   > They pass, but I think there is still a correctness gap before enabling 
`first-row` generally in pypaimon.
   > 
   > The new `FirstRowMergeFunction` is only used by `MergeFileSplitRead`, i.e. 
when a PK split is not `raw_convertible`. However, a single-file PK split is 
still raw-convertible, so `TableRead` dispatches it through `RawFileSplitRead` 
and bypasses the merge function completely. This means duplicate keys inside 
one data file are returned as duplicate rows instead of applying first-row 
semantics.
   > 
   > A minimal reproduction on this PR is:
   > 
   > ```python
   > schema = Schema.from_pyarrow_schema(
   >     pa.schema([pa.field('id', pa.int64(), nullable=False), ('a', 
pa.string())]),
   >     primary_keys=['id'],
   >     options={'bucket': '1', 'merge-engine': 'first-row'},
   > )
   > # one commit / one data file
   > w.write_arrow(pa.Table.from_pylist([
   >     {'id': 1, 'a': 'first'},
   >     {'id': 1, 'a': 'second'},
   > ], schema=pa_schema))
   > ```
   > 
   > The planned split is `raw_convertible=True`, and reading returns both rows:
   > 
   > ```
   > [{'id': 1, 'a': 'first'}, {'id': 1, 'a': 'second'}]
   > ```
   > 
   > For Java writers this is normally reduced before writing the file, but 
pypaimon can create such a file today because the Python PK writer only sorts 
by PK/sequence and does not merge duplicate keys before writing. So enabling 
`first-row` in the read support check can expose incorrect results for data 
produced by pypaimon itself.
   > 
   > I think we should either make pypaimon writes reduce duplicate PKs for 
`first-row`/`deduplicate`, or ensure the read path does not mark such splits 
raw-convertible when the data file may contain duplicate keys. At minimum, this 
case needs a regression test before considering the engine supported.
   
   Thanks for pointing this out.
   
   I mentioned this issue already in the PR description as an out-of-scope 
writer-side invariant issue: pypaimon can currently write duplicate PKs into a 
single data file, while `raw_convertible` reads assume each PK file has already 
been folded according to the merge engine.
   
     So my intended scope for this PR is only:
     1. implement `FirstRowMergeFunction`;
     2. allow `first-row` in the Python merge-read path when 
`MergeFileSplitRead` is used;
     3. keep the single-file duplicate-PK case tracked by #7759.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to