TheR1sing3un opened a new pull request, #7742:
URL: https://github.com/apache/paimon/pull/7742

   ## Purpose
   
   `FileScanner._apply_push_down_limit()` only kept splits where 
`raw_convertible=True`, silently dropping every other split. On a PK table that 
has been upserted multiple times, the resulting splits typically have 
`raw_convertible=False` (their files overlap and need merge-on-read), so
   
   ```python
   table.new_read_builder().with_limit(1).new_scan().plan().splits()
   ```
   
   returned an **empty** splits list and the reader produced zero rows even 
though the table was non-empty. Concretely, the failure looks like this:
   
   * PK table → write a batch → write the same key range a second time → call 
`with_limit(N)`.
   * Pre-fix: scan returns `[]`; downstream `to_arrow()` returns 0 rows.
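
The failure can be reproduced in miniature without pypaimon. The sketch below is a simplified model of the pre-fix filtering as described above, not the actual `FileScanner` code: `Split` and `apply_limit_pre_fix` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Split:
    """Hypothetical stand-in for a data split; not the pypaimon class."""
    name: str
    raw_convertible: bool
    row_count: int


def apply_limit_pre_fix(splits: List[Split], limit: int) -> List[Split]:
    """Simplified model of the pre-fix behaviour described in this PR."""
    kept: List[Split] = []
    counted = 0
    for split in splits:
        if not split.raw_convertible:
            # The bug: splits that need merge-on-read are silently dropped.
            continue
        kept.append(split)
        counted += split.row_count
        if counted >= limit:
            break
    return kept


# A PK table upserted twice over the same key range typically yields
# only non-raw-convertible splits (assumed shape for this illustration).
upserted_splits = [Split("bucket-0", raw_convertible=False, row_count=4)]
pre_fix_result = apply_limit_pre_fix(upserted_splits, limit=1)
```

Under this model `pre_fix_result` is an empty list, so a reader driven by it has nothing to read even though the table holds rows.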
   
   Java's `DataTableBatchScan.applyPushDownLimit()` does the right thing: it 
keeps all splits, and only raw-convertible splits contribute to the row-count 
accumulator (because their `row_count` is the post-merge count and is therefore 
a usable upper bound on what the scan returns). Non-raw-convertible splits 
can't be cheaply counted ahead of read, so they're left in the result and the 
reader drains them up to the user's limit.
   
   ## Fix
   
   Mirror the Java behaviour in `_apply_push_down_limit`:
   
   * Keep **every** split that is visited, whether or not it is raw-convertible.
   * Accumulate only the row counts of raw-convertible splits.
   * Stop iterating once the accumulator reaches `self.limit`.
   
   The early-exit semantics for raw-only inputs are preserved.
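
The three rules above can be sketched as follows. This is a minimal standalone model under stated assumptions, not the code in this PR: `SplitStub` and `apply_push_down_limit_sketch` are hypothetical names, and the real method operates on pypaimon split objects and `self.limit`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SplitStub:
    """Hypothetical stand-in for a data split; not the pypaimon class."""
    name: str
    raw_convertible: bool
    row_count: int


def apply_push_down_limit_sketch(
    splits: List[SplitStub], limit: Optional[int]
) -> List[SplitStub]:
    """Keep every visited split; count rows only for raw-convertible ones."""
    if limit is None:
        return list(splits)
    kept: List[SplitStub] = []
    counted = 0
    for split in splits:
        kept.append(split)  # never drop a split that was visited
        if split.raw_convertible:
            # Post-merge row count, so it can be charged against the limit.
            counted += split.row_count
            if counted >= limit:
                break  # early exit once the budget is covered
    return kept


mixed = [
    SplitStub("merge-needed", raw_convertible=False, row_count=5),
    SplitStub("raw-1", raw_convertible=True, row_count=3),
    SplitStub("raw-2", raw_convertible=True, row_count=3),
    SplitStub("raw-3", raw_convertible=True, row_count=2),
]
limited = apply_push_down_limit_sketch(mixed, limit=4)
limited_names = [s.name for s in limited]
```

With `limit=4`, the non-raw-convertible split is carried through without being counted, the two raw-convertible splits after it bring the accumulator to 6, and the loop stops before `raw-3`. The reader is still responsible for truncating to the user's limit, since the carried-through split may contribute any number of rows.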
   
   ## Linked issue
   
   N/A — surfaced when running `with_limit(1)` against a PK table that had been 
upserted (a common pattern on tables tracking the latest state of a key).
   
   ## Tests
   
   `pypaimon/tests/reader_split_generator_test.py`:
   
   * `test_limit_keeps_non_raw_convertible_splits` — writes the same key range 
twice on a PK table to force a non-raw-convertible split, then asserts that 
`with_limit(1)` still returns splits **and** the reader produces rows from 
them. Pre-fix this test fails with `limited_splits == []` and `num_rows == 0`.
   * `test_limit_stops_after_raw_convertible_budget` — disjoint key ranges 
produce raw-convertible splits; with `limit=1` the scanner must short-circuit 
(returned splits must not exceed the unfiltered set), guarding the early-exit 
branch.
   
   Local: `pytest pypaimon/tests/reader_split_generator_test.py` → 7 passed; 
`flake8 --config=dev/cfg.ini` clean.
   
   ## API and format
   
   No public API change. No file format change. Behaviour change is restricted 
to `with_limit(N)` against PK tables whose splits had `raw_convertible=False`, 
which previously produced empty results.
   
   ## Documentation
   
   Inline comment on `_apply_push_down_limit` explains the row-count contract 
(a raw-convertible split's `row_count` is post-merge, so it is usable for the 
limit; non-raw-convertible splits can't be cheaply counted, so they are carried 
through but not accumulated) and points at the corresponding Java method.
   
   ## Generative AI disclosure
   
   Drafted with assistance from an AI coding tool; root cause and fix verified 
against 
`org.apache.paimon.table.source.DataTableBatchScan.applyPushDownLimit()` and 
exercised by the regression tests above.

