TheR1sing3un opened a new pull request, #7742: URL: https://github.com/apache/paimon/pull/7742
## Purpose

`FileScanner._apply_push_down_limit()` only kept splits where `raw_convertible=True`, silently dropping every other split. On a PK table that has been upserted multiple times, the resulting splits typically have `raw_convertible=False` (their files overlap and need merge-on-read), so

```python
table.new_read_builder().with_limit(1).new_scan().plan().splits()
```

returned an **empty** splits list and the reader produced zero rows even though the table was non-empty.

Concretely the failure shape is:

* PK table → write a batch → write the same key range a second time → call `with_limit(N)`.
* Pre-fix: scan returns `[]`; downstream `to_arrow()` returns 0 rows.

Java's `DataTableBatchScan.applyPushDownLimit()` does the right thing: it keeps all splits, and only raw-convertible splits contribute to the row-count accumulator (because their `row_count` is the post-merge count and is therefore a usable upper bound on what the scan returns). Non-raw-convertible splits can't be cheaply counted ahead of read, so they're left in the result and the reader drains them up to the user's limit.

## Fix

Mirror the Java behaviour in `_apply_push_down_limit`:

* Keep **every** split.
* Accumulate only the row counts of raw-convertible splits.
* Stop when the accumulator reaches `self.limit`.

The early-exit semantics for raw-only inputs are preserved.

## Linked issue

N/A: surfaced when running `with_limit(1)` against a PK table that had been upserted (a common pattern on tables tracking the latest state of a key).

## Tests

`pypaimon/tests/reader_split_generator_test.py`:

* `test_limit_keeps_non_raw_convertible_splits`: writes the same key range twice on a PK table to force a non-raw-convertible split, then asserts that `with_limit(1)` still returns splits **and** the reader produces rows from them. Pre-fix this test fails with `limited_splits == []` and `num_rows == 0`.
* `test_limit_stops_after_raw_convertible_budget`: disjoint key ranges produce raw-convertible splits; with `limit=1` the scanner must short-circuit (returned splits must not exceed the unfiltered set), guarding the early-exit branch.

Local: `pytest pypaimon/tests/reader_split_generator_test.py` → 7 passed; `flake8 --config=dev/cfg.ini` clean.

## API and format

No public API change. No file format change. Behaviour change is restricted to `with_limit(N)` against PK tables whose splits had `raw_convertible=False`, which previously produced empty results.

## Documentation

Inline comment on `_apply_push_down_limit` explains the row-count contract (`raw_convertible.row_count` is post-merge → usable for limit; non-raw-convertible splits can't be cheaply counted → carried through but not accumulated) and points at the corresponding Java method.

## Generative AI disclosure

Drafted with assistance from an AI coding tool; root cause and fix verified against `org.apache.paimon.table.source.DataTableBatchScan.applyPushDownLimit()` and exercised by the regression tests above.
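
## Appendix: sketch of the limit rule

For reviewers, the split-selection rule described under "Fix" can be sketched roughly as below. This is a minimal illustration and not the code in this PR: the `Split` dataclass, its `name` field, and the free function `apply_push_down_limit` are hypothetical stand-ins, and only the `raw_convertible` and `row_count` attributes are taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Split:
    """Hypothetical stand-in for a scan split; not the pypaimon class."""
    name: str
    row_count: int
    raw_convertible: bool


def apply_push_down_limit(splits: List[Split], limit: Optional[int]) -> List[Split]:
    """Keep splits in order until raw-convertible row counts cover the limit."""
    if limit is None:
        # No limit requested: nothing to prune.
        return list(splits)

    kept: List[Split] = []
    counted_rows = 0
    for split in splits:
        # Every visited split is kept, including merge-on-read ones.
        kept.append(split)
        if split.raw_convertible:
            # Only raw-convertible splits have a trustworthy (post-merge) count.
            counted_rows += split.row_count
            if counted_rows >= limit:
                break
    return kept


# Illustrative input: one merge-on-read split followed by two raw-convertible ones.
splits = [
    Split("merge-needed", 10, False),
    Split("raw-a", 5, True),
    Split("raw-b", 5, True),
]
limited = apply_push_down_limit(splits, limit=1)
print([s.name for s in limited])  # ['merge-needed', 'raw-a']
```

In this sketch a table with only non-raw-convertible splits keeps all of them, so the reader still has data to drain up to the limit, while a run of raw-convertible splits stops as soon as their counted rows reach the limit.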
