TheR1sing3un opened a new pull request, #7808:
URL: https://github.com/apache/paimon/pull/7808
## Background
PR #7742 fixed ``with_limit`` at the **scan** layer: ``TableScan`` /
``FileScanner`` now drop splits whose row counts exceed the budget.
The **reader** layer, however, still drained every retained split to
completion before the consumer trimmed the result. On PK
merge-on-read in particular, ``with_limit(5)`` would merge
hundreds or thousands of rows per split and discard all but the first
five at ``to_arrow``, so the IO and CPU cost did not shrink with
the limit value.
## Effect
The same query now stops at exactly N rows. The merge pipeline gains a
``LimitedRecordReader`` wrapper at its outermost stage, and
``TableRead`` tracks a counter across splits so it stops opening
further splits once the budget is met. The Ray path is capped on top
with ``ds.limit(N)`` so independent workers can't collectively
overshoot.
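
To make the short-circuit idea concrete, here is a minimal, self-contained sketch of a row-limit wrapper. It is not the code in this PR: ``CountingListReader``, ``LimitedReaderSketch`` and ``drain`` are hypothetical names, the ``read_batch`` / ``close`` interface is simplified to plain Python iterators, and treating a negative limit as zero is an assumption.

```python
from typing import Iterator, List, Optional


class CountingListReader:
    """Hypothetical inner reader: serves pre-built batches and counts pulls."""

    def __init__(self, batches: List[List[int]]):
        self._batches = iter(batches)
        self.read_batch_calls = 0
        self.closed = False

    def read_batch(self) -> Optional[Iterator[int]]:
        self.read_batch_calls += 1
        batch = next(self._batches, None)
        return iter(batch) if batch is not None else None

    def close(self) -> None:
        self.closed = True


class LimitedReaderSketch:
    """Illustrative wrapper: yields at most `limit` rows from `inner`."""

    def __init__(self, inner: CountingListReader, limit: int):
        self._inner = inner
        # Assumption: a negative limit behaves like zero.
        self._remaining = max(limit, 0)

    def read_batch(self) -> Optional[Iterator[int]]:
        if self._remaining <= 0:
            return None  # budget met: the inner reader is not pulled again
        batch = self._inner.read_batch()
        if batch is None:
            return None
        return self._take(batch)

    def _take(self, batch: Iterator[int]) -> Iterator[int]:
        for row in batch:
            self._remaining -= 1
            yield row
            if self._remaining <= 0:
                return  # stop mid-batch without consuming further rows

    def close(self) -> None:
        self._inner.close()  # propagate close to the wrapped reader


def drain(reader) -> List[int]:
    """Consume each batch fully before requesting the next one."""
    rows: List[int] = []
    while (batch := reader.read_batch()) is not None:
        rows.extend(batch)
    return rows
```

With three inner batches of three rows and a limit of 5, ``drain`` returns five rows and the inner reader's ``read_batch_calls`` stays at 2, which is the kind of property a does-not-drain-inner test can assert. The sketch assumes each batch is consumed before the next is requested.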
## Commits
1. **Add LimitedRecordReader for row-level limit pushdown** — the
wrapper plus 9 unit tests, including a ``read_batch_calls``
counter assertion that proves the inner reader is not pulled past
the limit.
2. **Push limit down through TableRead and MergeFileSplitRead** —
``ReadBuilder.new_read`` → ``TableRead.limit`` → cross-split
counter in ``to_iterator`` / ``_arrow_batch_generator`` →
``MergeFileSplitRead`` wraps the merge unwrap → ``RayDatasource``
forwards the limit and ``read_paimon`` / ``to_ray`` cap the final
Ray Dataset.
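
The cross-split counter in the second commit can be pictured roughly as follows. This is a hedged sketch rather than the actual ``to_iterator`` / ``_arrow_batch_generator`` code: ``read_with_budget`` and ``open_split`` are hypothetical names, and plain Python lists stand in for Arrow record batches.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def read_with_budget(
    splits: Iterable[str],
    open_split: Callable[[str], Iterable[List[int]]],
    limit: Optional[int],
) -> Iterator[List[int]]:
    """Yield batches across splits, stopping once `limit` rows were emitted."""
    remaining = limit
    for split in splits:
        if remaining is not None and remaining <= 0:
            return  # budget met: later splits are never opened
        for batch in open_split(split):
            if remaining is None:
                yield batch  # no limit configured: pass through
                continue
            if len(batch) > remaining:
                batch = batch[:remaining]  # trim the batch that crosses the limit
            remaining -= len(batch)
            yield batch
            if remaining <= 0:
                return
```

The point of keeping one counter outside the per-split loop is that the budget is shared: a split that exhausts it prevents every later split from being opened at all, instead of each split applying the limit independently.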
## Tests
- 9 unit tests for ``LimitedRecordReader`` (batch / iterator / close
propagation / zero / negative / does-not-drain-inner).
- 8 e2e cases in ``test_limit_pushdown.py``: append-only single
split, spans-multiple-splits, zero, oversize, PK merge with
multiple snapshots (4 different N values), PK merge with predicate
+ limit, and the ``to_iterator`` consumer.
- Existing ``reader_*_test.py`` limit cases switch from the old
"first-split-full" expectation to the new exact-N expectation.
All read-path regression tests pass locally (85/85 across
``reader_pk``, ``reader_append_only``, ``file_store_commit``,
streaming scan, split provider, ray integration).
## Out of scope
- DataEvolution + limit row-level short-circuit: that path returns
RecordBatchReaders end-to-end, which needs a separate batch-slice
treatment; left as a follow-up.