Re: [PR] [python] Fix limit push-down discarding non-raw_convertible splits [paimon]

via GitHub Thu, 30 Apr 2026 06:32:39 -0700


TheR1sing3un commented on PR #7742:
URL: https://github.com/apache/paimon/pull/7742#issuecomment-4352873248


   @XiaoHongbo-Hope you're right — sorry, the previous round's code drifted 
from what the description claims. Pushed `3051ff69a` + `7082f1175` to fix it.
   
   I diffed our `_apply_push_down_limit` against the current Java source 
(`paimon-core/.../source/DataTableBatchScan.java:128-165`) and found three 
concrete divergences I'd introduced beyond the actual bug fix:
   
   | Aspect | Java | Previous Python fix |
   |---|---|---|
   | Gate | `mergedRowCount.isPresent()` | `raw_convertible` |
   | Append timing | only splits where merged count is known | unconditional 
(every split) |
   | Fallback | none | `merged_row_count() if not None else split.row_count` |
   
   The real bug is just the accumulator source: `split.row_count` (DV-blind, 
over-counts when DV is on, causing premature early-return) → 
`merged_row_count()` (DV-aware). Java already does this. The other two 
divergences were Python-only behaviors I should not have added.
   
   Reverted to a line-for-line port of the Java loop:
   
   ```python
   for split in splits:
       merged = split.merged_row_count()
       if merged is not None:
           limited_splits.append(split)
           scanned_row_count += merged
           if scanned_row_count >= self.limit:
               return limited_splits
   return splits
   ```
   
   Test adjustments:
   - **Removed** `test_limit_drops_non_raw_split_after_raw_budget_is_met` — its 
expectation ("non-raw split survives") was based on the unconditional-append 
behavior, which is gone now. Java does drop non-raw splits after the budget is 
met; matching this is correct.
   - **Renamed** 
`test_accumulator_falls_back_to_row_count_when_merged_unavailable` → 
`test_accumulator_skips_splits_with_unknown_merged_count` and rewrote the 
docstring to describe the actual Java-aligned behavior.
   - **Kept** `test_dv_aware_accumulator_uses_merged_row_count` as the 
master-vs-fix reproducer. Verified by swapping the file body to master's 
version: that test fails on master (`1 != 3`) and passes after the fix.
   
   PR description on the next push will be updated to drop the "retains non-raw 
splits" claim — only the accumulator-source change is what this PR delivers, 
and that part genuinely mirrors Java line-for-line.
   
   10/10 tests pass in `reader_split_generator_test.py`, flake8 clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [python] Fix limit push-down discarding non-raw_convertible splits [paimon]

Reply via email to