TheR1sing3un commented on PR #7742:
URL: https://github.com/apache/paimon/pull/7742#issuecomment-4352873248
@XiaoHongbo-Hope you're right — sorry, the previous round's code drifted
from what the description claims. Pushed `3051ff69a` + `7082f1175` to fix it.
I diffed our `_apply_push_down_limit` against the current Java source
(`paimon-core/.../source/DataTableBatchScan.java:128-165`) and found three
concrete divergences I'd introduced beyond the actual bug fix:
| Aspect | Java | Previous Python fix |
|---|---|---|
| Gate | `mergedRowCount.isPresent()` | `raw_convertible` |
| Append timing | only splits where merged count is known | unconditional
(every split) |
| Fallback | none | `merged_row_count() if not None else split.row_count` |
The real bug is just the accumulator source: `split.row_count` (DV-blind,
over-counts when DV is on, causing premature early-return) →
`merged_row_count()` (DV-aware). Java already does this. The other two
divergences were Python-only behaviors I should not have added.
Reverted to a line-for-line port of the Java loop:
```python
for split in splits:
merged = split.merged_row_count()
if merged is not None:
limited_splits.append(split)
scanned_row_count += merged
if scanned_row_count >= self.limit:
return limited_splits
return splits
```
Test adjustments:
- **Removed** `test_limit_drops_non_raw_split_after_raw_budget_is_met` — its
expectation ("non-raw split survives") was based on the unconditional-append
behavior, which is gone now. Java does drop non-raw splits after the budget is
met; matching this is correct.
- **Renamed**
`test_accumulator_falls_back_to_row_count_when_merged_unavailable` →
`test_accumulator_skips_splits_with_unknown_merged_count` and rewrote the
docstring to describe the actual Java-aligned behavior.
- **Kept** `test_dv_aware_accumulator_uses_merged_row_count` as the
master-vs-fix reproducer. Verified by swapping the file body to master's
version: that test fails on master (`1 != 3`) and passes after the fix.
PR description on the next push will be updated to drop the "retains non-raw
splits" claim — only the accumulator-source change is what this PR delivers,
and that part genuinely mirrors Java line-for-line.
10/10 tests pass in `reader_split_generator_test.py`, flake8 clean.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]