TheR1sing3un commented on PR #7742: URL: https://github.com/apache/paimon/pull/7742#issuecomment-4345621738
@XiaoHongbo-Hope you're right: the two earlier tests didn't actually distinguish the buggy and fixed implementations. Both used inputs (same-key-twice on a single bucket) where every split ended up non-raw_convertible, which means the pre-fix loop body never ran and the fallback `return splits` returned everything anyway. Thanks for catching it.

I've replaced them with a single, deterministic reproducer that does exercise the bug:

- PK table partitioned on `dt`, `bucket=1`.
- Partition `p1`: two overlapping writes on the same PK → **non-raw_convertible** split.
- Partition `p2`: one write → **raw_convertible** split with `row_count=1`.

`PrimaryKeyTableSplitGenerator` walks partitions in order, so the plan is `[non-raw (p1), raw (p2)]`. With `with_limit(1)` the pre-fix loop skips the non-raw split, then immediately early-returns after the raw one: `limited_splits=[raw]`, and p1's data is silently dropped.

End-to-end check:

```
$ git checkout origin/master -- pypaimon/read/scanner/file_scanner.py
$ pytest ...test_limit_drops_non_raw_split_after_raw_budget_is_met
FAILED ... AssertionError: 1 != 2

$ git checkout HEAD -- pypaimon/read/scanner/file_scanner.py
$ pytest ...test_limit_drops_non_raw_split_after_raw_budget_is_met
1 passed
```

Force-pushed [3b7c7484b](https://github.com/apache/paimon/pull/7742/commits/3b7c7484b) with the new reproducer and an updated commit message / PR description that walks through why the bug requires `[non-raw, raw]` ordering. PTAL.
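For readers following along, here is a minimal standalone sketch of the failure mode described above. It is not the code in `file_scanner.py`: the `Split` dataclass, both function names, and the `row_count=2` on the `p1` split are hypothetical stand-ins, and `limit_keep_non_raw` is only one plausible shape for a correction, assumed for illustration rather than taken from the PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Split:
    # Hypothetical stand-in for a planned split; not the pypaimon class.
    partition: str
    raw_convertible: bool
    row_count: int


def limit_pre_fix(splits: List[Split], limit: int) -> List[Split]:
    """Toy model of the pre-fix behaviour described in the comment."""
    limited_splits: List[Split] = []
    scanned = 0
    for split in splits:
        if not split.raw_convertible:
            continue  # non-raw splits are skipped and never collected
        limited_splits.append(split)
        scanned += split.row_count
        if scanned >= limit:
            # Early return: any non-raw split skipped earlier is lost.
            return limited_splits
    return splits  # fallback: budget never met, return everything


def limit_keep_non_raw(splits: List[Split], limit: int) -> List[Split]:
    """One possible correction (an assumption): keep non-raw splits
    encountered before the raw row budget is met."""
    limited_splits: List[Split] = []
    scanned = 0
    for split in splits:
        limited_splits.append(split)
        if split.raw_convertible:
            scanned += split.row_count
            if scanned >= limit:
                return limited_splits
    return splits


# The [non-raw (p1), raw (p2)] plan from the reproducer.
plan = [Split("p1", False, 2), Split("p2", True, 1)]
pre_fix_result = limit_pre_fix(plan, 1)
kept_result = limit_keep_non_raw(plan, 1)

# The shape of the two earlier tests: every split is non-raw.
all_non_raw = [Split("p1", False, 2), Split("p1", False, 2)]
```

On the `[non-raw, raw]` plan the two functions disagree (one split versus two), whereas on `all_non_raw` both fall through to `return splits`, which is why inputs of that shape cannot tell the implementations apart.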
