TheR1sing3un commented on code in PR #340:
URL: https://github.com/apache/paimon-rust/pull/340#discussion_r3385148937
##########
crates/paimon/src/table/kv_file_reader.rs:
##########
@@ -328,7 +341,14 @@ impl KeyValueFileReader {
user_sequence_indices.clone(),
value_indices.clone(),
merge_output_schema.clone(),
- Self::new_merge_function(merge_engine, &table_options,
&table_name)?,
+ Self::new_merge_function(
Review Comment:
Thanks for the careful repro — confirmed the behavior. I'd like to handle
this in a separate follow-up PR rather than in this one, for two reasons.
1. It's a pre-existing bug, not something this PR introduces. Both
Deduplicate and PartialUpdate already go through the same for split in &splits
{ SortMergeReaderBuilder::new(...) } pattern in kv_file_reader.rs, and the same
per-split bin-packing in table_scan.rs (split_for_batch(data_files,
target_split_size, open_file_cost) on the non-data-evolution
path). With source.split.target-size=1b, dedup/partial-update will also
miss cross-split merges whenever file key ranges overlap — aggregation just
makes the symptom more visible.
2. The fix is scoped to scan planning, not to the aggregation merge
function. The minimal correctness fix is to keep all files of a (partition,
bucket) in a single split for PK tables (skip split_for_batch on that path).
The cleaner fix mirrors Java's MergeTreeSplitGenerator, grouping files into
sections by key-range overlap. Either way, the change lives in
table_scan.rs and affects all three PK merge engines, so it really wants
its own PR + test surface.
Plan: get this PR merged for the aggregation engine itself (which is
correct under the default source.split.target-size), then open the follow-up PR
immediately after. Happy to link it back to this thread. Does that work for you?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]