TheR1sing3un commented on code in PR #340:
URL: https://github.com/apache/paimon-rust/pull/340#discussion_r3385148937


##########
crates/paimon/src/table/kv_file_reader.rs:
##########
@@ -328,7 +341,14 @@ impl KeyValueFileReader {
                     user_sequence_indices.clone(),
                     value_indices.clone(),
                     merge_output_schema.clone(),
-                    Self::new_merge_function(merge_engine, &table_options, 
&table_name)?,
+                    Self::new_merge_function(

Review Comment:
   Thanks for the careful repro — confirmed the behavior. I'd like to handle 
this in a separate follow-up PR rather than in this one, for two reasons.
   
     1. It's a pre-existing bug, not something this PR introduces. Both 
Deduplicate and PartialUpdate already go through the same for split in &splits 
{ SortMergeReaderBuilder::new(...) } pattern in kv_file_reader.rs, and the same 
per-split bin-packing in table_scan.rs (split_for_batch(data_files, 
target_split_size, open_file_cost) on the non-data-evolution
     path). With source.split.target-size=1b, dedup/partial-update will also 
miss cross-split merges whenever file key ranges overlap — aggregation just 
makes the symptom more visible.
     2. The fix is scoped to scan planning, not to the aggregation merge 
function. The minimal correctness fix is to keep all files of a (partition, 
bucket) in a single split for PK tables (skip split_for_batch on that path). 
The cleaner fix mirrors Java's MergeTreeSplitGenerator, grouping files into 
sections by key-range overlap. Either way, the change lives in
     table_scan.rs and affects all three PK merge engines, so it really wants 
its own PR + test surface.
   
     Plan: get this PR merged for the aggregation engine itself (which is 
correct under the default source.split.target-size), then open the follow-up PR 
immediately after. Happy to link it back to this thread. Does that work for you?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to