JingsongLi commented on code in PR #340:
URL: https://github.com/apache/paimon-rust/pull/340#discussion_r3384978245


##########
crates/paimon/src/table/kv_file_reader.rs:
##########
@@ -328,7 +341,14 @@ impl KeyValueFileReader {
                     user_sequence_indices.clone(),
                     value_indices.clone(),
                     merge_output_schema.clone(),
-                    Self::new_merge_function(merge_engine, &table_options, &table_name)?,
+                    Self::new_merge_function(

Review Comment:
   This creates a sort-merge reader inside the per-`DataSplit` loop, so 
aggregation is only applied within each split. `TableScan` can bin-pack files 
from the same partition/bucket into multiple splits, and the same primary key 
can appear in more than one split. In that case this emits one partial 
aggregate per split instead of one globally aggregated row. I reproduced it 
with an aggregation table using `source.split.target-size = 1b` and 
`source.split.open-file-cost = 1b`, then inserting `(1, 10)`, `(1, 20)`, `(1, 
30)` in three commits: `SELECT` returns 3 rows instead of one row with `amount 
= 60`. Aggregation/partial-update reads need either split planning that keeps 
all files for a partition/bucket together, or a second merge stage across split 
outputs.
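
To make the failure mode concrete, here is a minimal, self-contained sketch of the two read paths. It is not the code in `kv_file_reader.rs`: `Row`, `merge_split`, `read_per_split`, and `read_with_second_stage` are hypothetical names, and a plain sum stands in for the table's aggregate function.

```rust
use std::collections::BTreeMap;

/// One row of a hypothetical aggregation table: (primary key, amount).
type Row = (i64, i64);

/// Stand-in for the per-split sort-merge reader: sums `amount` per key
/// across the files of ONE split only, returning rows in key order.
fn merge_split(files: &[Vec<Row>]) -> Vec<Row> {
    let mut acc: BTreeMap<i64, i64> = BTreeMap::new();
    for file in files {
        for &(key, amount) in file {
            *acc.entry(key).or_insert(0) += amount;
        }
    }
    acc.into_iter().collect()
}

/// Behavior described in the comment: each split is merged independently
/// and the outputs are concatenated, so a key that appears in several
/// splits yields one partial aggregate per split.
fn read_per_split(splits: &[Vec<Vec<Row>>]) -> Vec<Row> {
    splits.iter().flat_map(|split| merge_split(split)).collect()
}

/// One possible fix: a second merge stage over the split outputs.
/// Assumption: the aggregate is associative (true for sum), so the
/// partial results can be merged again with the same function.
fn read_with_second_stage(splits: &[Vec<Vec<Row>>]) -> Vec<Row> {
    let partials = read_per_split(splits);
    merge_split(&[partials])
}
```

With the reproduction's three commits placed in three single-file splits, `read_per_split` returns `(1, 10)`, `(1, 20)`, `(1, 30)`, while `read_with_second_stage` returns the single row `(1, 60)`. Re-applying the same function in the second stage is only safe for associative aggregates. Partial-update and sequence-dependent aggregators would likely need the key and sequence columns carried through to the second stage.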



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
