JingsongLi commented on code in PR #340:
URL: https://github.com/apache/paimon-rust/pull/340#discussion_r3384978245
##########
crates/paimon/src/table/kv_file_reader.rs:
##########
@@ -328,7 +341,14 @@ impl KeyValueFileReader {
user_sequence_indices.clone(),
value_indices.clone(),
merge_output_schema.clone(),
- Self::new_merge_function(merge_engine, &table_options,
&table_name)?,
+ Self::new_merge_function(
Review Comment:
This creates a sort-merge reader inside the per-`DataSplit` loop, so
aggregation is only applied within each split. `TableScan` can bin-pack files
from the same partition/bucket into multiple splits, and the same primary key
can appear in more than one split. In that case this emits one partial
aggregate per split instead of one globally aggregated row. I reproduced it
with an aggregation table using `source.split.target-size = 1b` and
`source.split.open-file-cost = 1b`, then inserting `(1, 10)`, `(1, 20)`,
`(1, 30)` in three commits: `SELECT` returns three rows instead of one row with
`amount = 60`. Aggregation/partial-update reads need either split planning that keeps
all files for a partition/bucket together, or a second merge stage across split
outputs.
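
To make the failure mode concrete, here is a minimal, self-contained sketch that
models it with plain vectors rather than the real reader. It assumes a simple
sum aggregator over `(key, amount)` rows; `aggregate_split`, `read_per_split`,
and `read_with_second_stage` are hypothetical names for illustration only and
are not functions in paimon-rust.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the sort-merge reader of one split:
/// sums `amount` per primary key within the given rows only.
fn aggregate_split(rows: &[(i32, i64)]) -> Vec<(i32, i64)> {
    let mut acc: BTreeMap<i32, i64> = BTreeMap::new();
    for &(key, amount) in rows {
        *acc.entry(key).or_insert(0) += amount;
    }
    acc.into_iter().collect()
}

/// Models the behavior described above: each split is aggregated on its
/// own and the per-split outputs are simply concatenated.
fn read_per_split(splits: &[Vec<(i32, i64)>]) -> Vec<(i32, i64)> {
    splits.iter().flat_map(|s| aggregate_split(s)).collect()
}

/// Models the "second merge stage" option: aggregate once more across
/// the concatenated split outputs.
fn read_with_second_stage(splits: &[Vec<(i32, i64)>]) -> Vec<(i32, i64)> {
    aggregate_split(&read_per_split(splits))
}

fn main() {
    // Three commits for the same key, each landing in its own split,
    // as in the reproduction with 1-byte split sizes.
    let splits = vec![vec![(1, 10)], vec![(1, 20)], vec![(1, 30)]];

    // prints [(1, 10), (1, 20), (1, 30)]
    println!("{:?}", read_per_split(&splits));

    // prints [(1, 60)]
    println!("{:?}", read_with_second_stage(&splits));
}
```

Re-aggregating partial results like this is only safe here because a sum is
associative and commutative. Order-sensitive merge functions (and probably
partial-update) would likely need sequence numbers carried through to the second
stage, which is an argument for the split-planning option instead.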