Re: [PR] [python] support chunk shuffle for planning and 3-layer shuffle for pytorch Dataset [paimon]

via GitHub Wed, 03 Jun 2026 20:29:14 -0700


discivigour commented on PR #8064:
URL: https://github.com/apache/paimon/pull/8064#issuecomment-4618681632


   Thanks for the contribution! I think the data-evolution (DE) chunk shuffle 
path should preserve the existing ordering by `max_sequence_number`.
   
   `DataEvolutionSplitGenerator` sorts files by `(first_row_id, is_blob, 
-max_seq)`, so for files in the same row-id group, newer data files are ordered 
before older ones. This order matters because `DataEvolutionSplitRead` assigns 
each requested field to the first matching file/bunch (`row_offsets[j] == -1`), 
so changing the order can change which version of a field is read.
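
   To make the ordering dependency concrete, here is a minimal sketch of first-match field assignment. It is not the actual `DataEvolutionSplitRead` code: `FileMeta`, `assign_fields`, and the file names are hypothetical stand-ins, and only the "first matching file wins" rule is taken from the behavior described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class FileMeta:
    # Hypothetical stand-in for a data file entry; not the real Paimon class.
    file_name: str
    max_sequence_number: int
    fields: Sequence[str]


def assign_fields(files: List[FileMeta], requested: Sequence[str]) -> Dict[str, str]:
    """Give each requested field to the first file in `files` that contains it."""
    assigned: Dict[str, str] = {}
    for f in files:
        for field in requested:
            # A field that is already assigned is never reassigned,
            # so the position of a file in `files` decides which version wins.
            if field not in assigned and field in f.fields:
                assigned[field] = f.file_name
    return assigned


# Two files covering the same row-id range; both contain `score`.
older = FileMeta("a-old.parquet", max_sequence_number=1, fields=("id", "score"))
newer = FileMeta("b-new.parquet", max_sequence_number=2, fields=("score",))

newest_first = assign_fields([newer, older], ["id", "score"])
name_order = assign_fields([older, newer], ["id", "score"])
```

   With the newest file first, `score` is read from `b-new.parquet`; with file-name order, it is read from `a-old.parquet`.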
   
   However, `DataEvolutionChunkShuffleSplitGenerator._sort_key()` currently 
sorts by `(partition, bucket, first_row_id, is_special, file_name)` and drops 
`-max_sequence_number`. If a data-evolution table has multiple normal data 
files for the same row-id range containing the same field, chunk shuffle may 
read the older file first depending on file name ordering.
   
   Could we keep the same sequence ordering as the normal DE split generator, 
e.g. include `-entry.file.max_sequence_number` before `file_name`, and add a 
regression test covering multiple files in the same row-id range with 
overlapping fields?
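
   A rough sketch of what I have in mind, together with the shape of the regression check. The `Entry` and `DataFile` classes below are simplified, hypothetical stand-ins, and the real `_sort_key()` signature may differ; only the tuple layout and `entry.file.max_sequence_number` come from the discussion above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataFile:
    # Simplified, hypothetical model of a data file's metadata.
    file_name: str
    first_row_id: int
    max_sequence_number: int
    is_special: bool = False


@dataclass(frozen=True)
class Entry:
    # Simplified, hypothetical model of a manifest entry.
    partition: Tuple[str, ...]
    bucket: int
    file: DataFile


def sort_key(entry: Entry):
    f = entry.file
    return (
        entry.partition,
        entry.bucket,
        f.first_row_id,
        f.is_special,
        -f.max_sequence_number,  # newer files first within a row-id group
        f.file_name,             # deterministic tie-breaker only
    )


# Two files share first_row_id=0; the older one sorts first by name alone.
entries = [
    Entry(("p0",), 0, DataFile("a-old.parquet", first_row_id=0, max_sequence_number=1)),
    Entry(("p0",), 0, DataFile("b-new.parquet", first_row_id=0, max_sequence_number=2)),
    Entry(("p0",), 0, DataFile("c.parquet", first_row_id=100, max_sequence_number=1)),
]
ordered = [e.file.file_name for e in sorted(entries, key=sort_key)]
```

   Here `ordered` places `b-new.parquet` ahead of `a-old.parquet`, even though file-name order alone would put the older file first.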


