discivigour commented on PR #8064: URL: https://github.com/apache/paimon/pull/8064#issuecomment-4618681632
Thanks for the contribution! I think the DE chunk shuffle path should preserve the existing ordering by `max_sequence_number`.

`DataEvolutionSplitGenerator` sorts files by `(first_row_id, is_blob, -max_seq)`, so for files in the same row-id group, newer data files are ordered before older ones. This order matters because `DataEvolutionSplitRead` assigns each requested field to the first matching file/bunch (`row_offsets[j] == -1`), so changing the order can change which version of a field is read.

However, `DataEvolutionChunkShuffleSplitGenerator._sort_key()` currently sorts by `(partition, bucket, first_row_id, is_special, file_name)` and drops `-max_sequence_number`. If a data-evolution table has multiple normal data files for the same row-id range containing the same field, chunk shuffle may read the older file first, depending on file name ordering.

Could we keep the same sequence ordering as the normal DE split generator, e.g. include `-entry.file.max_sequence_number` before `file_name`, and add a regression test covering multiple files in the same row-id range with overlapping fields?
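To illustrate the concern, here is a minimal, self-contained sketch. It is not the actual Paimon code: `FileMeta`, `key_without_seq`, `key_with_seq`, and `assign_fields` are hypothetical stand-ins, the partition, bucket, and `is_special` components of the real key are omitted, and the file names are made up. It only shows how first-match field assignment can pick a different file once `-max_sequence_number` is dropped from the sort key.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for a data file entry; not the real Paimon class."""
    file_name: str
    first_row_id: int
    is_blob: bool
    max_sequence_number: int
    fields: Tuple[str, ...]


def key_without_seq(f: FileMeta):
    # Simplified version of a key that falls back to file name ordering.
    return (f.first_row_id, f.is_blob, f.file_name)


def key_with_seq(f: FileMeta):
    # Same key, but newer files (higher sequence number) sort first
    # within a row-id group; file name is only a tie-breaker.
    return (f.first_row_id, f.is_blob, -f.max_sequence_number, f.file_name)


def assign_fields(files: List[FileMeta], requested: List[str]) -> Dict[str, str]:
    """First-match assignment: each field is taken from the first file that has it."""
    chosen: Dict[str, str] = {}
    for name in requested:
        for f in files:
            if name in f.fields:
                chosen[name] = f.file_name
                break
    return chosen


# Two normal data files covering the same row-id range, both containing "score".
files = [
    FileMeta("a-old.parquet", 0, False, 1, ("id", "score")),
    FileMeta("b-new.parquet", 0, False, 2, ("score",)),
]

by_name = assign_fields(sorted(files, key=key_without_seq), ["id", "score"])
by_seq = assign_fields(sorted(files, key=key_with_seq), ["id", "score"])

print(by_name)  # "score" resolves to the older file
print(by_seq)   # "score" resolves to the newer file
```

A regression test along these lines (two files in one row-id range, overlapping field, older file sorting first by name) would fail with the name-only key and pass once the sequence number is part of the key.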
