steFaiz opened a new pull request, #8010: URL: https://github.com/apache/paimon/pull/8010
### Purpose

#### Background

This PR originates from our internal use cases: when Paimon tables are used as data loaders, training generally needs deterministically shuffled data rather than sequential data. Shuffling the entire dataset for every training run is very expensive, so a common approach is:

1. When loading data into Paimon, perform a global row-level shuffle.
2. During training, use a chunk-level shuffle rather than a row-level shuffle.
3. Optionally, apply a more sophisticated shuffle, e.g. read several chunks and do a row-level shuffle among them.

We can provide 1 & 2. This PR introduces a chunk shuffle for pypaimon. The mechanism is illustrated below:

<img width="700" height="500" alt="image" src="https://github.com/user-attachments/assets/8ec15fd6-eba4-4b98-a0e9-3fbd48cb0205" />

1. Logically divide the data files into chunks. The simplest case:

   ```python
   Chunk 1 {
     file: file1,
     range: [0, 100]
   }
   ```

   This means the chunk is in file1 and covers rows [0, 100].
2. Deterministically shuffle the chunks.
3. Wrap each chunk in a single `SliceSplit`.

#### Usage

The usage is simple:

```python
readBuilder.new_scan().with_chunk_shuffle(chunk_size, seed).with_shard(idx, total_worker)
```

### Tests

Unit tests

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
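
For readers who want a concrete picture of the three steps (divide into chunks, shuffle deterministically, hand chunks to workers), here is a minimal standalone sketch. It is not the PR's implementation: `Chunk`, `plan_chunks`, `shuffle_chunks`, and `shard` are hypothetical names, the row ranges are assumed to be half-open, and the round-robin sharding is only one plausible reading of what `with_shard(idx, total_worker)` does.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk: rows [start, end) of one data file (half-open by assumption)."""
    file: str
    start: int
    end: int


def plan_chunks(file_row_counts: Dict[str, int], chunk_size: int) -> List[Chunk]:
    """Logically divide each file into consecutive chunks of at most chunk_size rows."""
    chunks = []
    for name, rows in file_row_counts.items():
        for start in range(0, rows, chunk_size):
            chunks.append(Chunk(name, start, min(start + chunk_size, rows)))
    return chunks


def shuffle_chunks(chunks: List[Chunk], seed: int) -> List[Chunk]:
    """Return a shuffled copy; the same seed always yields the same order."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def shard(chunks: List[Chunk], idx: int, total_workers: int) -> List[Chunk]:
    """Assumed round-robin assignment of shuffled chunks to worker idx."""
    return chunks[idx::total_workers]


if __name__ == "__main__":
    plan = shuffle_chunks(plan_chunks({"file1": 250, "file2": 100}, 100), seed=42)
    for worker in range(2):
        print(worker, shard(plan, worker, 2))
```

Only chunk metadata is shuffled in this sketch, so the cost grows with the number of chunks rather than the number of rows, and rows inside a chunk are still read sequentially.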
