steFaiz opened a new pull request, #8010:
URL: https://github.com/apache/paimon/pull/8010

   ### Purpose
   #### Background
   This PR originates from our internal use cases: when Paimon tables are used as 
data loaders, training engines always need deterministically shuffled data 
rather than sequentially ordered data.
   Shuffling the entire dataset for every training run is very expensive, so a 
common approach is:
   1. When loading data into Paimon, perform a global row-level shuffle.
   2. During training, adopt a chunk-level shuffle rather than a row-level shuffle.
   3. Use a more sophisticated shuffle, e.g. read several chunks and do a 
row-level shuffle among them.
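
For reference, option 3 could look roughly like the sketch below. It is not part of this PR; `windowed_row_shuffle`, `window`, and the in-memory list-of-lists representation of chunks are all hypothetical names chosen for illustration.

```python
import random


def windowed_row_shuffle(chunks, window, seed):
    """Yield rows, shuffling rows within each group of `window` consecutive chunks.

    `chunks` is assumed to be a list of row lists (a stand-in for real chunk reads).
    """
    rng = random.Random(seed)  # seeded RNG so the output order is reproducible
    for i in range(0, len(chunks), window):
        # Gather all rows from the next `window` chunks, then shuffle them together.
        rows = [row for chunk in chunks[i:i + window] for row in chunk]
        rng.shuffle(rows)
        yield from rows


chunks = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
rows = list(windowed_row_shuffle(chunks, window=2, seed=0))
```

Rows only move within their window, so memory use is bounded by `window` chunks while the order is still reproducible for a given seed.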
   
   We can provide 1 & 2. This PR introduces a chunk shuffle for pypaimon. The 
mechanism is illustrated below:
   
   <img width="700" height="500" alt="image" 
src="https://github.com/user-attachments/assets/8ec15fd6-eba4-4b98-a0e9-3fbd48cb0205" />
   
   1. We logically divide data files into chunks. The simplest case:
      ```python
      # Conceptual layout of one chunk (illustrative only)
      chunk_1 = {
          "file": "file1",
          "range": [0, 100],
      }
      ```
      This means the chunk is located in file1 and covers rows [0, 100].
   2. Deterministically shuffle the chunks.
   3. Wrap each chunk in a single `SliceSplit`.
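
The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Chunk`, `plan_chunks`, and `shuffle_chunks` are hypothetical names, the row range is modelled as half-open `[start, end)`, and the real implementation wraps each chunk in a `SliceSplit` rather than a dataclass.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk descriptor: rows [start, end) within one file."""
    file: str
    start: int
    end: int


def plan_chunks(file_row_counts, chunk_size):
    """Split each file into consecutive row ranges of at most chunk_size rows."""
    chunks = []
    for file, row_count in file_row_counts:
        for start in range(0, row_count, chunk_size):
            chunks.append(Chunk(file, start, min(start + chunk_size, row_count)))
    return chunks


def shuffle_chunks(chunks, seed):
    """Return a shuffled copy whose order depends only on the input and the seed."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Assumed example input: (file name, row count) pairs.
files = [("file1", 250), ("file2", 100)]
chunks = plan_chunks(files, chunk_size=100)
order_a = shuffle_chunks(chunks, seed=42)
order_b = shuffle_chunks(chunks, seed=42)
```

Because the shuffle uses a dedicated `random.Random(seed)` instance, `order_a` and `order_b` are identical, which is what makes the chunk order reproducible across runs and workers.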
   
   #### Usage
   The usage is simple:
   ```python
   scan = (readBuilder.new_scan()
           .with_chunk_shuffle(chunk_size, seed)
           .with_shard(idx, total_worker))
   ```
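
To show how a shared seed and sharding could combine, here is a small standalone sketch. It assumes that every worker shuffles the same chunk list with the same seed and then takes every `total_worker`-th chunk starting at `idx`; the actual assignment strategy used by `with_shard` may differ, and `shard_chunks` is a hypothetical helper.

```python
import random


def shard_chunks(num_chunks, seed, idx, total_worker):
    """Shuffle chunk indices with a shared seed, then keep every total_worker-th one."""
    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)  # identical on every worker for the same seed
    return order[idx::total_worker]     # assumed round-robin assignment


num_chunks, seed, total_worker = 10, 7, 3
per_worker = [shard_chunks(num_chunks, seed, i, total_worker)
              for i in range(total_worker)]
```

Under this assumption the workers receive disjoint sets of chunks that together cover the whole table, without any coordination beyond agreeing on `seed`.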
   
   ### Tests
   Unit tests


