steFaiz commented on PR #8010:
URL: https://github.com/apache/paimon/pull/8010#issuecomment-4562771348

   >  Is there any other system with a similar design for API hierarchy?
   
   Yes. The closest precedent I found is **Petastorm**, an ML data reader for 
Parquet datasets. Its reader API exposes shuffle and distributed sharding at 
the same reader-construction level: seed, shuffle_row_groups, shuffle_rows, 
cur_shard, and shard_count. The main difference is that Petastorm shuffles 
existing Parquet row groups, while this PR derives logical row-count chunks 
from Paimon manifest entries/files and then converts them back to Splits.
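
   To make the comparison concrete, here is a minimal sketch of planning row-count chunks from per-file row counts, then applying a seeded shuffle and a shard split. It is not the PR's implementation and not Petastorm's code: `Chunk`, `plan_chunks`, and `shard_chunks` are hypothetical names, and only the parameter names `seed`, `cur_shard`, and `shard_count` are borrowed from the Petastorm options listed above.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    """A logical slice of one data file: rows [start, start + count)."""
    file_name: str
    start: int
    count: int


def plan_chunks(file_row_counts: List[Tuple[str, int]], chunk_rows: int) -> List[Chunk]:
    """Cut each file into consecutive chunks of at most `chunk_rows` rows."""
    chunks = []
    for file_name, rows in file_row_counts:
        for start in range(0, rows, chunk_rows):
            chunks.append(Chunk(file_name, start, min(chunk_rows, rows - start)))
    return chunks


def shard_chunks(chunks: List[Chunk], seed: int, cur_shard: int, shard_count: int) -> List[Chunk]:
    """Shuffle deterministically by `seed`, then keep every shard_count-th chunk."""
    order = list(chunks)
    random.Random(seed).shuffle(order)  # same seed gives the same order on every worker
    return order[cur_shard::shard_count]


# Hypothetical input: (file name, row count) pairs as they might be read from manifest entries.
files = [("data-0.parquet", 250), ("data-1.parquet", 100)]
all_chunks = plan_chunks(files, chunk_rows=100)
shard0 = shard_chunks(all_chunks, seed=42, cur_shard=0, shard_count=2)
shard1 = shard_chunks(all_chunks, seed=42, cur_shard=1, shard_count=2)
```

   Because every worker shuffles with the same seed before slicing, the shards are disjoint and together cover every chunk, without any coordination between workers and without rewriting the files.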
   
   Similar API hierarchies also exist in ML input systems such as Hugging Face 
IterableDataset, NVIDIA DALI readers, WebDataset, Mosaic Streaming, Ray Data, 
TensorFlow tf.data, and PyTorch/TorchData. They commonly expose deterministic 
shuffle options and distributed shard/rank options in the 
dataset/reader/input-pipeline layer rather than rewriting the physical dataset.
   
   I think the main advantages of Paimon are:
   1. In our current implementation, each chunk is a single file that holds 
both the blobs and the structured columns (metadata). For a dataset of 
500,000,000 images with a chunk size of 100, that is 5 million files. In 
Paimon, with the target blob size set to 1G, the total file count is less 
than 30,000.
   2. The table architecture makes the data much easier to manage.
   3. We can easily handle multiple datasets by importing each dataset into 
its own partition.
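
   The file-count arithmetic in point 1 can be checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes "1G" means 1 GiB and that blob files are filled close to the target size, and the implied average image size is derived from the numbers above, not measured.

```python
images = 500_000_000
chunk_size = 100

# One file per chunk in the current layout.
chunk_files = images // chunk_size

# Assumption: "1G" target blob size means 1 GiB.
target_blob_bytes = 1 << 30
paimon_files_max = 30_000  # upper bound quoted above

# If every blob file is near the target size, the bound implies at most this much data,
# and therefore at most this average size per image.
implied_total_bytes = paimon_files_max * target_blob_bytes
implied_avg_image_bytes = implied_total_bytes / images

# How many times fewer files the table layout needs, at minimum.
reduction = chunk_files / paimon_files_max

print(f"chunk files: {chunk_files:,}")
print(f"implied average image size: {implied_avg_image_bytes / 1000:.1f} kB")
print(f"file-count reduction: at least {reduction:.0f}x")
```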

