steFaiz commented on PR #8010: URL: https://github.com/apache/paimon/pull/8010#issuecomment-4562771348
> Is there any other system with a similar design for API hierarchy?

Yes. The closest precedent I found is **Petastorm**, an ML data reader for Parquet datasets. Its reader API exposes shuffle and distributed sharding at the same reader-construction level: `seed`, `shuffle_row_groups`, `shuffle_rows`, `cur_shard`, and `shard_count`. The main difference is that Petastorm shuffles existing Parquet row groups, while this PR derives logical row-count chunks from Paimon manifest entries/files and then converts them back to Splits.

Similar API hierarchies also exist in ML input systems such as Hugging Face IterableDataset, NVIDIA DALI readers, WebDataset, Mosaic Streaming, Ray Data, TensorFlow tf.data, and PyTorch/TorchData. They commonly expose deterministic shuffle options and distributed shard/rank options in the dataset/reader/input-pipeline layer rather than rewriting the physical dataset.

I think the main advantages of Paimon are:

1. In our current implementation, each chunk is a single file that holds both the blobs and the structured columns (metadata). For a dataset of 500,000,000 images with a chunk size of 100, that comes to 5 million files. In Paimon, with the target blob size set to 1G, the total number of files is less than 30,000.
2. The table architecture makes the data much easier to manage.
3. We could easily handle multiple datasets by importing each dataset into its own partition.
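To make the comparison concrete, here is a minimal sketch of the reader-level pattern described above: derive logical row-count chunks from per-file row counts, shuffle them with a seed, and hand each rank a disjoint slice. It is plain Python and does not call Paimon or Petastorm. The names `derive_chunks` and `shard_chunks`, the `Chunk` tuple layout, and the sample file names are hypothetical; only the parameter names `seed`, `cur_shard`, and `shard_count` are borrowed from the Petastorm options mentioned above. The last lines simply restate the file-count arithmetic from point 1.

```python
import random
from typing import List, Tuple

# Hypothetical logical chunk: (file_name, start_row, row_count).
Chunk = Tuple[str, int, int]


def derive_chunks(file_row_counts: List[Tuple[str, int]], chunk_rows: int) -> List[Chunk]:
    """Cut each file into logical chunks of at most `chunk_rows` rows."""
    chunks: List[Chunk] = []
    for name, rows in file_row_counts:
        for start in range(0, rows, chunk_rows):
            chunks.append((name, start, min(chunk_rows, rows - start)))
    return chunks


def shard_chunks(chunks: List[Chunk], seed: int, cur_shard: int, shard_count: int) -> List[Chunk]:
    """Shuffle deterministically with `seed`, then keep every
    `shard_count`-th chunk starting at position `cur_shard`."""
    order = list(chunks)                  # leave the caller's list untouched
    random.Random(seed).shuffle(order)    # same seed gives the same order on every rank
    return order[cur_shard::shard_count]


# Made-up row counts standing in for what manifest entries would report.
files = [("data-0", 250), ("data-1", 120)]
chunks = derive_chunks(files, chunk_rows=100)
shards = [shard_chunks(chunks, seed=42, cur_shard=r, shard_count=2) for r in range(2)]

# File-count arithmetic from point 1: one file per 100-image chunk.
images, chunk_size = 500_000_000, 100
files_one_per_chunk = images // chunk_size  # 5,000,000 files
```

Because every rank shuffles the same chunk list with the same seed, the slices are disjoint and together cover every chunk, without rewriting any physical file. A real implementation would still need to convert the selected chunks back into Splits, which this sketch leaves out.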
