alamb commented on PR #9450: URL: https://github.com/apache/arrow-rs/pull/9450#issuecomment-4098120791
> > > BTW, using this feature anyone could implement a "parquet page store", storing only unique parquet pages and some metadata to reassemble the parquet files.
> >
> > Is this easy to show? I realize this is an important use case for Hugging Face, but it would be nice to have some example of how this could be used by others that are not using the xet filesystem.
>
> I have been thinking of a page store prototype for a while, actually. It would roughly look like:
>
> 1. iterate over the parquet pages using a page reader
> 2. use a hash function to assign a unique key to each page based on its content, such as xxhash, sha, or blake (this is different from the gearhash approach, since the chunking is already done by the parquet writer)
> 3. write the pages out to a hashtable-like storage system, such as a KV store or an object store, depending on the use case
> 4. maintain the metadata necessary to reassemble the original parquet file from the stored pages
>
> A format-agnostic CAS is different, since it does the chunking on the byte stream directly. I have a naive and very simple implementation of that here: https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs

I filed a ticket to track this idea so it doesn't get lost on an old PR: https://github.com/apache/arrow-rs/issues/9592
