kszucs commented on PR #9450:
URL: https://github.com/apache/arrow-rs/pull/9450#issuecomment-4085525577

   > > BTW using this feature anyone could implement a "parquet page store" 
storing only unique parquet pages and some metadata to reassemble the parquet 
files.
   > 
   > Is this easy to show? I realize this is an important use case for Hugging 
Face, but it would be nice to have some example of how this could be used by 
others that are not using the xet filesystem
   
   I have actually been thinking about a page store prototype for a while; it 
would roughly look like this:
   1. iterate over the parquet pages using a page reader
   2. use a content hash function such as xxhash, SHA, or BLAKE to assign a 
unique key to each page based on its content (this is different from gearhash, 
since the chunking is already done by the parquet writer)
   3. write each page out to a hashtable-like storage system, e.g. a KV store 
or an object store; the right choice really depends on the use case
   4. maintain the metadata necessary to reassemble the original parquet file 
from the stored pages
   
   A format-agnostic CAS is different, since it does the chunking directly on 
the byte stream. I have a naive and very simple implementation of that here: 
https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs

