alamb opened a new issue, #9592:
URL: https://github.com/apache/arrow-rs/issues/9592

   @kszucs has added support for Content Addressable Chunking in 
https://github.com/apache/arrow-rs/pull/9450 ❤️ 
   
   This is an important feature for Hugging Face's xet filesystem, which 
automatically deduplicates multiple copies of the same data. 
   
   However, I think the use case is much broader and would apply to many 
users, for example those who "compact" data stored in parquet files on an 
object store. The compacted versions often share a substantial number of 
similar bytes / pages, but as far as I know they typically aren't deduplicated 
today.
   
   To help others understand how to take advantage of this feature, I think it 
would help to have a simple working example showing how to build such a 
Content Addressable Storage system.
   
   
   > > BTW using this feature anyone could implement a "parquet page store" 
storing only unique parquet pages and some metadata to reassemble the parquet 
files.
   > 
   > Is this easy to show? I realize this is an important usecase for hugging 
face, but it would be nice to have some example how this could be used by 
others that are not using the xet filesystem
   
   I have been thinking of a page store prototype for a while actually, that 
would kinda look like:
   1. iterate over the parquet pages using a page reader
   2. use a hash function to assign a unique key to the page based on its 
content, like xxhash, sha, blake (this is different from the gearhash since 
chunking is already done by the parquet writer)
   3. write out the page to a hashtable-like storage system such as a kv store 
or an object store, though the choice really depends on the use case
   4. maintain the necessary metadata to reassemble the original parquet file 
from the stored pages
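   
   As a rough illustration of steps 2 to 4, here is a minimal in-memory sketch 
that uses only the Rust standard library. It is not the parquet crate's API: 
`PageStore`, `page_key`, `put_file` and `get_file` are hypothetical names, pages 
are treated as opaque byte slices (step 1, the page reader, is left out), and 
`DefaultHasher` stands in for a collision-resistant hash such as BLAKE3 or 
SHA-256. A real store would also need to keep the footer and page offsets in 
its metadata.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical content key. Assumption: a 64-bit std hash is enough for a
/// demo; a real store would use a collision-resistant hash instead.
type PageKey = u64;

/// Step 2: derive a key from the page content only.
fn page_key(page: &[u8]) -> PageKey {
    let mut hasher = DefaultHasher::new();
    page.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for a kv store / object store (step 3) plus the
/// per-file metadata needed for reassembly (step 4).
#[derive(Default)]
struct PageStore {
    /// Unique pages, keyed by content hash.
    pages: HashMap<PageKey, Vec<u8>>,
    /// For each file name, the ordered list of page keys.
    manifests: HashMap<String, Vec<PageKey>>,
}

impl PageStore {
    /// Store every page once and record the order of keys for this file.
    fn put_file(&mut self, name: &str, pages: &[Vec<u8>]) {
        let mut keys = Vec::with_capacity(pages.len());
        for page in pages {
            let key = page_key(page);
            // Only the first copy of a page is kept.
            self.pages.entry(key).or_insert_with(|| page.clone());
            keys.push(key);
        }
        self.manifests.insert(name.to_string(), keys);
    }

    /// Concatenate the stored pages of a file in their original order.
    fn get_file(&self, name: &str) -> Option<Vec<u8>> {
        let keys = self.manifests.get(name)?;
        let mut out = Vec::new();
        for key in keys {
            out.extend_from_slice(self.pages.get(key)?);
        }
        Some(out)
    }

    /// Number of distinct pages held by the store.
    fn unique_pages(&self) -> usize {
        self.pages.len()
    }
}
```

   Storing two files that share most of their pages through `put_file` should 
leave `unique_pages()` well below the total page count, while `get_file` still 
returns each file's pages in order.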
   
   A format agnostic CAS is different since it does the chunking on the byte 
stream directly. I have a naive and very simple implementation for that here 
https://github.com/huggingface/dataset-dedupe-estimator/blob/main/src/store.rs
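   
   For contrast, a format-agnostic chunker might look roughly like the sketch 
below. This is not the implementation linked above, just a simplified 
gear-style rolling hash written with the Rust standard library. The function 
names, the splitmix64-generated gear table and the parameters `mask_bits`, 
`min_size` and `max_size` are all assumptions made for illustration. Each 
resulting chunk could then be keyed by a content hash and stored as in steps 2 
to 4.

```rust
/// Build a 256-entry table of pseudo-random values, one per byte value.
/// Assumption: splitmix64 with a fixed seed; real implementations usually
/// ship a precomputed table.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Return the end offset of every chunk in `data`.
///
/// A boundary is placed where the top `mask_bits` bits of the rolling hash
/// are all zero (once the chunk has at least `min_size` bytes), or when the
/// chunk reaches `max_size` bytes.
fn chunk_boundaries(
    data: &[u8],
    mask_bits: u32,
    min_size: usize,
    max_size: usize,
) -> Vec<usize> {
    assert!((1..=64).contains(&mask_bits));
    assert!(min_size <= max_size && max_size > 0);
    let gear = gear_table();
    // The top bits of the hash depend on the most recent 64 bytes.
    let mask = !0u64 << (64 - mask_bits);
    let mut cuts = Vec::new();
    let mut hash: u64 = 0;
    let mut start = 0;
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let len = i + 1 - start;
        if (len >= min_size && hash & mask == 0) || len >= max_size {
            cuts.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    // Whatever is left over becomes the final chunk.
    if start < data.len() {
        cuts.push(data.len());
    }
    cuts
}
```

   Because boundaries depend on the bytes themselves rather than on fixed 
offsets, an insertion near the start of a stream tends to shift only nearby 
boundaries, which is what lets later chunks still deduplicate.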
   
   _Originally posted by @kszucs in 
https://github.com/apache/arrow-rs/issues/9450#issuecomment-4085525577_
               

