> I think the idea is quite neat -- as I understand your PR basically
> implements a change to the parquet writer that can efficiently detect
> duplication in the data and thus avoid storing it multiple times. Thank you
> for sharing it
I might be misunderstanding (I only looked at the code briefly), but I think
this only does part of that. It attempts to write out pages (maybe row
groups) in such a way that identical data gets written consistently to its
own page/column chunk; it is then up to a different system to actually do
the deduping? This seems useful from the archival/incremental perspective,
as described in the linked mailing list thread and Hugging Face's blog post.

From an implementation standpoint in Parquet C++, I wonder if it pays off
(or is possible) to generalize the concept a little bit further and have a
generic interface for chunking?

As a future direction, it would be interesting to consider whether this
concept could be used to write less repetitive data inside a Parquet file
(at a column chunk level I don't think this would require a format change;
at a page level it seems like it would).

Nice work Krisztián!

Cheers,
Micah

On Sat, Feb 1, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I think the idea is quite neat -- as I understand your PR basically
> implements a change to the parquet writer that can efficiently detect
> duplication in the data and thus avoid storing it multiple times. Thank you
> for sharing it
>
> One comment I have is that I found that the name "Content Defined Chunking",
> while technically accurate, obscures what this feature is. CDC seems to
> describe an implementation detail in my mind. Perhaps it would be better to
> describe the feature by its use case: "auto deduplication",
> "update-optimized files", or something else like that.
>
> I also had a hard time mapping the description in [7] to my data. I didn't
> look at the code, but I didn't understand what an "edit" meant in this
> context (like, was the idea that the program updates a value logically "in
> place" in the encoded values?). I think if you made what was happening
> clearer from the data model perspective, it might be easier for people to
> understand the potential benefits.
>
> Andrew
>
> [7]: https://github.com/kszucs/de
>
>
> On Tue, Jan 28, 2025 at 1:05 PM Krisztián Szűcs <szucs.kriszt...@gmail.com>
> wrote:
>
> > Dear Community,
> >
> > I would like to share recent developments on applying Content Defined
> > Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides
> > data into variable-sized chunks based on the content of the data itself,
> > rather than fixed-size boundaries. This makes it effective for
> > deduplication in content-addressable storage systems, like Hugging Face
> > Hub [3] or restic [4]. There was an earlier discussion [5] on the Parquet
> > mailing list about this feature; this is a follow-up on the progress made
> > since then.
> >
> > Generally speaking, CDC is more suitable for deduplicating uncompressed
> > row-major data. However, the Parquet format's unique features enable us
> > to apply content-defined chunking effectively on Parquet files as well.
> > Luckily, only the writer needs to be aware of the chunking; the reader
> > can still read the file as a regular Parquet file, and no Parquet format
> > changes are required.
> >
> > One practical example is storing & serving multiple revisions of a
> > Parquet file, including appends/insertions/deletions/updates:
> > - Vanilla Parquet (Snappy): The total size of all revisions is 182.6 GiB,
> > and the content-addressable storage requires 148.0 GiB. While the storage
> > is able to identify some common chunks in the Parquet files, the
> > deduplication ratio is fairly low.
> > - Parquet with CDC (Snappy): The total size is 178.3 GiB, and the storage
> > requirement is reduced to 75.6 GiB. The Parquet files are written with
> > content-defined chunking, hence the deduplication is greatly improved.
> > - Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage
> > requirement is reduced to 55.9 GiB, showing that the deduplication ratio
> > is greatly improved for both Snappy- and ZSTD-compressed Parquet files.
> >
> > I created a draft implementation [6] for this feature in Parquet C++ and
> > PyArrow, and an evaluation tool [7] to (1) better understand the actual
> > changes in the Parquet files and (2) evaluate the deduplication efficiency
> > of various Parquet datasets. You can find more details and results in the
> > evaluation tool's repository [7].
> >
> > I think this feature could be very useful for other projects as well, so
> > I am eager to hear the community's feedback.
> >
> > Cross-posting to the Apache Arrow mailing list for better visibility,
> > though please reply to the Apache Parquet mailing list.
> >
> > Regards,
> > Krisztian
> >
> > [1]: https://joshleeb.com/posts/content-defined-chunking.html
> > [2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
> > [3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
> > [4]: https://restic.net
> > [5]: https://lists.apache.org/list?d...@parquet.apache.org:2024-10:dedupe
> > [6]: https://github.com/apache/arrow/pull/45360
> > [7]: https://github.com/kszucs/de
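
For readers unfamiliar with the technique, here is a minimal sketch of
content-defined chunking with a gear rolling hash, loosely following the
FastCDC approach referenced in [2]. It is an illustration only, not the
chunker used in the Parquet C++ draft [6]; the seed, mask, and size limits
are arbitrary choices made for the example.

    # Minimal content-defined chunking sketch (gear rolling hash, FastCDC-style).
    # Illustrative only; constants and structure are not taken from the draft PR [6].
    import random

    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # one random value per byte

    U64 = (1 << 64) - 1
    MASK = (1 << 16) - 1                  # a boundary every ~64 KiB on average
    MIN_SIZE, MAX_SIZE = 16_384, 262_144  # bounds on chunk length

    def chunk(data: bytes):
        """Yield (offset, length) pairs whose boundaries depend only on content."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & U64   # roll the hash over the byte stream
            length = i - start + 1
            if length < MIN_SIZE:
                continue
            # Cut when the hash matches the mask, or the chunk grows too large.
            if (h & MASK) == 0 or length >= MAX_SIZE:
                yield start, length
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data) - start  # trailing chunk

Because boundaries are derived from the bytes themselves, an insertion or
deletion only shifts the chunks near the edit; downstream boundaries
realign, so unchanged chunks keep the same hashes and can be deduplicated
by a content-addressable store such as [3] or [4].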
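
Separately, a hedged sketch of how the feature might be enabled from
PyArrow once the draft in [6] lands. The option name
`use_content_defined_chunking` is taken from the draft and may change, so
treat it as an assumption rather than a settled API.

    # Hypothetical usage sketch; the option name reflects the draft PR [6]
    # and may differ in the released API.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": list(range(1_000_000)),
        "value": [str(i) for i in range(1_000_000)],
    })

    # Write with content-defined chunking: page boundaries follow the content
    # rather than fixed row counts, so unchanged regions of a revised table
    # should produce identical pages. Readers see an ordinary Parquet file.
    pq.write_table(
        table,
        "data.parquet",
        compression="zstd",
        use_content_defined_chunking=True,  # assumed option, see [6]
    )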