Hi Micah,

Sorry for the late response, I didn’t notice your reply.
> On 2025. Feb 3., at 8:46, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> I think the idea is quite neat -- as I understand your PR basically
>> implements a change to the parquet writer that can efficiently detect
>> duplication in the data and thus avoid storing it multiple times. Thank
>> you for sharing it.
>
> I might be misunderstanding (only looked at the code briefly), but I think
> this only does part of this. It attempts to write out pages (maybe row
> groups) in such a way that identical data gets written consistently to
> their own page/column chunk; it is then up to a different system to
> actually do the deduping?

Exactly!

> This seems useful for the archival/incremental perspective as described in
> the linked mailing list thread on Hugging Face's blog post.

Yes, especially if someone is looking for a “dumb” way of archiving without
the expensive calculation of the exact difference between two snapshots.

> From an implementation standpoint in Parquet C++, I wonder if it pays (or
> is possible) to maybe generalize the concept a little bit further to have
> a generic interface for chunking?

Yes, I have been thinking about the same. There could be other useful
chunking strategies, like fixed-size chunking to enforce that each column's
pages contain the same number of elements regardless of their size (by
disabling the default splitting based on the maximum page size).

> As a future direction it would be interesting to consider if using this
> concept could be used to write less repetitive data inside a parquet file
> (at a column chunk level I don't think this would require a format change;
> at a page level it seems like it would).

Can you elaborate more on this?

> Nice work Krisztián!

Thanks, standing on others’ shoulders!

Cheers,
Krisztian

> Cheers,
> Micah
>
> On Sat, Feb 1, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> I think the idea is quite neat -- as I understand your PR basically
>> implements a change to the parquet writer that can efficiently detect
>> duplication in the data and thus avoid storing it multiple times. Thank
>> you for sharing it.
>>
>> One comment I have is that I found the name "Content Defined Chunking",
>> while technically accurate, to obscure what this feature is. CDC seems to
>> describe an implementation detail in my mind. Perhaps it would be better
>> to describe the feature with its use case. Perhaps "auto deduplication"
>> or "update optimized files" or something else like that.
>>
>> I also had a hard time mapping the description in [7] to my data. I
>> didn't look at the code, but I didn't understand what an "edit" meant in
>> the context (like, was the idea that the program updates a value
>> logically "in place" in the encoded values?). I think if you made what
>> was happening clearer from the data model perspective, it might be easier
>> for people to understand the potential benefits.
>>
>> Andrew
>>
>> [7]: https://github.com/kszucs/de
>>
>> On Tue, Jan 28, 2025 at 1:05 PM Krisztián Szűcs
>> <szucs.kriszt...@gmail.com> wrote:
>>
>>> Dear Community,
>>>
>>> I would like to share recent developments on applying Content Defined
>>> Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides
>>> data into variable-sized chunks based on the content of the data itself,
>>> rather than fixed-size boundaries. This makes it effective for
>>> deduplication in content addressable storage systems, like Hugging Face
>>> Hub [3] or restic [4].
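
[Annotating my own quoted announcement here for readers new to CDC: below is
a toy gear-hash chunker that only illustrates how content-defined boundaries
are chosen. It is a sketch of the general technique described in [1][2], not
the exact algorithm or parameters used in the PR [6]. A boundary is declared
whenever the low bits of a rolling hash over the most recent bytes are all
zero, so inserting or deleting data only shifts boundaries locally instead of
moving every subsequent fixed-size block.]

    import random

    # One random 64-bit "gear" value per possible byte value.
    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]
    MASK = (1 << 13) - 1      # 13 zero bits -> roughly 8 KiB average chunks
    WORD = (1 << 64) - 1      # keep the rolling hash within 64 bits

    def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 65536):
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & WORD
            size = i - start + 1
            # Cut when the hash "fires" (and the chunk is not too small),
            # or when the chunk would grow beyond the maximum size.
            if (size >= min_size and (h & MASK) == 0) or size >= max_size:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]   # final, possibly short, chunk
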
>>> There was an earlier discussion [5] on the Parquet mailing list about
>>> this feature; this is a follow-up on the progress made since then.
>>>
>>> Generally speaking, CDC is more suitable for deduplicating uncompressed
>>> row-major data. However, Parquet Format's unique features enable us to
>>> apply content-defined chunking effectively on Parquet files as well.
>>> Luckily, only the writer needs to be aware of the chunking; the reader
>>> can still read the file as a regular Parquet file, so no Parquet Format
>>> changes are required.
>>>
>>> One practical example is storing & serving multiple revisions of a
>>> Parquet file, including appends/insertions/deletions/updates:
>>> - Vanilla Parquet (Snappy): The total size of all revisions is 182.6 GiB,
>>>   and the content addressable storage requires 148.0 GiB. While the
>>>   storage is able to identify some common chunks in the parquet files,
>>>   the deduplication is fairly low.
>>> - Parquet with CDC (Snappy): The total size is 178.3 GiB, and the storage
>>>   requirement is reduced to 75.6 GiB. The parquet files are written with
>>>   content-defined chunking, hence the deduplication is greatly improved.
>>> - Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage
>>>   requirement is reduced to 55.9 GiB, showing that the deduplication
>>>   ratio is greatly improved for both Snappy and ZSTD compressed parquet
>>>   files.
>>>
>>> I created a draft implementation [6] for this feature in Parquet C++ and
>>> PyArrow, and an evaluation tool [7] to (1) better understand the actual
>>> changes in the parquet files and (2) evaluate the deduplication
>>> efficiency of various parquet datasets. You can find more details and
>>> results in the evaluation tool's repository [7].
>>>
>>> I think this feature could be very useful for other projects as well, so
>>> I am eager to hear the community's feedback.
>>>
>>> Cross-posting to the Apache Arrow mailing list for better visibility,
>>> though please reply to the Apache Parquet mailing list.
>>>
>>> Regards,
>>> Krisztian
>>>
>>> [1]: https://joshleeb.com/posts/content-defined-chunking.html
>>> [2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
>>> [3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
>>> [4]: https://restic.net
>>> [5]: https://lists.apache.org/list?dev@parquet.apache.org:2024-10:dedupe
>>> [6]: https://github.com/apache/arrow/pull/45360
>>> [7]: https://github.com/kszucs/de
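
P.S. For anyone who would like to experiment with the draft, writing a file
with content-defined chunking from PyArrow could look roughly like the sketch
below. The writer option name shown here is illustrative; the exact parameter
name, accepted values and defaults in the draft PR [6] may differ and may
still change during review.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": list(range(1_000)),
        "value": [str(i) for i in range(1_000)],
    })

    # Hypothetical writer option for the draft PR [6]; the final name and
    # shape of this option may differ once the feature is finalized.
    pq.write_table(
        table,
        "data_cdc.parquet",
        compression="zstd",
        use_content_defined_chunking=True,
    )

Reading does not change at all: pq.read_table("data_cdc.parquet") returns the
table as with any regular Parquet file, since only the page and row group
boundaries chosen by the writer are affected.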