> > As a future direction it would be interesting to consider if using this
> > concept could be used to write less repetitive data inside a parquet file
> > (at a column chunk level I don't think this would require a format
> > change; at a page level it seems like it would).
>
> Can you elaborate more on this?
I was thinking if there were opportunities to dedupe identical data within
the file (and I thought this is what Andrew was alluding to in his e-mail).
It seems, given the current format constraints, this could only be done at a
column chunk level (have two columns point to the same physical data). If we
wanted to do this at the page level it would require more in-depth changes.
This case might not add a lot of value in practice, so it is not urgent.

Thanks,
Micah

On Thu, Feb 6, 2025 at 12:15 PM Krisztián Szűcs <szucs.kriszt...@gmail.com> wrote:

> Hi Micah,
>
> Sorry for the late response, I didn't notice your reply.
>
> > On 2025. Feb 3., at 8:46, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> >> I think the idea is quite neat -- as I understand your PR basically
> >> implements a change to the parquet writer that can efficiently detect
> >> duplication in the data and thus avoid storing it multiple times. Thank
> >> you for sharing it.
> >
> > I might be misunderstanding (only looked at code briefly). But I think
> > this only does part of this. It attempts to write out pages (maybe row
> > groups) in such a way that identical data gets written consistently to
> > their own page/column chunk; it is then up to a different system to
> > actually do the deduping?
>
> Exactly!
>
> > This seems useful for the archival/increment perspective as described in
> > the linked mailing list thread on Hugging Face's blog post.
>
> Yes, especially if someone looks for a "dumb" way of archiving without
> expensive calculation of the exact difference between two snapshots.
>
> > From an implementation standpoint in Parquet C++, I wonder if it pays
> > (or is possible) to maybe generalize the concept a little bit further to
> > have a generic interface for chunking?
>
> Yes, I have been thinking of the same. There could be other useful
> chunking strategies, like fixed-sized chunking to enforce that each
> column's pages contain the same number of elements regardless of their
> size (by disabling the default max page size based splitting).
>
> > As a future direction it would be interesting to consider if using this
> > concept could be used to write less repetitive data inside a parquet file
> > (at a column chunk level I don't think this would require a format
> > change; at a page level it seems like it would).
>
> Can you elaborate more on this?
>
> > Nice work Krisztián!
>
> Thanks, standing on others' shoulders!
>
> Cheers, Krisztian
>
> > Cheers,
> > Micah
> >
> > On Sat, Feb 1, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> >> I think the idea is quite neat -- as I understand your PR basically
> >> implements a change to the parquet writer that can efficiently detect
> >> duplication in the data and thus avoid storing it multiple times. Thank
> >> you for sharing it.
> >>
> >> One comment I have is that I found the name "Content Defined Chunking",
> >> while technically accurate, to obscure what this feature is. CDC seems
> >> to describe an implementation detail in my mind. Perhaps it would be
> >> better to describe the feature with its use case. Perhaps "auto
> >> deduplication" or "update optimized files" or something else like that.
> >>
> >> I also had a hard time mapping the description in [7] to my data. I
> >> didn't look at the code, but I didn't understand what an "edit" meant in
> >> the context (like was the idea that the program updates a value
> >> logically "in place" in the encoded values?). I think if you made what
> >> was happening clearer from the data model perspective, it might be
> >> easier for people to understand the potential benefits.
> >>
> >> Andrew
> >>
> >> [7]: https://github.com/kszucs/de
> >>
> >> On Tue, Jan 28, 2025 at 1:05 PM Krisztián Szűcs <szucs.kriszt...@gmail.com>
> >> wrote:
> >>
> >>> Dear Community,
> >>>
> >>> I would like to share recent developments on applying Content Defined
> >>> Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides
> >>> data into variable-sized chunks based on the content of the data
> >>> itself, rather than fixed-size boundaries. This makes it effective for
> >>> deduplication in content addressable storage systems, like Hugging Face
> >>> Hub [3] or restic [4]. There was an earlier discussion [5] on the
> >>> Parquet mailing list about this feature; this is a follow-up on the
> >>> progress made since then.
> >>>
> >>> Generally speaking, CDC is more suitable for deduplicating uncompressed
> >>> row-major data. However, Parquet Format's unique features enable us to
> >>> apply content-defined chunking effectively on Parquet files as well.
> >>> Luckily, only the writer needs to be aware of the chunking; the reader
> >>> can still read the file as a regular Parquet file, no Parquet Format
> >>> changes are required.
> >>>
> >>> One practical example is storing & serving multiple revisions of a
> >>> Parquet file, including appends/insertions/deletions/updates:
> >>> - Vanilla Parquet (Snappy): The total size of all revisions is 182.6
> >>>   GiB, and the content addressable storage requires 148.0 GiB. While
> >>>   the storage is able to identify some common chunks in the parquet
> >>>   files, the deduplication is fairly low.
> >>> - Parquet with CDC (Snappy): The total size is 178.3 GiB, and the
> >>>   storage requirement is reduced to 75.6 GiB. The parquet files are
> >>>   written with content-defined chunking, hence the deduplication is
> >>>   greatly improved.
> >>> - Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage
> >>>   requirement is reduced to 55.9 GiB, showing that the deduplication
> >>>   ratio is greatly improved for both Snappy and ZSTD compressed parquet
> >>>   files.
> >>>
> >>> I created a draft implementation [6] for this feature in Parquet C++
> >>> and PyArrow and an evaluation tool [7] to (1) better understand the
> >>> actual changes in the parquet files and (2) to evaluate the
> >>> deduplication efficiency of various parquet datasets. You can find more
> >>> details and results in the evaluation tool's repository [7].
> >>>
> >>> I think this feature could be very useful for other projects as well,
> >>> so I am eager to hear the community's feedback.
> >>>
> >>> Cross-posting to the Apache Arrow mailing list for better visibility,
> >>> though please reply to the Apache Parquet mailing list.
> >>>
> >>> Regards, Krisztian
> >>>
> >>> [1]: https://joshleeb.com/posts/content-defined-chunking.html
> >>> [2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
> >>> [3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
> >>> [4]: https://restic.net
> >>> [5]: https://lists.apache.org/list?dev@parquet.apache.org:2024-10:dedupe
> >>> [6]: https://github.com/apache/arrow/pull/45360
> >>> [7]: https://github.com/kszucs/de
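
A note for readers unfamiliar with CDC: the following minimal Python sketch
illustrates the gear-hash style content-defined chunking referenced in [2].
It is only an illustration under assumed parameters (the GEAR table, mask,
and size bounds are arbitrary), not the code from the draft Parquet C++ /
PyArrow implementation [6], which applies the same boundary principle to
column values inside the writer rather than to raw bytes.

# A minimal, self-contained sketch of gear-hash based content-defined
# chunking in the spirit of FastCDC [2]. This is NOT the code from the
# draft implementation [6]; the GEAR table, mask, and size bounds are
# arbitrary choices made for illustration only.
import hashlib
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one 64-bit value per byte

MASK = (1 << 16) - 1   # a boundary roughly every 64 KiB on average
MIN_SIZE = 16 * 1024   # never cut a chunk shorter than 16 KiB
MAX_SIZE = 256 * 1024  # force a cut after 256 KiB

def chunks(data: bytes):
    """Yield (offset, length) pairs whose boundaries depend only on local
    byte content, so an insertion near the start of the stream does not
    shift every later boundary."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            yield start, length
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data) - start

def chunk_digests(data: bytes):
    """Hash each chunk; identical chunks produce identical digests, which is
    what a content addressable store deduplicates on."""
    return [hashlib.sha256(data[o:o + n]).hexdigest() for o, n in chunks(data)]

Comparing the digest lists of two revisions of the same byte stream gives a
rough picture of how much a chunk-based content addressable store could
deduplicate; the evaluation tool [7] performs a much more detailed version of
this kind of measurement on actual Parquet files. In the draft writer [6], as
Micah describes above, the analogous boundary decision picks where pages end,
so unchanged runs of values are re-encoded into byte-identical pages and the
actual deduping is left to the storage system.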