Hi Micah,

Sorry for the late response, I didn’t notice your reply.
> On 2025. Feb 3., at 8:46, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> I think the idea is quite neat -- as I understand your PR basically
>> implements a change to the parquet writer that can efficiently detect
>> duplication in the data and thus avoid storing it multiple times. Thank
>> you for sharing it.
>
> I might be misunderstanding (only looked at the code briefly), but I think
> this only does part of this. It attempts to write out pages (maybe row
> groups) in such a way that identical data gets written consistently to
> their own page/column chunk; it is then up to a different system to
> actually do the deduping?

Exactly!

> This seems useful for the archival/incremental perspective as described in
> the linked mailing list thread on Hugging Face's blog post.

Yes, especially if someone is looking for a “dumb” way of archiving without
the expensive calculation of the exact difference between two snapshots.

> From an implementation standpoint in Parquet C++, I wonder if it pays (or
> is possible) to maybe generalize the concept a little bit further to have
> a generic interface for chunking?

Yes, I have been thinking about the same. There could be other useful
chunking strategies, like fixed-size chunking to enforce that each column's
pages contain the same number of elements regardless of their size (by
disabling the default splitting based on the maximum page size).

> As a future direction it would be interesting to consider if using this
> concept could be used to write less repetitive data inside a parquet file
> (at a column chunk level I don't think this would require a format change;
> at a page level it seems like it would).

Can you elaborate more on this?

> Nice work Krisztián!

Thanks, standing on others’ shoulders!

Cheers,
Krisztian

> Cheers,
> Micah
>
> On Sat, Feb 1, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> I think the idea is quite neat -- as I understand your PR basically
>> implements a change to the parquet writer that can efficiently detect
>> duplication in the data and thus avoid storing it multiple times. Thank
>> you for sharing it.
>>
>> One comment I have is that I found the name "Content Defined Chunking",
>> while technically accurate, to obscure what this feature is. CDC seems to
>> describe an implementation detail in my mind. Perhaps it would be better
>> to describe the feature with its use case. Perhaps "auto deduplication"
>> or "update optimized files" or something else like that.
>>
>> I also had a hard time mapping the description in [7] to my data. I
>> didn't look at the code, but I didn't understand what an "edit" meant in
>> the context (like, was the idea that the program updates a value
>> logically "in place" in the encoded values?). I think if you made what
>> was happening clearer from the data model perspective, it might be easier
>> for people to understand the potential benefits.
>>
>> Andrew
>>
>> [7]: https://github.com/kszucs/de
>>
>> On Tue, Jan 28, 2025 at 1:05 PM Krisztián Szűcs
>> <szucs.kriszt...@gmail.com> wrote:
>>
>>> Dear Community,
>>>
>>> I would like to share recent developments on applying Content Defined
>>> Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides
>>> data into variable-sized chunks based on the content of the data itself,
>>> rather than fixed-size boundaries. This makes it effective for
>>> deduplication in content addressable storage systems, like Hugging Face
>>> Hub [3] or restic [4].
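
[Annotating my own quoted announcement here for readers new to CDC: below is
a toy gear-hash chunker that only illustrates how content-defined boundaries
are chosen. It is a sketch of the general technique described in [1][2], not
the exact algorithm or parameters used in the PR [6]. A boundary is declared
whenever the low bits of a rolling hash over the most recent bytes are all
zero, so inserting or deleting data only shifts boundaries locally instead of
moving every subsequent fixed-size block.]

    import random

    # One random 64-bit "gear" value per possible byte value.
    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]
    MASK = (1 << 13) - 1      # 13 zero bits -> roughly 8 KiB average chunks
    WORD = (1 << 64) - 1      # keep the rolling hash within 64 bits

    def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 65536):
        h, start = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & WORD
            size = i - start + 1
            # Cut when the hash "fires" (and the chunk is not too small),
            # or when the chunk would grow beyond the maximum size.
            if (size >= min_size and (h & MASK) == 0) or size >= max_size:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]   # final, possibly short, chunk
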
>>> There was an earlier discussion [5] on the Parquet mailing list about
>>> this feature; this is a follow-up on the progress made since then.
>>>
>>> Generally speaking, CDC is more suitable for deduplicating uncompressed
>>> row-major data. However, Parquet Format's unique features enable us to
>>> apply content-defined chunking effectively on Parquet files as well.
>>> Luckily, only the writer needs to be aware of the chunking; the reader
>>> can still read the file as a regular Parquet file, so no Parquet Format
>>> changes are required.
>>>
>>> One practical example is storing & serving multiple revisions of a
>>> Parquet file, including appends/insertions/deletions/updates:
>>> - Vanilla Parquet (Snappy): The total size of all revisions is 182.6 GiB,
>>>   and the content addressable storage requires 148.0 GiB. While the
>>>   storage is able to identify some common chunks in the parquet files,
>>>   the deduplication is fairly low.
>>> - Parquet with CDC (Snappy): The total size is 178.3 GiB, and the storage
>>>   requirement is reduced to 75.6 GiB. The parquet files are written with
>>>   content-defined chunking, hence the deduplication is greatly improved.
>>> - Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage
>>>   requirement is reduced to 55.9 GiB, showing that the deduplication
>>>   ratio is greatly improved for both Snappy and ZSTD compressed parquet
>>>   files.
>>>
>>> I created a draft implementation [6] for this feature in Parquet C++ and
>>> PyArrow, and an evaluation tool [7] to (1) better understand the actual
>>> changes in the parquet files and (2) evaluate the deduplication
>>> efficiency of various parquet datasets. You can find more details and
>>> results in the evaluation tool's repository [7].
>>>
>>> I think this feature could be very useful for other projects as well, so
>>> I am eager to hear the community's feedback.
>>>
>>> Cross-posting to the Apache Arrow mailing list for better visibility,
>>> though please reply to the Apache Parquet mailing list.
>>>
>>> Regards,
>>> Krisztian
>>>
>>> [1]: https://joshleeb.com/posts/content-defined-chunking.html
>>> [2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
>>> [3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
>>> [4]: https://restic.net
>>> [5]: https://lists.apache.org/list?dev@parquet.apache.org:2024-10:dedupe
>>> [6]: https://github.com/apache/arrow/pull/45360
>>> [7]: https://github.com/kszucs/de
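
P.S. For anyone who would like to experiment with the draft, writing a file
with content-defined chunking from PyArrow could look roughly like the sketch
below. The writer option name shown here is illustrative; the exact parameter
name, accepted values and defaults in the draft PR [6] may differ and may
still change during review.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": list(range(1_000)),
        "value": [str(i) for i in range(1_000)],
    })

    # Hypothetical writer option for the draft PR [6]; the final name and
    # shape of this option may differ once the feature is finalized.
    pq.write_table(
        table,
        "data_cdc.parquet",
        compression="zstd",
        use_content_defined_chunking=True,
    )

Reading does not change at all: pq.read_table("data_cdc.parquet") returns the
table as with any regular Parquet file, since only the page and row group
boundaries chosen by the writer are affected.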