>
> I think the idea is quite neat -- as I understand your PR basically
> implements a change to the parquet writer that can efficiently detect
> duplication in the data and thus avoid storing it multiple times. Thank you
> for sharing it


I might be misunderstanding (I only looked at the code briefly), but I think
this only does part of that. It attempts to write out pages (and maybe row
groups) in such a way that identical data is consistently written to its own
page/column chunk; it is then up to a different system to actually do the
deduplication?
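
To check my understanding, here is roughly how I picture the boundary
decision working (just my own sketch with made-up constants, not the code
from the PR):

#include <cstddef>
#include <cstdint>
#include <vector>

// Gear-style rolling hash: a chunk boundary is declared whenever the low
// bits of the hash match a mask, so boundaries depend only on the local
// content, not on absolute offsets. Identical runs of values therefore
// produce identical chunks. A real implementation would use a precomputed
// random gear table and enforce minimum/maximum chunk sizes.
std::vector<std::size_t> FindChunkBoundaries(const uint8_t* data,
                                             std::size_t len,
                                             uint64_t mask = (1ULL << 20) - 1) {
  std::vector<std::size_t> boundaries;
  uint64_t hash = 0;
  for (std::size_t i = 0; i < len; ++i) {
    hash = (hash << 1) + (data[i] * 0x9E3779B97F4A7C15ULL);
    if ((hash & mask) == 0) {
      boundaries.push_back(i + 1);  // cut the chunk after this byte
      hash = 0;
    }
  }
  return boundaries;
}

If something like this is applied per column before encoding and a page is
flushed at each boundary, then unchanged runs of values end up in
byte-identical pages, and the deduplication itself is left to the content
addressable store.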

This seems useful from the archival/incremental perspective described in the
linked mailing list thread and in Hugging Face's blog post. From an
implementation standpoint in Parquet C++, I wonder whether it would pay off
(or be possible) to generalize the concept a little further and have a
generic interface for chunking?
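
For example, something along these lines (a purely hypothetical interface,
all names invented by me), so that the CDC rolling-hash policy becomes just
one implementation next to a size-based one:

#include <cstdint>

// Hypothetical sketch: the column writer consults a Chunker to decide
// where to cut pages. A fixed-size policy reproduces roughly today's
// size-based flushing, while a content-defined policy would cut at
// rolling-hash boundaries.
class Chunker {
 public:
  virtual ~Chunker() = default;
  // Called as values are appended; returns true when the current page
  // should be closed.
  virtual bool ShouldFlushPage(int64_t num_values, int64_t num_bytes,
                               const uint8_t* latest_bytes,
                               int64_t latest_len) = 0;
};

class FixedSizeChunker : public Chunker {
 public:
  explicit FixedSizeChunker(int64_t max_bytes) : max_bytes_(max_bytes) {}
  bool ShouldFlushPage(int64_t /*num_values*/, int64_t num_bytes,
                       const uint8_t* /*latest_bytes*/,
                       int64_t /*latest_len*/) override {
    return num_bytes >= max_bytes_;  // flush once the page reaches a size cap
  }

 private:
  int64_t max_bytes_;
};

A ContentDefinedChunker implementing the rolling hash above would then be
just another policy, and other policies could slot in without touching the
writer itself.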

As a future direction, it would be interesting to consider whether this
concept could be used to write less repetitive data inside a single Parquet
file (at the column chunk level I don't think this would require a format
change; at the page level it seems like it would).
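
To illustrate the column chunk level idea, here is a rough sketch of the
bookkeeping I have in mind (not real parquet-cpp code, the names are made
up): hash each serialized column chunk and, if an identical one was already
written, point the new row group's metadata at the existing byte range
instead of writing the bytes again.

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Where a column chunk's bytes live in the file; the footer metadata
// records this explicitly, which is why duplicate chunks could in
// principle share a byte range without a format change.
struct ChunkLocation {
  int64_t offset;
  int64_t length;
};

class ColumnChunkDeduper {
 public:
  // Returns an existing location if an identical chunk was already written,
  // otherwise records the new one. A real implementation would compare the
  // bytes on a hash match instead of trusting the hash alone.
  ChunkLocation WriteOrReuse(const std::string& serialized_chunk,
                             int64_t next_offset) {
    const std::size_t key = std::hash<std::string>{}(serialized_chunk);
    auto it = seen_.find(key);
    if (it != seen_.end()) return it->second;  // reuse the previous bytes
    ChunkLocation loc{next_offset,
                      static_cast<int64_t>(serialized_chunk.size())};
    seen_.emplace(key, loc);
    // ... append serialized_chunk to the file here ...
    return loc;
  }

 private:
  std::unordered_map<std::size_t, ChunkLocation> seen_;
};

Whether existing readers tolerate row groups pointing at overlapping byte
ranges is exactly the part I am unsure about.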

Nice work Krisztián!

Cheers,
Micah

On Sat, Feb 1, 2025 at 8:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I think the idea is quite neat -- as I understand your PR basically
> implements a change to the parquet writer that can efficiently detect
> duplication in the data and thus avoid storing it multiple times. Thank you
> for sharing it
>
> One comment I have is that I found the name "Content Defined Chunking",
> while technically accurate, to obscure what this feature is. CDC seems to
> describe an implementation detail in my mind. Perhaps it would be better to
> describe the feature with its use case. Perhaps "auto deduplication" or
> "update optimized files" or something else like that.
>
> I also had a hard time mapping the description in [7] to my data. I didn't
> look at the code, but I didn't understand what an "edit" meant in the
> context (like was the idea that the program updates a value logically "in
> place" in the encoded values?) I think if you made what was happening
> clearer from the data model perspective, it might be easier for people to
> understand the potential benefits.
>
> Andrew
>
> [7]: https://github.com/kszucs/de
>
>
> On Tue, Jan 28, 2025 at 1:05 PM Krisztián Szűcs <szucs.kriszt...@gmail.com>
> wrote:
>
> > Dear Community,
> >
> > I would like to share recent developments on applying Content Defined
> > Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides
> > data into variable-sized chunks based on the content of the data itself,
> > rather than fixed-size boundaries. This makes it effective for
> > deduplication in content addressable storage systems, like Hugging Face
> > Hub [3] or restic [4]. There was an earlier discussion [5] on the Parquet
> > mailing list about this feature, this is a follow-up on the progress made
> > since then.
> >
> > Generally speaking, CDC is more suitable for deduplicating uncompressed
> > row-major data. However, Parquet Format's unique features enable us to
> > apply content-defined chunking effectively on Parquet files as well.
> > Luckily, only the writer needs to be aware of the chunking, the reader
> > can still read the file as a regular Parquet file, no Parquet Format
> > changes are required.
> >
> > One practical example is storing & serving multiple revisions of a
> > Parquet file, including appends/insertions/deletions/updates:
> > - Vanilla Parquet (Snappy): The total size of all revisions is 182.6 GiB,
> > and the content addressable storage requires 148.0 GiB. While the storage
> > is able to identify some common chunks in the parquet files, the
> > deduplication is fairly low.
> > - Parquet with CDC (Snappy): The total size is 178.3 GiB, and the storage
> > requirement is reduced to 75.6 GiB. The parquet files are written with
> > content-defined chunking, hence the deduplication is greatly improved.
> > - Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage
> > requirement is reduced to 55.9 GiB showing that the deduplication ratio
> > is greatly improved for both Snappy and ZSTD compressed parquet files.
> >
> > I created a draft implementation [6] for this feature in Parquet C++ and
> > PyArrow and an evaluation tool [7] to (1) better understand the actual
> > changes in the parquet files and (2) to evaluate the deduplication
> > efficiency of various parquet datasets. You can find more details and
> > results in the evaluation tool's repository [7].
> >
> > I think this feature could be very useful for other projects as well,
> > so I am eager to hear the community's feedback.
> >
> > Cross-posting to the Apache Arrow mailing list for better visibility,
> > though please reply to the Apache Parquet mailing list.
> >
> > Regards, Krisztian
> >
> > [1]: https://joshleeb.com/posts/content-defined-chunking.html
> > [2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
> > [3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
> > [4]: https://restic.net
> > [5]: https://lists.apache.org/list?d...@parquet.apache.org:2024-10:dedupe
> > [6]: https://github.com/apache/arrow/pull/45360
> > [7]: https://github.com/kszucs/de
>
