Re: Re: Improving Parquet Dedupe

Andrew Lamb Wed, 09 Oct 2024 17:39:08 -0700

I am sorry for the likely dumb question, but I think I am missing something.

The blog post says "This means that any modification is likely to rewrite
all the Column headers."

But my understanding of the parquet format is that the ColumnChunks[1] are
stored inline with the RowGroups, which are stored in the footer.

Thus I would expect that a parquet deduplication process could copy the
data for each row group memcpy style, and write a new footer with updated
offsets. This doesn't require rewriting the entire file, simply adjusting
offsets and writing a new footer.
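
To make the idea concrete, here is a minimal Python sketch of the copy-and-
adjust step. It is not a real Parquet writer: ColumnChunkMeta, RowGroupMeta
and copy_row_groups are hypothetical names, the metadata is reduced to three
fields that mirror data_page_offset, dictionary_page_offset and
total_compressed_size in parquet.thrift, and it assumes the column chunks of
a row group are laid out contiguously. A real implementation would also have
to adjust the other absolute offsets in the footer (for example the page
index and bloom filter offsets) and serialize the new footer with Thrift.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence, Tuple

MAGIC = b"PAR1"  # 4-byte magic at the start of a Parquet file


@dataclass(frozen=True)
class ColumnChunkMeta:
    """Hypothetical, simplified stand-in for the offsets in ColumnMetaData."""
    data_page_offset: int
    dictionary_page_offset: Optional[int]
    total_compressed_size: int

    def start(self) -> int:
        # Assumption: the chunk begins at its dictionary page when one exists.
        if self.dictionary_page_offset is not None:
            return self.dictionary_page_offset
        return self.data_page_offset


@dataclass(frozen=True)
class RowGroupMeta:
    columns: Tuple[ColumnChunkMeta, ...]


def span(rg: RowGroupMeta) -> Tuple[int, int]:
    """Byte range [start, end) covered by the row group's column chunks."""
    start = min(c.start() for c in rg.columns)
    end = max(c.start() + c.total_compressed_size for c in rg.columns)
    return start, end


def shift(rg: RowGroupMeta, delta: int) -> RowGroupMeta:
    """Return a copy of the metadata with every absolute offset moved by delta."""
    return RowGroupMeta(tuple(
        replace(
            c,
            data_page_offset=c.data_page_offset + delta,
            dictionary_page_offset=(
                None if c.dictionary_page_offset is None
                else c.dictionary_page_offset + delta
            ),
        )
        for c in rg.columns
    ))


def copy_row_groups(
    src: bytes, row_groups: Sequence[RowGroupMeta], keep: Sequence[int]
) -> Tuple[bytes, List[RowGroupMeta]]:
    """Copy the selected row groups byte-for-byte and rebuild their offsets."""
    out = bytearray(MAGIC)
    new_meta: List[RowGroupMeta] = []
    for i in keep:
        start, end = span(row_groups[i])
        # The row group's new position is wherever the output currently ends.
        new_meta.append(shift(row_groups[i], len(out) - start))
        out += src[start:end]  # memcpy-style copy, no decoding
    return bytes(out), new_meta
```

The page bytes are never decoded or re-encoded; only the offsets recorded in
the footer metadata change.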

Andrew


[1]
https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918


On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:

> Hi,
>
> I am the author of the blog here!
> Happy to answer any questions.
>
> There are a couple of parts, one is regarding relative pointers and a
> second is the row group chunking system (which for performance purposes
> could benefit from being implemented in the C/C++ layer). I am happy to
> help where I can with the latter as that can be done with the current
> Parquet version too.
>
> Thanks,
> Yucheng
>
> On 2024/10/09 15:46:01 Julien Le Dem wrote:
> > I recommended to them that they join the dev list. I think that's the
> > easiest to discuss.
> > IMO, it's a good goal to have relative pointers in the metadata so that a
> > row group doesn't depend on where it is in a file.
> > It looks like some aspects of making the data updates more incremental
> > could leverage Iceberg.
> >
> > On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
> >
> > >
> > > I have a contact at Hugging Face who actually notified me of the blog
> > > post. I can transmit any questions or suggestions if desired.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Wed, 9 Oct 2024 11:50:51 +0100
> > > Steve Loughran
> > > <st...@cloudera.com.INVALID> wrote:
> > > > flatbuffer would be the obvious place that would be no compatibility
> > > > issues with existing readers.
> > > >
> > > > Also: that looks like a large amount of information to capture
> > > > statistics on. Has anyone approached them yet?
> > > >
> > > > > On Wed, 9 Oct 2024 at 03:39, Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > >
> > > > > Thanks Antoine for sharing the blog post!
> > > > >
> > > > > > I skimmed it quickly and it seems that the main issue is the
> > > > > > absolute file offset used by metadata of page and column chunk.
> > > > > > It may take a long time to migrate if we want to replace them
> > > > > > with relative offsets in the current thrift definition. Perhaps
> > > > > > it is a good chance to improve this with the current flatbuffer
> > > > > > experiment?
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > > On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > The Hugging Face developers published this insightful blog post
> > > > > > > about their attempts to deduplicate Parquet files when they have
> > > > > > > similar contents. They offer a couple suggestions for improvement
> > > > > > > at the end:
> > > > > > https://huggingface.co/blog/improve_parquet_dedupe
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
