Re: Improving Parquet Dedupe

Julien Le Dem Wed, 09 Oct 2024 08:47:05 -0700

I recommended that they join the dev list. I think that's the easiest
place to discuss this.
IMO, it's a good goal to have relative pointers in the metadata so that a
row group doesn't depend on where it is in a file.
It looks like some aspects of making the data updates more incremental
could leverage Iceberg.
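
To illustrate the relative-pointer idea, here is a minimal Python sketch.
It is not the Parquet format or any library's API: ChunkMeta,
absolute_meta, relative_meta and resolve are hypothetical names, and the
byte sizes are made up. It only shows that metadata with row-group-relative
offsets stays byte-identical when the same row group lands at a different
position in a file, while metadata with absolute offsets does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical stand-in for per-column-chunk metadata."""
    offset: int  # where the column chunk starts
    length: int  # chunk size in bytes


def absolute_meta(row_group_start, chunk_sizes):
    """Offsets measured from the start of the file."""
    metas, pos = [], row_group_start
    for size in chunk_sizes:
        metas.append(ChunkMeta(pos, size))
        pos += size
    return metas


def relative_meta(chunk_sizes):
    """Assumed alternative: offsets measured from the row group start."""
    return absolute_meta(0, chunk_sizes)


def resolve(row_group_start, meta):
    """Turn a relative offset back into a file position at read time."""
    return row_group_start + meta.offset


chunk_sizes = [100, 250, 40]  # made-up sizes for three column chunks

# The same row group written at two different file positions,
# e.g. because 1000 bytes were inserted ahead of it.
abs_a = absolute_meta(4, chunk_sizes)
abs_b = absolute_meta(4 + 1000, chunk_sizes)
rel_a = relative_meta(chunk_sizes)
rel_b = relative_meta(chunk_sizes)

print(abs_a == abs_b)  # absolute metadata differs between the two files
print(rel_a == rel_b)  # relative metadata is identical, so it can dedupe
```

The reader only needs the row group's own start position to recover file
positions via resolve(), so that position is the single value that changes
when data moves.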


On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <anto...@python.org> wrote:

>
> I have a contact at Hugging Face who actually notified me of the blog
> post. I can transmit any questions or suggestions if desired.
>
> Regards
>
> Antoine.
>
>
> On Wed, 9 Oct 2024 11:50:51 +0100
> Steve Loughran
> <ste...@cloudera.com.INVALID> wrote:
> > flatbuffer would be the obvious place, as there would be no
> > compatibility issues with existing readers.
> >
> > Also: that looks like a large amount of information to capture statistics
> > on. Has anyone approached them yet?
> >
> > On Wed, 9 Oct 2024 at 03:39, Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > Thanks Antoine for sharing the blog post!
> > >
> > > I skimmed it quickly and it seems that the main issue is the absolute
> > > file offsets used in the page and column chunk metadata. It may take a
> > > long time to migrate if we want to replace them with relative offsets
> > > in the current thrift definition. Perhaps it is a good chance to
> > > improve this as part of the current flatbuffer experiment?
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > The Hugging Face developers published this insightful blog post about
> > > > their attempts to deduplicate Parquet files when they have similar
> > > > contents. They offer a couple suggestions for improvement at the end:
> > > > https://huggingface.co/blog/improve_parquet_dedupe
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > >
> >
>
>
>
>
