Re: Improving Parquet Dedupe

Steve Loughran Wed, 09 Oct 2024 03:51:41 -0700

Flatbuffer would be the obvious place: there would be no compatibility
issues with existing readers.


Also: that looks like a large amount of information to capture statistics
on. Has anyone approached them yet?

On Wed, 9 Oct 2024 at 03:39, Gang Wu <ust...@gmail.com> wrote:

> Thanks Antoine for sharing the blog post!
>
> I skimmed it quickly and it seems that the main issue is the absolute
> file offset used by metadata of page and column chunk. It may take a
> long time to migrate if we want to replace them with relative offsets
> in the current thrift definition. Perhaps it is a good chance to improve
> this with the current flatbuffer experiment?
>
> Best,
> Gang
>
> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hello,
> >
> > The Hugging Face developers published this insightful blog post about
> > their attempts to deduplicate Parquet files when they have similar
> > contents. They offer a couple suggestions for improvement at the end:
> > https://huggingface.co/blog/improve_parquet_dedupe
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
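
To make the offset point in Gang's message concrete, here is a toy sketch in
Python. It is not Parquet's actual metadata layout: the function names and
the byte strings standing in for column chunks are made up for illustration.
It only shows why file-absolute offsets change for untouched data when
something is inserted earlier in the file, while offsets relative to the
enclosing row group do not.

```python
# Toy model only: these are NOT Parquet's real structures or field names.
# A "file" is a list of row groups; a row group is a list of chunk byte strings.


def absolute_offsets(row_groups):
    """Start of every chunk measured from the beginning of the file."""
    result, pos = [], 0
    for chunks in row_groups:
        starts = []
        for chunk in chunks:
            starts.append(pos)
            pos += len(chunk)
        result.append(starts)
    return result


def relative_offsets(row_groups):
    """Start of every chunk measured from the start of its own row group."""
    result = []
    for chunks in row_groups:
        starts, pos = [], 0
        for chunk in chunks:
            starts.append(pos)
            pos += len(chunk)
        result.append(starts)
    return result


old_file = [[b"aaaa", b"bb"], [b"cccccc", b"d"]]
# Same data, with one new row group inserted in front.
new_file = [[b"XXX", b"YY"]] + old_file

# Absolute offsets of the unchanged row groups shift, so their metadata
# bytes differ even though the data they describe is identical.
print(absolute_offsets(old_file), absolute_offsets(new_file)[1:])

# Relative offsets of the unchanged row groups are identical in both files.
print(relative_offsets(old_file), relative_offsets(new_file)[1:])
```

With relative offsets the metadata for unchanged row groups stays
byte-identical, which is the property a chunk-level deduplicator would need.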
