I have a contact at Hugging Face who actually notified me of the blog
post. I can transmit any questions or suggestions if desired.

Regards

Antoine.


On Wed, 9 Oct 2024 11:50:51 +0100
Steve Loughran
<ste...@cloudera.com.INVALID> wrote:
> flatbuffer would be the obvious place, as there would be no compatibility
> issues with existing readers.
> 
> Also: that looks like a large amount of information to capture statistics
> on. Has anyone approached them yet?
> 
> On Wed, 9 Oct 2024 at 03:39, Gang Wu 
> <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> 
> > Thanks Antoine for sharing the blog post!
> >
> > I skimmed it quickly and it seems that the main issue is the absolute
> > file offsets used in the page and column chunk metadata. It may take a
> > long time to migrate if we want to replace them with relative offsets
> > in the current thrift definition. Perhaps this is a good chance to improve
> > it with the current flatbuffer experiment?
> >
> > Best,
> > Gang
> >
> > On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou 
> > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >  
> > >
> > > Hello,
> > >
> > > The Hugging Face developers published this insightful blog post about
> > > their attempts to deduplicate Parquet files when they have similar
> > > contents. They offer a couple suggestions for improvement at the end:
> > > https://huggingface.co/blog/improve_parquet_dedupe
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >  
> >  
> 


