I recommended that they join the dev list; I think that's the easiest place to discuss this. IMO, it's a good goal to have relative pointers in the metadata so that a row group doesn't depend on where it is in a file. It also looks like some aspects of making the data updates more incremental could leverage Iceberg.
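
To illustrate the relative-pointer idea, here is a minimal Python sketch. It is not Parquet's actual Thrift metadata layout: `chunk_offsets`, `resolve`, and the chunk sizes are hypothetical names and values chosen only to show why absolute offsets tie a row group's metadata to its position in the file, while relative offsets do not.

```python
def chunk_offsets(chunk_sizes, base=0):
    """Return the start offset of each column chunk, counted from `base`."""
    offsets = []
    pos = base
    for size in chunk_sizes:
        offsets.append(pos)
        pos += size
    return offsets


def resolve(row_group_start, relative_offsets):
    """Turn offsets relative to the row group into absolute file offsets."""
    return [row_group_start + off for off in relative_offsets]


# Hypothetical column chunk sizes (bytes) for one row group.
chunk_sizes = [100, 250, 50]

# Absolute offsets: the same row group written at two file positions
# produces different metadata, even though its bytes are identical.
abs_a = chunk_offsets(chunk_sizes, base=4)
abs_b = chunk_offsets(chunk_sizes, base=1004)

# Relative offsets: the metadata is the same wherever the row group lands;
# a reader only needs the row group's start to recover absolute positions.
rel = chunk_offsets(chunk_sizes, base=0)

print(abs_a, abs_b, rel, resolve(1004, rel))
```

With relative offsets, inserting or removing data earlier in the file would only change the single recorded start position of each later row group, not every page and column chunk offset inside it.
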

On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <anto...@python.org> wrote:
>
> I have a contact at Hugging Face who actually notified me of the blog
> post. I can transmit any questions or suggestions if desired.
>
> Regards
>
> Antoine.
>
>
> On Wed, 9 Oct 2024 11:50:51 +0100
> Steve Loughran <ste...@cloudera.com.INVALID> wrote:
> > flatbuffer would be the obvious place that would be no compatibility
> > issues with existing readers.
> >
> > Also: that looks like a large amount of information to capture
> > statistics on. Has anyone approached them yet?
> >
> > On Wed, 9 Oct 2024 at 03:39, Gang Wu
> > <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >
> > > Thanks Antoine for sharing the blog post!
> > >
> > > I skimmed it quickly and it seems that the main issue is the absolute
> > > file offset used by metadata of page and column chunk. It may take a
> > > long time to migrate if we want to replace them with relative offsets
> > > in the current thrift definition. Perhaps it is a good chance to
> > > improve this with the current flatbuffer experiment?
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou
> > > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > The Hugging Face developers published this insightful blog post about
> > > > their attempts to deduplicate Parquet files when they have similar
> > > > contents. They offer a couple suggestions for improvement at the end:
> > > > https://huggingface.co/blog/improve_parquet_dedupe
> > > >
> > > > Regards
> > > >
> > > > Antoine.