flatbuffer would be the obvious place, as there would be no compatibility issues with existing readers.
Also: that looks like a large amount of information to capture statistics on. Has anyone approached them yet?

On Wed, 9 Oct 2024 at 03:39, Gang Wu <ust...@gmail.com> wrote:

> Thanks Antoine for sharing the blog post!
>
> I skimmed it quickly and it seems that the main issue is the absolute
> file offset used by metadata of page and column chunk. It may take a
> long time to migrate if we want to replace them with relative offsets
> in the current thrift definition. Perhaps it is a good chance to improve
> this with the current flatbuffer experiment?
>
> Best,
> Gang
>
> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <anto...@python.org> wrote:
>
> > Hello,
> >
> > The Hugging Face developers published this insightful blog post about
> > their attempts to deduplicate Parquet files when they have similar
> > contents. They offer a couple suggestions for improvement at the end:
> > https://huggingface.co/blog/improve_parquet_dedupe
> >
> > Regards
> >
> > Antoine.