The current flatbuf footer prototype <https://github.com/apache/arrow/pull/43793> uses relative offsets for ColumnChunk and ColumnMetaData. If we extend the same approach to the column headers in the column chunks themselves, this would fix problem (1). Alternatively, we could avoid writing ColumnChunk and ColumnMetaData there, since we already have that information in the footer.
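
To illustrate why relative offsets help deduplication, here is a rough sketch in Python. It is not the Parquet layout or the prototype's encoding: the 8-byte header, the `encode_headers` helper and the chunk sizes are all made up for the example. It only shows that when an early chunk grows, headers storing absolute positions change for every later chunk, while headers storing positions relative to their own chunk stay byte-identical.

```python
import struct

def encode_headers(chunk_sizes, header_len=8, relative=False):
    """Return per-chunk header bytes for a hypothetical file layout.

    Each chunk is laid out as [header][data]. The header stores where the
    data starts, either as an absolute file position or as an offset
    relative to the start of the header itself.
    """
    headers = []
    pos = 0  # absolute position of the current chunk's header
    for size in chunk_sizes:
        data_start = pos + header_len
        stored = header_len if relative else data_start
        headers.append(struct.pack("<q", stored))
        pos = data_start + size
    return headers

def unchanged(before, after):
    """Count headers that are byte-identical between two file versions."""
    return sum(a == b for a, b in zip(before, after))

old_sizes = [100, 200, 300, 400]
new_sizes = [150, 200, 300, 400]  # only the first chunk grew

abs_same = unchanged(encode_headers(old_sizes), encode_headers(new_sizes))
rel_same = unchanged(
    encode_headers(old_sizes, relative=True),
    encode_headers(new_sizes, relative=True),
)
print(f"absolute offsets: {abs_same} of {len(old_sizes)} headers unchanged")
print(f"relative offsets: {rel_same} of {len(old_sizes)} headers unchanged")
```

With absolute offsets only the header before the modified data survives unchanged; with relative offsets all of them do, so a byte-level deduplicator has nothing new to store for the untouched chunks.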

On Wed, Oct 9, 2024 at 2:50 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hello,
>
> The Hugging Face developers published this insightful blog post about
> their attempts to deduplicate Parquet files when they have similar
> contents. They offer a couple suggestions for improvement at the end:
> https://huggingface.co/blog/improve_parquet_dedupe
>
> Regards
>
> Antoine.