Hey Parquet Devs,

I would like to introduce a proposal that addresses issues arising from the Parquet format's physical layout requirement that column data be laid out contiguously.
The core problem is writer memory pressure caused by wide schemas and asymmetric column sizes. Today a writer must buffer every column chunk in memory until a row group is complete, because each column chunk must be written as a single contiguous byte range. For wide schemas, or schemas that mix small fixed-width columns with very large variable-length values, this drives high memory usage even when individual pages are fully encoded, compressed, and ready to flush, or it forces row groups to be cut at inconsistent or inefficient boundaries. This pressure is even more pronounced for emerging AI/ML use cases, which rely on data types and sizes atypical of traditional analytic workloads.
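As a rough back-of-envelope illustration of the effect (all numbers below are hypothetical, not taken from the proposal), consider a wide schema where every column has about one encoded page pending when the row group closes:

    # Back-of-envelope sketch; every number here is hypothetical.
    num_columns = 2_000                  # wide schema
    encoded_page = 1 * 1024 * 1024       # ~1 MiB encoded+compressed per column

    # Contiguous column chunks: every column stays buffered until the row
    # group closes, so peak memory scales with the whole row group.
    peak_contiguous = num_columns * encoded_page
    print(f"contiguous: ~{peak_contiguous / 2**30:.1f} GiB buffered")   # ~2.0 GiB

    # If a finished page could be flushed as soon as it is ready, the writer
    # would only hold the in-progress page buffer for each column.
    page_buffer = 64 * 1024              # in-progress buffer per column
    peak_paged = num_columns * page_buffer
    print(f"page-at-a-time: ~{peak_paged / 2**20:.0f} MiB buffered")    # ~125 MiB
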
Over the years, several improvements introduced to solve other challenges have effectively captured the information Parquet needs to lift the contiguity requirement on pages and column chunks. Other formats recognize these challenges and embrace a model where individual column segments are tracked at the metadata level rather than relying on physical contiguity in the file.

The document linked below contains the comprehensive proposal. Looking forward to your feedback.

Proposal: https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA

Thanks,
Dan