Hi Daniel,

Interesting problem; it's good that we are thinking about this.

Memory pressure is definitely a problem, particularly for row-wise writers
that must buffer all the column chunks for the row group before writing
them.

Trivia: Most open-source column-wise writers I've seen also buffer all the
column chunks for the row group before writing them, although they don't
strictly need to.  DuckDB has a really interesting hybrid that serializes
eight columns at a time, which I'd really like to understand better.
Arrow-cpp's WriteTable processes one column at a time, which has clear
memory advantages.
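
For what it's worth, the usual workaround today is to flush smaller row
groups incrementally so the writer only ever holds one group's worth of
column chunks.  A rough pyarrow sketch (file name, schema and sizes are
all invented):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical schema mixing a small fixed-width column with large blobs.
    schema = pa.schema([("id", pa.int64()), ("payload", pa.binary())])

    with pq.ParquetWriter("sketch.parquet", schema) as writer:
        for start in range(0, 1_000, 100):
            batch = pa.table({
                "id": pa.array(range(start, start + 100), type=pa.int64()),
                "payload": pa.array([b"x" * 100_000] * 100, type=pa.binary()),
            }, schema=schema)
            # Each write_table call closes out a row group, so at most one
            # (small) group's worth of column chunks is buffered at a time.
            writer.write_table(batch)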

Also, many modern writers split into row groups based on row count rather
than byte size (and of those that do use byte size, some measure
compressed size while others measure uncompressed size).  They generally
aren't prepared for a handful of rows carrying very large values in
certain columns.
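
To make the row-count trap concrete, a hedged pyarrow sketch (file name
and sizes invented): row_group_size counts rows, so ~100 KB values
quietly turn a "100-row" group into roughly 10 MB of buffered data, and
nothing stops a single multi-megabyte value from blowing that up further.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": pa.array(range(1_000), type=pa.int64()),
        # ~100 KB per value: row count is a poor proxy for byte size here.
        "payload": pa.array([b"x" * 100_000] * 1_000, type=pa.binary()),
    })

    # row_group_size is a *row count*; with the payload column above, each
    # 100-row group is still roughly 10 MB of uncompressed data.
    pq.write_table(table, "row_count_split.parquet", row_group_size=100)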

Of course the row-group size is not just about the writer's memory
management; it's also a contract with the reader about how much memory
the reader will need.  Parquet writers that can efficiently generate very
large row groups can easily produce Parquet files that even mainstream
readers with ample RAM cannot read.
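
You can see that contract directly in the footer; a small pyarrow sketch
(file name hypothetical), where total_byte_size is roughly the floor on
what a reader must materialize to scan even one row of the group:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("big_row_groups.parquet")
    rg = pf.metadata.row_group(0)
    print(rg.num_rows, "rows,", rg.total_byte_size, "uncompressed bytes")

    # Reading even a single row group decompresses the whole thing into
    # memory; with very large groups this is where readers fall over.
    table = pf.read_row_group(0)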

One implementation detail concerns using `data_page_offset = -1` as a
marker.  Most open-source readers I've seen fail (often with an I/O
error) if you access such a column.  arrow-rs will actually panic if any
column chunk in the footer has `data_page_offset = -1`, even if that
column isn't in the projection (and a panic can map to SIGABRT in Rust,
depending on a compiler setting).  DuckDB silently returns corrupt data.
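
For anyone poking at this, a small pyarrow sketch (file name
hypothetical) that dumps what the footer actually records for each
column chunk; a writer experimenting with the -1 marker can at least
sanity-check its own output this way:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("suspect.parquet").metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            cc = md.row_group(rg).column(col)
            # data_page_offset comes straight from the Thrift ColumnMetaData;
            # a -1 sentinel here is what trips readers up at scan time.
            print(rg, cc.path_in_schema, cc.data_page_offset)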

Other approaches might exist.  For example, there could be a new
DataPageV3 (yes, the point is precisely that readers which don't know
about it fail when they encounter it!).  A DataPageV3 wouldn't store the
data inline; instead, it would contain the OffsetIndex.  (The expectation
that the OffsetIndex sits outside row groups, which seems a strange
insistence in the spec, would be relaxed.)  Engines unaware of DataPageV3
that always seek via the offset index would still read the real data.
Most don't, though, so they would fail.

It's easy to imagine future Parquet files that have one big row group and
discontinuous column chunks.  It's basically halfway to Lance?

Many engines today pay no attention to page indexes, or if they do, they
use them for stats rather than for seeking.  The new discontinuous column
chunks might force a significant architectural change for those engines.

Fun thinking about this kind of stuff,
Will


On Tue, 5 May 2026 at 01:18, Daniel Weeks <[email protected]> wrote:

> Hey Parquet Devs,
>
> I would like to introduce a proposal that addresses the issues arising from
> the physical layout requirements in the Parquet format that necessitate
> contiguous data for columnar data.
>
> Over the years, several improvements were introduced to solve other
> challenges, effectively capturing the necessary information for Parquet to
> lift the contiguity requirement on pages and column chunks.
>
> Other formats recognize these challenges and embrace a model where
> individual column segments are tracked at the metadata level but do not
> rely on physical contiguity in the file.
>
> The core problem is writer memory pressure caused by wide schemas and
> asymmetric column sizes. Today a writer must buffer every column chunk in
> memory until a row group is complete, because each column chunk must land
> as a single contiguous byte range. For wide schemas, or schemas mixing
> small fixed-width columns with very large variable-length values, this can
> drive high memory usage even when individual pages are fully encoded,
> compressed, and ready to flush, or it can result in row groups being
> produced at inconsistent or inefficient boundaries.
>
> This characteristic is more pronounced for emerging AI/ML use cases that
> rely on data types and sizes atypical for traditional analytic use cases.
>
> The document linked below includes a comprehensive proposal. Looking
> forward to your feedback.
>
> Proposal:
>
> https://docs.google.com/document/d/1nntcYM98PFSkHT70RexSBPtCnWqg1uRJ5_7m--ZgbsA
>
> Thanks,
> Dan
>
