I think the point that AI data = “regular” data nowadays, and that AI data tends to have these large blobs, is valid. This seems to be a shift in the industry, e.g. AI observability data is now just “regular” observability data (https://opentelemetry.io/docs/specs/semconv/gen-ai/). I think it’s also true in general that data volumes are growing over time, so what might have been 1kB values 5 years ago is now 5MB values.
We can say Parquet is not the right format for that, but I think that diminishes the use cases for Parquet in the future. Putting this data in external files and linking to it from Parquet is doable, but it adds a lot of complexity to implementations, especially if they want to support queries like `time_col > now() - '5 min' and large_text like '%foo%'`.

This proposal is interesting but focuses a lot on the write side of things. I’m more interested in the read side, but haven’t really explored how big values impact reading. It seems to me that the problem there would be more along the lines of row group structures, which can force inefficient IO patterns with a mix of small id-like columns and large blobs (I think; sketch 1 at the bottom of this mail, below the quoted text, shows what I mean).

The point about offloading to local temp files on disk is interesting. For someone running on fast SSDs that might be a viable solution; I’d be interested in trying it on our system. We have this problem but have mostly solved it by flushing more frequently when there are large blobs (roughly what sketch 2 below does), though that may be hurting us in other ways...

> On May 5, 2026, at 9:39 AM, Andrew Bell <[email protected]> wrote:
>
> Hi,
>
> Going out on a limb here, but maybe storing individual values that are
> hundreds of megabytes isn't really the best fit for Parquet files. Or at
> least this isn't a common-enough use case for shared/public files to
> warrant a complicating change in the format.
>
> Given the requests/proposals of late, I wonder if there isn't good reason
> for someone to come up with another file format that is made specifically
> to handle rows with tons of columns and/or very large values.
>
> On Mon, May 4, 2026 at 7:17 PM Daniel Weeks <[email protected]>
> wrote:
>
>> Hey Parquet Devs,
>>
>> The core problem is writer memory pressure caused by wide schemas and
>> asymmetric column sizes. Today a writer must buffer every column chunk in
>> memory until a row group is complete, because each column chunk must land
>> as a single contiguous byte range. For wide schemas, or schemas mixing
>> small fixed-width columns with very large variable-length values, this can
>> drive high memory usage even when individual pages are fully encoded,
>> compressed, and ready to flush, or it can result in row groups being
>> produced at inconsistent or inefficient boundaries.
>
>
> --
> Andrew Bell
> [email protected]
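Sketch 1 (read side): a quick way to look at the layout issue I mentioned is to dump each row group's column chunk byte ranges. This is just a pyarrow metadata dump, nothing from the proposal; "example.parquet" is a hypothetical blob-heavy file standing in for your own:

```python
# Sketch 1: print per-row-group column chunk offsets and sizes. In a file
# where a blob column dominates each row group, the chunks of a small
# id-like column end up far apart in the file, so a reader that only wants
# the small columns still has to seek across large gaps.
import pyarrow.parquet as pq

md = pq.ParquetFile("example.parquet").metadata  # hypothetical file
for rg_idx in range(md.num_row_groups):
    rg = md.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        cc = rg.column(col_idx)
        print(f"rg={rg_idx} col={cc.path_in_schema} "
              f"offset={cc.file_offset} compressed={cc.total_compressed_size}")
```

If the blob column pushes row groups to be huge, the small columns' chunks get scattered with big gaps between them, which is what I suspect defeats coalesced reads.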

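Sketch 2 (write side workaround): roughly the size-based flushing I described, assuming pyarrow; the schema and the `BLOB_FLUSH_BYTES` threshold are invented for illustration, not taken from our actual system. Each `write_table` call closes out a row group, which bounds writer memory at the cost of smaller, less efficient row groups (the "hurting us in other ways" part):

```python
# Sketch 2: cut a row group early once buffered bytes cross a threshold,
# instead of buffering a full-size row group when large blobs are present.
import pyarrow as pa
import pyarrow.parquet as pq

BLOB_FLUSH_BYTES = 64 * 1024 * 1024  # hypothetical threshold

schema = pa.schema([("id", pa.int64()), ("large_text", pa.string())])

def write_batches(path, batches):
    buffered, buffered_bytes = [], 0
    with pq.ParquetWriter(path, schema) as writer:
        for batch in batches:  # each batch is a pyarrow.RecordBatch
            buffered.append(batch)
            buffered_bytes += batch.nbytes
            if buffered_bytes >= BLOB_FLUSH_BYTES:
                # write_table closes out a row group, so writer memory is
                # bounded, at the cost of smaller row groups
                writer.write_table(pa.Table.from_batches(buffered, schema))
                buffered, buffered_bytes = [], 0
        if buffered:
            writer.write_table(pa.Table.from_batches(buffered, schema))
```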