The PR has several approvals; if anybody still has major concerns, please voice them. I plan to merge on Monday.

Thanks,
Micah

On Fri, Dec 19, 2025 at 10:32 AM Micah Kornfield <[email protected]> wrote:

I also posted a PR updating the implementation status page to reflect that it's not really file_path that is supported, but _metadata files (https://github.com/apache/parquet-site/pull/145). I believe only hyparquet might have support for actually reading external columns.

Thanks,
Micah

On Wed, Dec 10, 2025 at 11:49 PM Micah Kornfield <[email protected]> wrote:

Based on a conversation in the sync today, we thought that not explicitly deprecating the field, but instead providing guidance and documenting that new uses of the field should go through a feature-addition process, is probably a good path forward.

I put up https://github.com/apache/parquet-format/pull/542 as a straw man to capture this.

On Tue, Dec 9, 2025 at 6:33 PM Julien Le Dem <[email protected]> wrote:

IMO Iceberg needs to be aware of Parquet files referencing others, so that it can prune older snapshots correctly and not delete Parquet files referenced by others when deleting old snapshots. Whether this is a cross-table or within-table file-to-file reference could make it more or less complicated.

I could imagine starting with a simple implementation for the write path:

    CREATE TABLE foo USING parquet OPTIONS (reference_original_columns = true)
    AS SELECT content, extract(content) AS metadata FROM bar

That would be constrained to simple plans that have the scan and the output in the same step (map only), so that rows are in the same order per file. Alternatively:

    ALTER TABLE foo ADD COLUMN metadata OPTIONS (column_family = 'bar')

with a subsequent

    UPDATE table SET metadata = extract(content)

to create those files.

(Just some random thoughts; I'm sure others have spent more time thinking about this.)

This doesn't seem that different from the mechanism for creating a deletion vector in Iceberg.

It could also be seen as a view in Iceberg joining on _row_id.

This can be a topic in the meeting tomorrow.

On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:

Thanks for the context, Kenny. That example is very similar to some of the cases that come up in the multi-modal scenarios.

I agree that we're in a bit of a difficult situation due to the lack of existing support, which also leads to Micah's concern that it's a point of confusion for implementers.

I would be in favor of adding some additional context to the description, because there are some basic things implementers should do (e.g. validate that the file path is either not set or set to the current file being read, if they don't support disaggregated column data). While older clients will likely break if they encounter files written this way, there's almost no risk that it would result in silent failures or corruption, as I suspect most implementations will read the ranges from the referencing file and not be able to interpret them.

Adding a read path is relatively straightforward (at least in the Java implementation, for both stream and vectored IO reads), but the write path is where things get more complicated.

I think we want to discuss some of these use cases in more detail and see if they are practical and reasonable. Some cases may make more sense at a higher level (like table metadata), while others may make sense to handle at the file level (like asymmetric column sizes).

-Dan
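
A minimal sketch of the defensive check Dan describes, assuming thrift-generated Python bindings for parquet.thrift; the field names row_groups, columns, and file_path come from the spec, while the function and its arguments are hypothetical:

    def check_column_chunks(footer, current_file_name):
        # Readers that don't support disaggregated column data should
        # reject any column chunk whose file_path names another file.
        for row_group in footer.row_groups:
            for chunk in row_group.columns:
                path = chunk.file_path
                if path is not None and path != current_file_name:
                    raise NotImplementedError(
                        f"column chunk stored in external file {path!r} "
                        "is not supported by this reader"
                    )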

On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:

Since I was the one who brought up file_path at the sync a couple of weeks ago, I'll share my thoughts:

I am interested in the file_path field for column chunks because it would allow for some extremely efficient data engineering in specific cases, like *adding a column to existing data*.

My use case is LLM data. LLM data is often huge piles of text in Parquet format (see: all of Hugging Face, or any LLM request/response logs). If I have a 400 MB source.parquet file, how can I annotate each row with an added "score" column efficiently? I would prefer not to have to copy all 400 MB of data just to add a "score" column. It would be slick if I could make a new annotated.parquet file that points to source.parquet for the source columns and only includes the new "score" column in the annotated.parquet file itself. The source.parquet would remain 400 MB; the annotated Parquet file could be ~10 KB and incorporate the source data by reference.

As the implementer of hyparquet, I have conflicting opinions on this feature. On the one hand, it's a cool capability, already built into Parquet. On the other hand... none of the Parquet implementations support it. Hyparquet has a branch for reading/writing file_path that I used for testing. It does work. But I don't want to ship it unless there's at least ONE other implementation that supports it (there isn't).

I agree that this would be better implemented at the table-format level (e.g. Iceberg). BUT... *Iceberg does not support my adding-column use case*! The problem is that, despite Parquet being a column-oriented format, Iceberg has no support for efficiently zipping a new column with existing data. The only option for "add column" in Iceberg would be to *add a column with default values and then rewrite every row* (including the heavy text data). So Iceberg fails to solve my problem at all.

Anyway, I'm fine with deprecating, or not. But I did want to at least make the case that it could serve a purpose that I don't see any other good way of solving at the moment.

Kenny
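
To make Kenny's layout concrete, here is a rough sketch of the column-chunk metadata annotated.parquet would carry, using a stand-in Python dataclass that mirrors the relevant parquet.thrift fields (file_path, file_offset); the paths, offsets, and two-column schema are purely illustrative:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ColumnChunk:
        file_path: Optional[str]  # None: pages live in this file
        file_offset: int          # byte offset of the chunk in that file

    # annotated.parquet: "content" is incorporated by reference from
    # source.parquet; only the new "score" column's pages are local.
    row_group_columns = [
        ColumnChunk(file_path="source.parquet", file_offset=4),
        ColumnChunk(file_path=None, file_offset=4),
    ]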

On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]> wrote:

Hi Dan,

> However, there are ongoing discussions around multi-modal cases where either separating large columns (e.g. inline blobs) or appending column data without rewriting existing data may leverage this.

Do you have any design docs or mailing-list discussions you can point to?

> I don't feel like leaving this for now while we explore those use cases would cause any additional confusion/complexity.

Agreed, it isn't urgent to clean this up. But having a more concrete timeline would be helpful; this does seem to be a semi-regular source of confusion for folks, so it would be nice to tie up this loose end.

Thanks,
Micah

On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:

I'd actually prefer that we don't deprecate this field (at least not immediately).

Recognizing that we've discussed separating column data into multiple files for over a decade without any concrete implementations, there are emerging use cases that may benefit from investing in this feature.

Many of the use cases in the past have been misaligned (e.g. separating column data for security/encryption), and better alternatives addressed those scenarios.

However, there are ongoing discussions around multi-modal cases where either separating large columns (e.g. inline blobs) or appending column data without rewriting existing data may leverage this.

I don't feel like leaving this for now, while we explore those use cases, would cause any additional confusion/complexity.

-Dan

On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]> wrote:

> What does "deprecated" entail here? Do we plan to remove this field from the format? Otherwise, is it just documentation?

I was imagining just documentation, since we don't want to break the "_metadata file" use case.

On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]> wrote:

What does "deprecated" entail here? Do we plan to remove this field from the format? Otherwise, is it just documentation?

On Mon, 1 Dec 2025 12:09:18 -0800, Micah Kornfield <[email protected]> wrote:

This has come up a few times in the sync and other forums. I wanted to start the conversation about deprecating file_path [1] in the Parquet footer.

Outside of the "_metadata" file index use case, I don't think this is used or implemented in any reader (it is effectively a poor man's table format).

With the rise of table formats, it seems like a reasonable design choice to push the complexity of referencing columns across files to the table level and keep Parquet focused on single-file storage (encodings, indexing, etc.).

Implementing this at the file level can also be challenging in the context of knowing all the credentials one might need to read from different objects on object storage.

Thoughts/Objections?

Thanks,
Micah

[1] https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
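
For context on the one established use mentioned above: a "_metadata" index file is a footer-only Parquet file summarizing a directory of data files, with every column chunk's file_path pointing at the data file that actually holds the pages. A self-contained sketch using the same kind of stand-in dataclass as earlier (file names and offsets are illustrative):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ColumnChunk:
        file_path: Optional[str]
        file_offset: int

    # A "_metadata" file stores no pages itself: every column chunk is an
    # external reference into one of the data files it indexes.
    index_columns = [
        ColumnChunk(file_path="part-0000.parquet", file_offset=4),
        ColumnChunk(file_path="part-0001.parquet", file_offset=4),
    ]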
