The PR has several approvals. If anybody still has major concerns, please
voice them.  I'll plan on merging on Monday.

Thanks,
Micah

On Fri, Dec 19, 2025 at 10:32 AM Micah Kornfield <[email protected]>
wrote:

> I also posted a PR updating the implementation status page to reflect that
> it isn't really file_path that is supported, but _metadata files (
> https://github.com/apache/parquet-site/pull/145).  I believe only
> hyparquet might have support for actually reading external columns.
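>
> For anyone unfamiliar with the _metadata pattern, here is a minimal sketch
> of how such an index file gets built with pyarrow's documented
> metadata_collector / set_file_path APIs (the part file names are made up):
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.table({"x": [1, 2, 3]})
>     collector = []
>     for name in ("part-0.parquet", "part-1.parquet"):
>         # Write each data file, then pull its footer metadata back out.
>         pq.write_table(table, name)
>         md = pq.read_metadata(name)
>         # This is where ColumnChunk.file_path gets populated: every
>         # column chunk in this footer now points at `name`.
>         md.set_file_path(name)
>         collector.append(md)
>     # A footer-only _metadata file indexing all row groups in both files.
>     pq.write_metadata(table.schema, "_metadata", metadata_collector=collector)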
>
> Thanks,
> Micah
>
> On Wed, Dec 10, 2025 at 11:49 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Based on a conversation in the sync today, we thought that not explicitly
>> deprecating the field, but instead providing guidance and documenting that
>> new uses of the field should go through a feature-addition process, is
>> probably a good path forward.
>>
>> I put up https://github.com/apache/parquet-format/pull/542 as a straw-man
>> to capture this.
>>
>> On Tue, Dec 9, 2025 at 6:33 PM Julien Le Dem <[email protected]> wrote:
>>
>>> IMO Iceberg needs to be aware of Parquet files referencing others so that
>>> it can prune older snapshots correctly and not delete parquet files
>>> referenced by others when deleting old snapshots. Depending on whether
>>> this is a cross-table or within-table file-to-file reference, it could be
>>> more or less complicated.
>>>
>>> I could imagine starting with a simple implementation for the write path:
>>> "CREATE TABLE foo USING parquet OPTIONS (reference_original_columns = true)
>>> AS SELECT content, extract(content) AS metadata FROM bar"
>>> That would be constrained to simple plans that have the scan and the output
>>> in the same step (map only), so that rows are in the same order per file.
>>> Alternatively, "ALTER TABLE foo ADD COLUMN metadata OPTIONS (column_family
>>> = 'bar')" with a subsequent "UPDATE table SET metadata = extract(content)"
>>> to create those files.
>>>
>>> (just some random thoughts, I'm sure others have spent more time thinking
>>> about this)
>>>
>>> This doesn't seem that different from the mechanism for creating a
>>> deletion vector in Iceberg.
>>>
>>> It could also be seen as a view in Iceberg joining on _row_id.
>>>
>>> This can be a topic in the meeting tomorrow.
>>>
>>> On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:
>>>
>>> > Thanks for the context Kenny.  That example is very similar to some of
>>> > the cases that come up in the multi-modal scenarios.
>>> >
>>> > I agree that we're in a little bit of a difficult situation due to lack
>>> > of existing support, which also leads to Micah's concern that it's a
>>> > point of confusion for implementers.
>>> >
>>> > I would be in favor of adding some additional context to the description
>>> > because there are some basic things implementers should do (e.g. validate
>>> > that the file path is either not set or set to the current file being
>>> > read if they don't support disaggregated column data).  While older
>>> > clients will likely break if they encounter files written this way,
>>> > there's almost no risk that it would result in silent failures or
>>> > corruption, as I suspect most implementations will read the ranges from
>>> > the referencing file and not be able to interpret them.
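>>> >
>>> > To make that concrete, a sketch of the kind of validation I mean,
>>> > written against pyarrow's metadata API rather than parquet-java (so a
>>> > hypothetical check, not any implementation's actual code):
>>> >
>>> >     import pyarrow.parquet as pq
>>> >
>>> >     def reject_external_chunks(path):
>>> >         md = pq.read_metadata(path)
>>> >         for i in range(md.num_row_groups):
>>> >             rg = md.row_group(i)
>>> >             for j in range(rg.num_columns):
>>> >                 fp = rg.column(j).file_path
>>> >                 # file_path is unset for chunks stored in the file itself.
>>> >                 if fp and fp != path:
>>> >                     raise NotImplementedError(
>>> >                         f"column chunk ({i}, {j}) references {fp!r}; "
>>> >                         "disaggregated column data is not supported")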
>>> >
>>> > Adding a read path is relatively straightforward (at least in the java
>>> > implementation, for both stream and vectored IO reads), but the write
>>> > path is where things get more complicated.
>>> >
>>> > I think we want to discuss some of these use cases in more detail and
>>> > see if they are practical and reasonable.  Some cases may make more
>>> > sense at a higher level (like table metadata) while others may make
>>> > sense to handle at the file level (like asymmetric column sizes).
>>> >
>>> > -Dan
>>> >
>>> >
>>> >
>>> > On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]>
>>> > wrote:
>>> >
>>> > > Since I was the one who brought up file_path at the sync a couple
>>> > > weeks ago, I'll share my thoughts:
>>> > >
>>> > > I am interested in the file_path field for column chunks because it
>>> > > would allow for some extremely efficient data engineering in specific
>>> > > cases like *adding a column to existing data*.
>>> > >
>>> > > My use case is LLM data. LLM data is often huge piles of text in
>>> > > parquet format (see: all of huggingface, or any llm request/response
>>> > > logs). If I have a 400 MB source.parquet file, how can I annotate
>>> > > each row with an added "score" column efficiently? I would prefer not
>>> > > to have to copy all 400 MB of data just to add a "score" column. It
>>> > > would be slick if I could make a new annotated.parquet file that
>>> > > points to source.parquet for the source columns, and then only
>>> > > includes the new "score" column in the annotated.parquet file. The
>>> > > source.parquet would remain 400 MB, and the annotated parquet could
>>> > > be ~10 KB and incorporate the source data by reference.
>>> > >
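>>> > > To illustrate, if a reader honored file_path, deciding which bytes to
>>> > > fetch from which file is just footer bookkeeping. A sketch with
>>> > > pyarrow, which can read these footer fields today even though it
>>> > > won't follow them (annotated.parquet is the hypothetical file above):
>>> > >
>>> > >     import pyarrow.parquet as pq
>>> > >
>>> > >     md = pq.read_metadata("annotated.parquet")
>>> > >     for i in range(md.num_row_groups):
>>> > >         rg = md.row_group(i)
>>> > >         for j in range(rg.num_columns):
>>> > >             col = rg.column(j)
>>> > >             # An unset file_path means the chunk lives in this file;
>>> > >             # "score" would resolve here, the text columns to
>>> > >             # source.parquet.
>>> > >             target = col.file_path or "annotated.parquet"
>>> > >             print(col.path_in_schema, "->", target,
>>> > >                   "offset:", col.data_page_offset,
>>> > >                   "bytes:", col.total_compressed_size)
>>> > >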
>>> > > As the implementer of hyparquet I have conflicting opinions on this
>>> > > feature. On the one hand, it's a cool capability, already built into
>>> > > parquet. On the other hand... none of the parquet implementations
>>> > > support it. Hyparquet has a branch for reading/writing file_path that
>>> > > I used for testing. It does work. But I don't want to ship it unless
>>> > > there's at least ONE other implementation that supports it (there
>>> > > isn't).
>>> > >
>>> > > I agree that this would be better implemented at the table format
>>> > > level (e.g. Iceberg). BUT... *Iceberg does not support my add-column
>>> > > use case*! The problem is that, despite parquet being a
>>> > > column-oriented format, Iceberg has no support for efficiently
>>> > > zipping a new column with existing data. The only option for "add
>>> > > column" in Iceberg would be to *add a column with default values and
>>> > > then re-write every row* (including the heavy text data). So Iceberg
>>> > > fails to solve my problem at all.
>>> > >
>>> > > Anyway, I'm fine with deprecating, or not. But I did want to at least
>>> > > make the case that it could serve a purpose that I don't see any
>>> > > other good way of solving at the moment.
>>> > >
>>> > > Kenny
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]>
>>> > > wrote:
>>> > >
>>> > > > Hi Dan,
>>> > > >
>>> > > > > However, there are ongoing discussions around multi-modal cases
>>> > > > > where either separating large columns (e.g. inline blobs) or
>>> > > > > appending column data without rewriting existing data may
>>> > > > > leverage this.
>>> > > >
>>> > > >
>>> > > > Do you have any design docs or mailing list discussions you can
>>> > > > point to?
>>> > > >
>>> > > > > I don't feel that leaving this as-is for now, while we explore
>>> > > > > those use cases, would cause any additional confusion/complexity.
>>> > > >
>>> > > >
>>> > > > Agreed, it isn't urgent to clean this up. But having a more
>>> > > > concrete timeline would be helpful; this does seem to be a
>>> > > > semi-regular source of confusion for folks, so it would be nice to
>>> > > > tie up the loose end.
>>> > > >
>>> > > > Thanks,
>>> > > > Micah
>>> > > >
>>> > > > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]>
>>> > > > wrote:
>>> > > >
>>> > > > > I'd actually prefer that we don't deprecate this field (at least
>>> > > > > not immediately).
>>> > > > >
>>> > > > > Recognizing that we've discussed separating column data into
>>> > > > > multiple files for over a decade without any concrete
>>> > > > > implementations, there are emerging use cases that may benefit
>>> > > > > from investing in this feature.
>>> > > > >
>>> > > > > Many of the use cases in the past have been misaligned (e.g.
>>> > > > > separating column data for security/encryption), and better
>>> > > > > alternatives addressed those scenarios.
>>> > > > >
>>> > > > > However, there are ongoing discussions around multi-modal cases
>>> > > > > where either separating large columns (e.g. inline blobs) or
>>> > > > > appending column data without rewriting existing data may
>>> > > > > leverage this.
>>> > > > >
>>> > > > > I don't feel that leaving this as-is for now, while we explore
>>> > > > > those use cases, would cause any additional confusion/complexity.
>>> > > > >
>>> > > > > -Dan
>>> > > > >
>>> > > > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]>
>>> > > > > wrote:
>>> > > > >
>>> > > > > > > What does "deprecated" entail here? Do we plan to remove
>>> > > > > > > this field from the format? Otherwise, is it just
>>> > > > > > > documentation?
>>> > > > > >
>>> > > > > > I was imagining just documentation, since we don't want to
>>> > > > > > break the "_metadata file" use case.
>>> > > > > >
>>> > > > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]>
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > > >
>>> > > > > > > What does "deprecated" entail here? Do we plan to remove
>>> > > > > > > this field from the format? Otherwise, is it just
>>> > > > > > > documentation?
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Mon, 1 Dec 2025 12:09:18 -0800
>>> > > > > > > Micah Kornfield <[email protected]>
>>> > > > > > > wrote:
>>> > > > > > > > This has come up a few times in the sync and other forums.
>>> > > > > > > > I wanted to start the conversation about deprecating
>>> > > > > > > > file_path
>>> > > > > > > > <https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962>
>>> > > > > > > > [1] in the parquet footer.
>>> > > > > > > >
>>> > > > > > > > Outside of the "_metadata" file index use case (effectively
>>> > > > > > > > a poor man's table format), I don't think this is used or
>>> > > > > > > > implemented in any reader.
>>> > > > > > > >
>>> > > > > > > > With the rise of table formats, it seems like a reasonable
>>> > > > > > > > design choice to push the complexity of referencing columns
>>> > > > > > > > across files to the table level and keep parquet focused on
>>> > > > > > > > single-file storage (encodings, indexing, etc).
>>> > > > > > > >
>>> > > > > > > > Implementing this at the file level can also be challenging,
>>> > > > > > > > e.g. in the context of knowing all the credentials one might
>>> > > > > > > > need to read from different objects on object storage.
>>> > > > > > > >
>>> > > > > > > > Thoughts/Objections?
>>> > > > > > > >
>>> > > > > > > > Thanks,
>>> > > > > > > > Micah
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > [1]
>>> > > > > > > > https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
