I think the main argument here is whether the behavioral
change to this field will actually break any reader implementation.
Considering the current state of parquet-java, I would guess no.
But I agree that let's modify the comment of the spec to make it
clear and do the right thing on the writer's side to be consistent.

Best,
Gang

On Tue, Jun 25, 2024 at 10:30 PM Ed Seidl <etse...@live.com> wrote:

> The issue I have is that we're currently in a position where a file
> written to the letter of the specification will likely be readable by
> none of the major parquet implementations. (I'm going to test this
> hypothesis today). If the ColumnMetaData in the footer is de facto
> required, then I think we should at a minimum change the thrift to make
> it so. Similarly, the reference implementation (parquet-java) currently
> does not write the required metadata, and sets file_offset to an invalid
> (but valid seeming) value. If we don't change the requiredness of
> file_offset, then either parquet-java needs to start writing the
> metadata inline with the chunk data and set file_offset correctly, or,
> as I've proposed elsewhere[1], simply write 0 for the required field,
> with the understanding that this means the metadata is not present (and
> modify the wording in the spec to make this approach valid).
>
> So, to my mind, the goal isn't to avoid confusion, it's to have the
> specification match current reality.
>
> Regards,
> Ed
>
> [1] https://github.com/apache/parquet-java/pull/1369
>
> On 6/25/24 6:33 AM, Andrew Lamb wrote:
> > If the goal of this exercise is to avoid confusion, I agree with Michah
> > that updating parquet.thrift is best. Here [3] is a PR to update the
> thrift
> > file to clarify that the field is not written by all writers and is not
> > read by many.
> >
> > In my opinion any backwards incompatible changes do nothing other risk
> > making parquet files less compatible with the ecosystem
> >
> > While removing the field is a technically more elegant solution (would
> make
> > code cleaner), it could only cause potential incompatibilities for
> users. I
> > prefer to have more complex code but a better user experience.
> >
> > BTW the Rust parquet writer sets the file_offset field[1] but does not
> > appear to use it on read. Instead it assumes column_metadata is
> present[2]
> >
> > Andrew
> >
> > [1]:
> >
> https://github.com/apache/arrow-rs/blob/b3f06f6cc4d4f4431a1f86cfc9f30de3a1fc1a1b/parquet/src/column/writer/mod.rs#L904-L907
> > [2]:
> >
> https://github.com/apache/arrow-rs/blob/ed018a34d996590544fe5e833cb601bf46e9758e/parquet/src/file/metadata.rs#L673-L672
> > [3]: https://github.com/apache/parquet-format/pull/439
> >
> >
> > On Tue, Jun 25, 2024 at 4:40 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> >> We need a mechanism to remove fields. Typically this would involve some
> >> time horizon.
> >>
> >> I suggest we establish a deprecation horizon now, say 3y, and start the
> >> clocks ticking. Plus some convention for marking deprecated fields
> because
> >> the thrift IDL lacks a way to do this in code. I propose the annotation
> `//
> >> DEPRECATED-EOL-20270421` followed by a description on what happens in
> the
> >> interim. For example for this field:
> >>
> >> ```
> >>    /** Byte offset in file_path to the ColumnMetaData **/
> >>    // DEPRECATED-EOL-20270421
> >>    // Before 20240625 this field was required. Since then it was made
> >> optional. New writers MUST write it util EOL to support old readers.
> >>    2: optional i64 file_offset
> >> ```
> >>
> >>
> >> On Tue, Jun 25, 2024 at 9:23 AM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >>>> but I'm not clear on how that will
> >>>> impact existing parsers.
> >>>
> >>> This can break older parsers, that validate required fields are in fact
> >>> present.  I think it would be best to just update documentation on the
> >>> current state of affairs, and let implementations update accordingly if
> >>> necessary.
> >>>
> >>> On Mon, Jun 24, 2024 at 3:21 PM Ed Seidl <etse...@live.com> wrote:
> >>>
> >>>> Resurrecting a thread from earlier in the month regarding inconsistent
> >>>> use of the file_offset field [1][2]. It seems like the preferred path
> >>>> forward is to deprecate this (AFAICT) unused field to prevent further
> >>>> confusion. If there are no violent objections, I'll submit a PR to do
> >> so
> >>>> in a few days.
> >>>>
> >>>> One question I have, though, is how to handle the requiredness of the
> >>>> file_offset (currently required) and meta_data (currently optional)
> >>>> fields in ColumnChunk. I'd prefer to switch them, and make file_offset
> >>>> optional and meta_data required, but I'm not clear on how that will
> >>>> impact existing parsers. I believe most (all) implementations ignore
> >>>> file_offset anyway, and expect meta_data to be present, so maybe this
> >> is
> >>>> a non-issue.
> >>>>
> >>>> Thanks,
> >>>> Ed
> >>>>
> >>>> [1] https://lists.apache.org/thread/q5r43ks61q4wcbvwsk1jyw4h30fvg68t
> >>>> [2] https://github.com/apache/parquet-java/pull/1369
> >>>>
>
>

Reply via email to