Re: Parquet File Meta Data & Compatibility

Micah Kornfield Sun, 06 Dec 2020 21:16:39 -0800

I tried to get some level of clarification in a PR [1].  It kind of stalled
because we had further conversation on a sync call a while ago that I have
not had a chance to follow-up on.  I'm happy to revise if we can come up
with some sort of consensus for experimental/non-experimental.


The things that were decided on the sync call were:
1.  Does it make sense to deprecate data page v2.  It sounded like this
isn't widely used and the main benefits are either unused [uncompressed
rep/def level], or are already required from other features (e.g. record
level alignment within a data page is required with indices IIUC).
2.  Potentially accepting that all encodings in V1 and V2 are acceptable
for V1 data pages.

I think there is a lot more conversation here.  But I agree with everything
said on this thread we more clarity on what is required/not required for
implementations.

 e.g. I don't think the delta encoding reflects current best practices in
> the literature.


Could you expand on this?


[1] https://github.com/apache/parquet-format/pull/163

On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <[email protected]> wrote:

> On Fri, 4 Dec 2020 11:21:58 -0800
> Tim Armstrong
> <[email protected]> wrote:
> > I probably didn't say it very clearly, but my opinion as a consumer of
> the
> > Parquet spec is that the format needs a reset where encodings, logical
> > types and other metadata that are not widely adopted are removed from the
> > core spec and put in an "experimental" category from which we can later
> > promote things based on community consensus.
>
> I'm not a participant in this community but I'm one of the people who
> maintain parquet-cpp as part the Arrow C++ codebase, and your proposal
> sounds like a very good idea to me.
>
> Even the core parts of Parquet can be tricky to implement efficiently,
> adding more and more features and encodings is IMHO very dangerous for
> the format's portability.
>
> Regards
>
> Antoine.
>
>
> >
> > On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <[email protected]>
> > wrote:
> >
> > > I think it would be good for the project to define a core set of
> features
> > > that a Parquet implementation must support to be able to correctly read
> > > files all written by another compliant writer with the same version.
> > >
> > > There are then additional extensions like page indices that are not
> > > required to actually be able to read/write the files correctly.
> > >
> > > My understanding of how we got here is that the Parquet 2.0 spec
> defined a
> > > number of new features and encodings that were only implemented in
> > > parquet-mr. Many of these have not gotten widely implemented, and I'm
> also
> > > not sure that the encodings are particularly well designed, e.g. I
> don't
> > > think the delta encoding reflects current best practices in the
> literature.
> > >
> > > E.g. in Impala we did not implement many of the new Parquet 2.0
> encodings
> > > because it was a significant amount of work and we had a lot of
> competing
> > > priorities. It's now a chicken and egg problem because there's isn't
> much
> > > data encoded in that way out there and thus not much motivation to
> invest
> > > in the work. We've had some similar headaches with logic types and
> > > encodings, e.g. where the standard allows several degrees of freedom
> in how
> > > to encode DECIMALs, including ones that don't make much sense. We've
> wasted
> > > significant time adding support for DECIMALs that were written in
> strange
> > > ways by another writer.
> > >
> > > In the meantime we did actually need to add other new features to
> Parquet
> > > like page indices, so moved forward with those, which is part of how
> it's
> > > turned into a bit of a matrix.
> > >
> > > My suggestion is to go through the encodings and features and flag the
> > > ones that are not universally implemented as "EXPERIMENTAL" or
> something
> > > similar, and make it clear that they are optional and may be
> deprecated if
> > > they don't get adopted. I think that would give implementations a
> cleaner
> > > baseline to compare against, then we can start adding core features on
> top
> > > of that.
> > >
> > > IMO we shouldn't promote features to being part of the core spec until
> > > there are at least 2 reference implementations that work with each
> other.
> > > That way we can use the reference implementations to clarify any bugs
> in
> > > the spec.
> > >
> > > This is all kinda hard and a little thankless so I don't blame anyone
> for
> > > not taking it on yet!
> > >
> > > > * The thrift definition says that this is "Byte offset in file_path
> to
> > > the ColumnMetaData" but this doesn't make too much sense, as the
> > > ColumnMetaData is contained inside the ColumnChunk itself (and
> therefore,
> > > you cannot even know its offset when writing the column chunk).
> > > It probably makes sense to update the spec to reflect what actual
> > > implementations do.
> > >
> > > > * This is defined to be int16_t, which can quickly overflow for
> very
> > > large parquet files or smaller row groups. What should I do if I
> anticipate
> > > that my library will be used to write files where this will overflow?
> Just
> > > use it for the first 2^15 row groups and then leave it out? Or don't
> write
> > > it at all for any row group?
> > >
> > > 16k+ row groups in a file seems goofy to me. I can't think of a way
> that
> > > you could hit that limit with a reasonable file layout. Row groups are
> > > meant to be big enough that you amortise the metadata overhead and get
> good
> > > compression over a column chunk. I'd expect row groups to be 10s of MB
> at
> > > least typically. Maybe this would be something that should be
> clarified in
> > > the spec or other docs.
> > >
> > >
> > > On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <[email protected]
> >
> > > wrote:
> > >
> > >> >
> > >> > IMHO, shouldn't the spec mention - quite precisely - what versions
> exist
> > >> > and what features can be used in which version, so an
> implementation can
> > >> > say "yes, I can fully write this versions" or "no, I can't" instead
> of
> > >> > having a fuzzy set of features where some are described to "not
> work on
> > >> > most readers". Even if such a precise mapping isn't defined today,
> > >> > shouldn't it be defined at least in retrospect, so that
> implementations
> > >> can
> > >> > start checking and documenting, what versions they can write?
> > >>
> > >>
> > >> Just an FYI I've already raised these concerns recently [1].  But I
> > >> completely agree this is needed, I'd offer to update the docs, but I
> don't
> > >> really have the historical context to do it.
> > >>
> > >> [1]
> > >>
> > >>
> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E
> > >>
> > >> On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <[email protected]
> >
> > >> wrote:
> > >>
> > >> > Hey folks,
> > >> >
> > >> > First of all, thanks for this great project!
> > >> >
> > >> > I am currently writing a library for reading/writing parquet, and I
> am a
> > >> > bit confused by some points, which I would like to discuss here. I
> think
> > >> > they will be relevant to anyone wanting to write own parquet
> > >> > reading/writing logic in a new language.
> > >> >
> > >> >
> > >> > MetaData:
> > >> >
> > >> > There are some fields in the metadata whose semantics is unclear.
> Can
> > >> you
> > >> > clarify this:
> > >> >
> > >> > ColumnChunk.file_offset:
> > >> >
> > >> > * The thrift definition says that this is "Byte offset in file_path
> to
> > >> the
> > >> > ColumnMetaData" but this doesn't make too much sense, as the
> > >> ColumnMetaData
> > >> > is contained inside the ColumnChunk itself (and therefore, you
> cannot
> > >> even
> > >> > know its offset when writing the column chunk).
> > >> > parquet-mr seems to just write the offset of the first data/dict
> page
> > >> into
> > >> > this field, which doesn't seem to comply with what the spec says
> (but
> > >> is at
> > >> > least possible). What is this field supposed to be used for? My
> reader
> > >> just
> > >> > ignores it, but my writer should make sure to write the most
> sensible
> > >> value
> > >> > in here, in case some reader relies on it.
> > >> >
> > >> > RowGroup.ordinal:
> > >> >
> > >> > * This seems to be just the ordinal of the row group. However, this
> > >> > information seems redundant, as the row-groups meta data is
> already
> > >> stored
> > >> > in a specific order in the footer. Should the ordinal just be equal
> to
> > >> that
> > >> > order, or can it differ?
> > >> > * Will any reader rely on this, since it's optional?
> > >> > * This is defined to be int16_t, which can quickly overflow for
> very
> > >> large
> > >> > parquet files or smaller row groups. What should I do if I
> anticipate
> > >> that
> > >> > my library will be used to write files where this will overflow?
> Just
> > >> use
> > >> > it for the first 2^15 row groups and then leave it out? Or don't
> write
> > >> it
> > >> > at all for any row group?
> > >> >
> > >> > Compatibility:
> > >> >
> > >> >
> > >> >   1.  For writing: What compatibility versions are there? The spec
> > >> talks a
> > >> > lot about compatibility and features that not all readers can read,
> but
> > >> it
> > >> > never specifies things like "this encoding is version X.Y and
> upwards".
> > >> > So, when writing a parquet file, I have some problems in choosing
> > >> features.
> > >> > E.g.,
> > >> > * should I write DataPageV1 or DataPageV2?
> > >> > * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
> > >> > * Should I use BYTE_STREAM_SPLIT?
> > >> >
> > >> >   2.  For reading: I have implemented all encodings except
> BIT_PACKED,
> > >> > which seems deprecated for a long time (and would require all the
> > >> > bitunpack-on-big-endian logic, which would be a lot of work). How
> safe
> > >> can
> > >> > I be that this encoding is no longer used? When was it last used?
> Since
> > >> > when is it deprecated?
> > >> >
> > >> >
> > >> > IMHO, shouldn't the spec mention - quite precisely - what versions
> exist
> > >> > and what features can be used in which version, so an
> implementation can
> > >> > say "yes, I can fully write this versions" or "no, I can't" instead
> of
> > >> > having a fuzzy set of features where some are described to "not
> work on
> > >> > most readers". Even if such a precise mapping isn't defined today,
> > >> > shouldn't it be defined at least in retrospect, so that
> implementations
> > >> can
> > >> > start checking and documenting, what versions they can write?
> > >> >
> > >> > To give you an example where I already encountered this: parquet-cpp
> > >> > doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any
> software
> > >> > built on top of it (e.g., pyarrow) cannot read such files. I
> created
> > >> such a
> > >> > file with parquet-mr and was very surprised when pyarrow couldn't
> read
> > >> it.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > Source of truth:
> > >> >
> > >> > In general, what is the agreed-upon source of truth for parquet? Is
> it
> > >> the
> > >> > documents parquet-format, or is it the implementation in
> parquet-mr?
> > >> These
> > >> > differ sometimes, so which one should I adhere to if they do?
> > >> >
> > >> >
> > >> > Thanks in advance for any answers.
> > >> > Cheers,
> > >> > Jan
> > >> >
> > >>
> > >
> >
>
>
>
>

Re: Parquet File Meta Data & Compatibility

Reply via email to