Re: Parquet File Meta Data & Compatibility

Gabor Szadovszky Mon, 07 Dec 2020 07:57:26 -0800

I agree on separating the non widely used features to make the life of the
implementers easier and to improve compatibility between these
implementations.
Meanwhile, it is not always clear how to define the core features. For
example, the encryption feature will be released soon in parquet-mr and I
guess parquet-cpp as well (if not released yet). Would it be part of the
core features already?


Another thing is the compression. Which one would be part of the core
features and which one would we like to drop (moving to experimental)?

We can skip features like column indexes from the core ones as they are not
required to read the data written in the file. Meanwhile, if they are never
part of the core features we might generate a lot of files without the
related metadata hurting the read performance of the implementations where
the feature is implemented. (BTW for column indexes we already have the two
cross-tested implementations.)

Introducing new logical types as "experimental" is a bit tricky. In
parquet-mr we do not give too much help to our consumers for logical types.
For example for a UUID logical type we still give the user a Binary object
instead of a java UUID. So waiting for other implementations does not
really make sense. I also don't think it would be a good idea to keep these
logical types "experimental" until some of our consumers start using them.
Nobody wants to start using and invest in experimental features.

I think encodings are an easier topic as they aren't visible for the
outside (like the file schema). But we still have to list the ones we would
like to have in the "core features". From the parquet-mr point of view the
"V1" encodings are used widely while the "V2" ones are not, so I would move
the latter (RLE = 3, DELTA_BYTE_ARRAY = 7, DELTA_BINARY_PACKED = 5,
RLE_DICTIONARY = 8) to the experimental category.
(NOTE: For the parquet-mr implementation I would suggest to support the
experimental encodings by adding a property where the user can set the
encodings by column accepting the risk of being incompatible with other
implementations.)

One more thing about the core features. If we finally agree on a set of
these how do we put new features in after a while? I mean how do we
"version" the list of core features? Semantic versioning can be an obvious
choice and we can even use the version of parquet-format. In this case we
enforce the implementers to implement all the new core features if they
want to keep track of the versions. Is that what we want?



On Mon, Dec 7, 2020 at 6:16 AM Micah Kornfield <[email protected]>
wrote:

> I tried to get some level of clarification in a PR [1].  It kind of stalled
> because we had further conversation on a sync call a while ago that I have
> not had a chance to follow-up on.  I'm happy to revise if we can come up
> with some sort of consensus for experimental/non-experimental.
>
> The things that were decided on the sync call were:
> 1.  Does it make sense to deprecate data page v2.  It sounded like this
> isn't widely used and the main benefits are either unused [uncompressed
> rep/def level], or are already required from other features (e.g. record
> level alignment within a data page is required with indices IIUC).
> 2.  Potentially accepting that all encodings in V1 and V2 are acceptable
> for V1 data pages.
>
> I think there is a lot more conversation here.  But I agree with everything
> said on this thread we more clarity on what is required/not required for
> implementations.
>
>  e.g. I don't think the delta encoding reflects current best practices in
> > the literature.
>
>
> Could you expand on this?
>
>
> [1] https://github.com/apache/parquet-format/pull/163
>
> On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <[email protected]> wrote:
>
> > On Fri, 4 Dec 2020 11:21:58 -0800
> > Tim Armstrong
> > <[email protected]> wrote:
> > > I probably didn't say it very clearly, but my opinion as a consumer of
> > the
> > > Parquet spec is that the format needs a reset where encodings, logical
> > > types and other metadata that are not widely adopted are removed from
> the
> > > core spec and put in an "experimental" category from which we can later
> > > promote things based on community consensus.
> >
> > I'm not a participant in this community but I'm one of the people who
> > maintain parquet-cpp as part the Arrow C++ codebase, and your proposal
> > sounds like a very good idea to me.
> >
> > Even the core parts of Parquet can be tricky to implement efficiently,
> > adding more and more features and encodings is IMHO very dangerous for
> > the format's portability.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <[email protected]
> >
> > > wrote:
> > >
> > > > I think it would be good for the project to define a core set of
> > features
> > > > that a Parquet implementation must support to be able to correctly
> read
> > > > files all written by another compliant writer with the same version.
> > > >
> > > > There are then additional extensions like page indices that are not
> > > > required to actually be able to read/write the files correctly.
> > > >
> > > > My understanding of how we got here is that the Parquet 2.0 spec
> > defined a
> > > > number of new features and encodings that were only implemented in
> > > > parquet-mr. Many of these have not gotten widely implemented, and I'm
> > also
> > > > not sure that the encodings are particularly well designed, e.g. I
> > don't
> > > > think the delta encoding reflects current best practices in the
> > literature.
> > > >
> > > > E.g. in Impala we did not implement many of the new Parquet 2.0
> > encodings
> > > > because it was a significant amount of work and we had a lot of
> > competing
> > > > priorities. It's now a chicken and egg problem because there's isn't
> > much
> > > > data encoded in that way out there and thus not much motivation to
> > invest
> > > > in the work. We've had some similar headaches with logic types and
> > > > encodings, e.g. where the standard allows several degrees of freedom
> > in how
> > > > to encode DECIMALs, including ones that don't make much sense. We've
> > wasted
> > > > significant time adding support for DECIMALs that were written in
> > strange
> > > > ways by another writer.
> > > >
> > > > In the meantime we did actually need to add other new features to
> > Parquet
> > > > like page indices, so moved forward with those, which is part of how
> > it's
> > > > turned into a bit of a matrix.
> > > >
> > > > My suggestion is to go through the encodings and features and flag
> the
> > > > ones that are not universally implemented as "EXPERIMENTAL" or
> > something
> > > > similar, and make it clear that they are optional and may be
> > deprecated if
> > > > they don't get adopted. I think that would give implementations a
> > cleaner
> > > > baseline to compare against, then we can start adding core features
> on
> > top
> > > > of that.
> > > >
> > > > IMO we shouldn't promote features to being part of the core spec
> until
> > > > there are at least 2 reference implementations that work with each
> > other.
> > > > That way we can use the reference implementations to clarify any bugs
> > in
> > > > the spec.
> > > >
> > > > This is all kinda hard and a little thankless so I don't blame anyone
> > for
> > > > not taking it on yet!
> > > >
> > > > > * The thrift definition says that this is "Byte offset in file_path
> > to
> > > > the ColumnMetaData" but this doesn't make too much sense, as the
> > > > ColumnMetaData is contained inside the ColumnChunk itself (and
> > therefore,
> > > > you cannot even know its offset when writing the column chunk).
> > > > It probably makes sense to update the spec to reflect what actual
> > > > implementations do.
> > > >
> > > > > * This is defined to be int16_t, which can quickly overflow for
> > very
> > > > large parquet files or smaller row groups. What should I do if I
> > anticipate
> > > > that my library will be used to write files where this will overflow?
> > Just
> > > > use it for the first 2^15 row groups and then leave it out? Or don't
> > write
> > > > it at all for any row group?
> > > >
> > > > 16k+ row groups in a file seems goofy to me. I can't think of a way
> > that
> > > > you could hit that limit with a reasonable file layout. Row groups
> are
> > > > meant to be big enough that you amortise the metadata overhead and
> get
> > good
> > > > compression over a column chunk. I'd expect row groups to be 10s of
> MB
> > at
> > > > least typically. Maybe this would be something that should be
> > clarified in
> > > > the spec or other docs.
> > > >
> > > >
> > > > On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <
> [email protected]
> > >
> > > > wrote:
> > > >
> > > >> >
> > > >> > IMHO, shouldn't the spec mention - quite precisely - what versions
> > exist
> > > >> > and what features can be used in which version, so an
> > implementation can
> > > >> > say "yes, I can fully write this versions" or "no, I can't"
> instead
> > of
> > > >> > having a fuzzy set of features where some are described to "not
> > work on
> > > >> > most readers". Even if such a precise mapping isn't defined today,
> > > >> > shouldn't it be defined at least in retrospect, so that
> > implementations
> > > >> can
> > > >> > start checking and documenting, what versions they can write?
> > > >>
> > > >>
> > > >> Just an FYI I've already raised these concerns recently [1].  But I
> > > >> completely agree this is needed, I'd offer to update the docs, but I
> > don't
> > > >> really have the historical context to do it.
> > > >>
> > > >> [1]
> > > >>
> > > >>
> >
> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E
> > > >>
> > > >> On Fri, Oct 16, 2020 at 3:27 PM Jan Finis
> <[email protected]
> > >
> > > >> wrote:
> > > >>
> > > >> > Hey folks,
> > > >> >
> > > >> > First of all, thanks for this great project!
> > > >> >
> > > >> > I am currently writing a library for reading/writing parquet, and
> I
> > am a
> > > >> > bit confused by some points, which I would like to discuss here. I
> > think
> > > >> > they will be relevant to anyone wanting to write own parquet
> > > >> > reading/writing logic in a new language.
> > > >> >
> > > >> >
> > > >> > MetaData:
> > > >> >
> > > >> > There are some fields in the metadata whose semantics is unclear.
> > Can
> > > >> you
> > > >> > clarify this:
> > > >> >
> > > >> > ColumnChunk.file_offset:
> > > >> >
> > > >> > * The thrift definition says that this is "Byte offset in
> file_path
> > to
> > > >> the
> > > >> > ColumnMetaData" but this doesn't make too much sense, as the
> > > >> ColumnMetaData
> > > >> > is contained inside the ColumnChunk itself (and therefore, you
> > cannot
> > > >> even
> > > >> > know its offset when writing the column chunk).
> > > >> > parquet-mr seems to just write the offset of the first data/dict
> > page
> > > >> into
> > > >> > this field, which doesn't seem to comply with what the spec says
> > (but
> > > >> is at
> > > >> > least possible). What is this field supposed to be used for? My
> > reader
> > > >> just
> > > >> > ignores it, but my writer should make sure to write the most
> > sensible
> > > >> value
> > > >> > in here, in case some reader relies on it.
> > > >> >
> > > >> > RowGroup.ordinal:
> > > >> >
> > > >> > * This seems to be just the ordinal of the row group. However,
> this
> > > >> > information seems redundant, as the row-groups meta data is
> > already
> > > >> stored
> > > >> > in a specific order in the footer. Should the ordinal just be
> equal
> > to
> > > >> that
> > > >> > order, or can it differ?
> > > >> > * Will any reader rely on this, since it's optional?
> > > >> > * This is defined to be int16_t, which can quickly overflow for
> > very
> > > >> large
> > > >> > parquet files or smaller row groups. What should I do if I
> > anticipate
> > > >> that
> > > >> > my library will be used to write files where this will overflow?
> > Just
> > > >> use
> > > >> > it for the first 2^15 row groups and then leave it out? Or don't
> > write
> > > >> it
> > > >> > at all for any row group?
> > > >> >
> > > >> > Compatibility:
> > > >> >
> > > >> >
> > > >> >   1.  For writing: What compatibility versions are there? The spec
> > > >> talks a
> > > >> > lot about compatibility and features that not all readers can
> read,
> > but
> > > >> it
> > > >> > never specifies things like "this encoding is version X.Y and
> > upwards".
> > > >> > So, when writing a parquet file, I have some problems in choosing
> > > >> features.
> > > >> > E.g.,
> > > >> > * should I write DataPageV1 or DataPageV2?
> > > >> > * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
> > > >> > * Should I use BYTE_STREAM_SPLIT?
> > > >> >
> > > >> >   2.  For reading: I have implemented all encodings except
> > BIT_PACKED,
> > > >> > which seems deprecated for a long time (and would require all the
> > > >> > bitunpack-on-big-endian logic, which would be a lot of work). How
> > safe
> > > >> can
> > > >> > I be that this encoding is no longer used? When was it last used?
> > Since
> > > >> > when is it deprecated?
> > > >> >
> > > >> >
> > > >> > IMHO, shouldn't the spec mention - quite precisely - what versions
> > exist
> > > >> > and what features can be used in which version, so an
> > implementation can
> > > >> > say "yes, I can fully write this versions" or "no, I can't"
> instead
> > of
> > > >> > having a fuzzy set of features where some are described to "not
> > work on
> > > >> > most readers". Even if such a precise mapping isn't defined today,
> > > >> > shouldn't it be defined at least in retrospect, so that
> > implementations
> > > >> can
> > > >> > start checking and documenting, what versions they can write?
> > > >> >
> > > >> > To give you an example where I already encountered this:
> parquet-cpp
> > > >> > doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any
> > software
> > > >> > built on top of it (e.g., pyarrow) cannot read such files. I
> > created
> > > >> such a
> > > >> > file with parquet-mr and was very surprised when pyarrow couldn't
> > read
> > > >> it.
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > Source of truth:
> > > >> >
> > > >> > In general, what is the agreed-upon source of truth for parquet?
> Is
> > it
> > > >> the
> > > >> > documents parquet-format, or is it the implementation in
> > parquet-mr?
> > > >> These
> > > >> > differ sometimes, so which one should I adhere to if they do?
> > > >> >
> > > >> >
> > > >> > Thanks in advance for any answers.
> > > >> > Cheers,
> > > >> > Jan
> > > >> >
> > > >>
> > > >
> > >
> >
> >
> >
> >
>

Re: Parquet File Meta Data & Compatibility

Reply via email to