Re: Parquet File Meta Data & Compatibility

Gabor Szadovszky Tue, 08 Dec 2020 04:26:34 -0800

Without discussing further what should be included in the list of core
features I would propose a framing of this idea.


Instead of commenting the different objects in the thrift file (or in the
other related documentations) for being "experimental" or so I think it is
more clear to have a separate doc in the parquet-format repo that defines
the core features. (The thrift file is more about listing all the features
we are able to use.)
This new documentation is versioned by the parquet-format releases so we
have the exact list of core features for each release.

We can also introduce a new (optional) field in the thrift file (e.g.
complianceLevel) where any implementation can write the parquet-format
release which core features are implemented. (Using a separate field
instead of an additional key-value makes it easier to reference the core
features documentation from the thrift file.)

If you like this idea I am happy to create a jira and a PR where we can
start/continue the discussion about the core features themselves.


On Mon, Dec 7, 2020 at 9:00 PM Tim Armstrong
<[email protected]> wrote:

> > Introducing new logical types as "experimental" is a bit tricky.
> Maybe experimental is a bad term. I think mostly new features in the format
> do need to be backwards compatible and not buggy because data lasts a long
> time once it's written. Maybe "incubating" or "preview" is a better term. I
> guess it's important that a) they are not automatically enabled so users
> don't accidentally write out data other tools can't read and b) that the
> format is clear that readers implementing version x don't need to be able
> to read data that uses incubating features from version x in order to be
> compliant.
>
> >One more thing about the core features. If we finally agree on a set of
> > these how do we put new features in after a while? I mean how do we
> > "version" the list of core features? Semantic versioning can be an
> obvious
> > choice and we can even use the version of parquet-format. In this case we
> > enforce the implementers to implement all the new core features if they
> > want to keep track of the versions. Is that what we want?
> I think that would be an improvement from my point of view if we released a
> new version of parquet-format with fewer core features but clear wording
> that the core features *must* be implemented. That could add a new stick
> for implementors - they are not in compliance with the standard if they
> can't read all core feature or if they write non-compliant data. But
> there's also a carrot - you can claim to be in compliance with the standard
> and you don't need to deal with as many edge cases from other writers.
>
> E.g. for Impala it would be great for us to know exactly what our gaps are
> in reading, instead of having to add new support when it comes up. But I
> think we'd have to do some things to be better Parquet citizens, e.g.
> writing int64 instead of int96 timestamps by default.
>
>
> > Could you expand on this?
> I was probably overstating this. I took another look and it seems like it's
> probably still a defensible decision, but I'm not sure that it's the ideal
> choice. It gets decent compression, it's fast to encode/decode with SIMD
> and it's simpler than some alternatives. It was made 8 years ago and I'm
> not sure of the exact rationale. -
>
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
> .
>
>
> The paper it's taken from actually shows that some other encodings
> that handle exceptional values via patching (their PFOR schemes) achieve
> better compression on realistic data.
> https://people.csail.mit.edu/jshun/6886-s19/lectures/lecture19-1.pdf has a
> more accessible summary of this. I thought there had been some more
> improvements in the literature since 2012 but I wasn't able to find much so
> may have been misremembering. I did come across this paper about
> implementing the Parquet delta decoding more efficiently, but that didn't
> require changes to the format -
> https://newtraell.cs.uchicago.edu/files/ms_paper/hajiang.pdf.
>
> On Mon, Dec 7, 2020 at 7:57 AM Gabor Szadovszky <[email protected]> wrote:
>
> > I agree on separating the non widely used features to make the life of
> the
> > implementers easier and to improve compatibility between these
> > implementations.
> > Meanwhile, it is not always clear how to define the core features. For
> > example, the encryption feature will be released soon in parquet-mr and I
> > guess parquet-cpp as well (if not released yet). Would it be part of the
> > core features already?
> >
> > Another thing is the compression. Which one would be part of the core
> > features and which one would we like to drop (moving to experimental)?
> >
> > We can skip features like column indexes from the core ones as they are
> not
> > required to read the data written in the file. Meanwhile, if they are
> never
> > part of the core features we might generate a lot of files without the
> > related metadata hurting the read performance of the implementations
> where
> > the feature is implemented. (BTW for column indexes we already have the
> two
> > cross-tested implementations.)
> >
> > Introducing new logical types as "experimental" is a bit tricky. In
> > parquet-mr we do not give too much help to our consumers for logical
> types.
> > For example for a UUID logical type we still give the user a Binary
> object
> > instead of a java UUID. So waiting for other implementations does not
> > really make sense. I also don't think it would be a good idea to keep
> these
> > logical types "experimental" until some of our consumers start using
> them.
> > Nobody wants to start using and invest in experimental features.
> >
> > I think encodings are an easier topic as they aren't visible for the
> > outside (like the file schema). But we still have to list the ones we
> would
> > like to have in the "core features". From the parquet-mr point of view
> the
> > "V1" encodings are used widely while the "V2" ones are not, so I would
> move
> > the latter (RLE = 3, DELTA_BYTE_ARRAY = 7, DELTA_BINARY_PACKED = 5,
> > RLE_DICTIONARY = 8) to the experimental category.
> > (NOTE: For the parquet-mr implementation I would suggest to support the
> > experimental encodings by adding a property where the user can set the
> > encodings by column accepting the risk of being incompatible with other
> > implementations.)
> >
> > One more thing about the core features. If we finally agree on a set of
> > these how do we put new features in after a while? I mean how do we
> > "version" the list of core features? Semantic versioning can be an
> obvious
> > choice and we can even use the version of parquet-format. In this case we
> > enforce the implementers to implement all the new core features if they
> > want to keep track of the versions. Is that what we want?
> >
> >
> >
> > On Mon, Dec 7, 2020 at 6:16 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > I tried to get some level of clarification in a PR [1].  It kind of
> > stalled
> > > because we had further conversation on a sync call a while ago that I
> > have
> > > not had a chance to follow-up on.  I'm happy to revise if we can come
> up
> > > with some sort of consensus for experimental/non-experimental.
> > >
> > > The things that were decided on the sync call were:
> > > 1.  Does it make sense to deprecate data page v2.  It sounded like this
> > > isn't widely used and the main benefits are either unused [uncompressed
> > > rep/def level], or are already required from other features (e.g.
> record
> > > level alignment within a data page is required with indices IIUC).
> > > 2.  Potentially accepting that all encodings in V1 and V2 are
> acceptable
> > > for V1 data pages.
> > >
> > > I think there is a lot more conversation here.  But I agree with
> > everything
> > > said on this thread we more clarity on what is required/not required
> for
> > > implementations.
> > >
> > >  e.g. I don't think the delta encoding reflects current best practices
> in
> > > > the literature.
> > >
> > >
> > > Could you expand on this?
> > >
> > >
> > > [1] https://github.com/apache/parquet-format/pull/163
> > >
> > > On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <[email protected]>
> > wrote:
> > >
> > > > On Fri, 4 Dec 2020 11:21:58 -0800
> > > > Tim Armstrong
> > > > <[email protected]> wrote:
> > > > > I probably didn't say it very clearly, but my opinion as a consumer
> > of
> > > > the
> > > > > Parquet spec is that the format needs a reset where encodings,
> > logical
> > > > > types and other metadata that are not widely adopted are removed
> from
> > > the
> > > > > core spec and put in an "experimental" category from which we can
> > later
> > > > > promote things based on community consensus.
> > > >
> > > > I'm not a participant in this community but I'm one of the people who
> > > > maintain parquet-cpp as part the Arrow C++ codebase, and your
> proposal
> > > > sounds like a very good idea to me.
> > > >
> > > > Even the core parts of Parquet can be tricky to implement
> efficiently,
> > > > adding more and more features and encodings is IMHO very dangerous
> for
> > > > the format's portability.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > >
> > > > > On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <
> > [email protected]
> > > >
> > > > > wrote:
> > > > >
> > > > > > I think it would be good for the project to define a core set of
> > > > features
> > > > > > that a Parquet implementation must support to be able to
> correctly
> > > read
> > > > > > files all written by another compliant writer with the same
> > version.
> > > > > >
> > > > > > There are then additional extensions like page indices that are
> not
> > > > > > required to actually be able to read/write the files correctly.
> > > > > >
> > > > > > My understanding of how we got here is that the Parquet 2.0 spec
> > > > defined a
> > > > > > number of new features and encodings that were only implemented
> in
> > > > > > parquet-mr. Many of these have not gotten widely implemented, and
> > I'm
> > > > also
> > > > > > not sure that the encodings are particularly well designed, e.g.
> I
> > > > don't
> > > > > > think the delta encoding reflects current best practices in the
> > > > literature.
> > > > > >
> > > > > > E.g. in Impala we did not implement many of the new Parquet 2.0
> > > > encodings
> > > > > > because it was a significant amount of work and we had a lot of
> > > > competing
> > > > > > priorities. It's now a chicken and egg problem because there's
> > isn't
> > > > much
> > > > > > data encoded in that way out there and thus not much motivation
> to
> > > > invest
> > > > > > in the work. We've had some similar headaches with logic types
> and
> > > > > > encodings, e.g. where the standard allows several degrees of
> > freedom
> > > > in how
> > > > > > to encode DECIMALs, including ones that don't make much sense.
> > We've
> > > > wasted
> > > > > > significant time adding support for DECIMALs that were written in
> > > > strange
> > > > > > ways by another writer.
> > > > > >
> > > > > > In the meantime we did actually need to add other new features to
> > > > Parquet
> > > > > > like page indices, so moved forward with those, which is part of
> > how
> > > > it's
> > > > > > turned into a bit of a matrix.
> > > > > >
> > > > > > My suggestion is to go through the encodings and features and
> flag
> > > the
> > > > > > ones that are not universally implemented as "EXPERIMENTAL" or
> > > > something
> > > > > > similar, and make it clear that they are optional and may be
> > > > deprecated if
> > > > > > they don't get adopted. I think that would give implementations a
> > > > cleaner
> > > > > > baseline to compare against, then we can start adding core
> features
> > > on
> > > > top
> > > > > > of that.
> > > > > >
> > > > > > IMO we shouldn't promote features to being part of the core spec
> > > until
> > > > > > there are at least 2 reference implementations that work with
> each
> > > > other.
> > > > > > That way we can use the reference implementations to clarify any
> > bugs
> > > > in
> > > > > > the spec.
> > > > > >
> > > > > > This is all kinda hard and a little thankless so I don't blame
> > anyone
> > > > for
> > > > > > not taking it on yet!
> > > > > >
> > > > > > > * The thrift definition says that this is "Byte offset in
> > file_path
> > > > to
> > > > > > the ColumnMetaData" but this doesn't make too much sense, as the
> > > > > > ColumnMetaData is contained inside the ColumnChunk itself (and
> > > > therefore,
> > > > > > you cannot even know its offset when writing the column chunk).
> > > > > > It probably makes sense to update the spec to reflect what actual
> > > > > > implementations do.
> > > > > >
> > > > > > > * This is defined to be int16_t, which can quickly overflow for
> > > > very
> > > > > > large parquet files or smaller row groups. What should I do if I
> > > > anticipate
> > > > > > that my library will be used to write files where this will
> > overflow?
> > > > Just
> > > > > > use it for the first 2^15 row groups and then leave it out? Or
> > don't
> > > > write
> > > > > > it at all for any row group?
> > > > > >
> > > > > > 16k+ row groups in a file seems goofy to me. I can't think of a
> way
> > > > that
> > > > > > you could hit that limit with a reasonable file layout. Row
> groups
> > > are
> > > > > > meant to be big enough that you amortise the metadata overhead
> and
> > > get
> > > > good
> > > > > > compression over a column chunk. I'd expect row groups to be 10s
> of
> > > MB
> > > > at
> > > > > > least typically. Maybe this would be something that should be
> > > > clarified in
> > > > > > the spec or other docs.
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <
> > > [email protected]
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> >
> > > > > >> > IMHO, shouldn't the spec mention - quite precisely - what
> > versions
> > > > exist
> > > > > >> > and what features can be used in which version, so an
> > > > implementation can
> > > > > >> > say "yes, I can fully write this versions" or "no, I can't"
> > > instead
> > > > of
> > > > > >> > having a fuzzy set of features where some are described to
> "not
> > > > work on
> > > > > >> > most readers". Even if such a precise mapping isn't defined
> > today,
> > > > > >> > shouldn't it be defined at least in retrospect, so that
> > > > implementations
> > > > > >> can
> > > > > >> > start checking and documenting, what versions they can write?
> > > > > >>
> > > > > >>
> > > > > >> Just an FYI I've already raised these concerns recently [1].
> But
> > I
> > > > > >> completely agree this is needed, I'd offer to update the docs,
> > but I
> > > > don't
> > > > > >> really have the historical context to do it.
> > > > > >>
> > > > > >> [1]
> > > > > >>
> > > > > >>
> > > >
> > >
> >
> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E
> > > > > >>
> > > > > >> On Fri, Oct 16, 2020 at 3:27 PM Jan Finis
> > > <[email protected]
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hey folks,
> > > > > >> >
> > > > > >> > First of all, thanks for this great project!
> > > > > >> >
> > > > > >> > I am currently writing a library for reading/writing parquet,
> > and
> > > I
> > > > am a
> > > > > >> > bit confused by some points, which I would like to discuss
> > here. I
> > > > think
> > > > > >> > they will be relevant to anyone wanting to write own parquet
> > > > > >> > reading/writing logic in a new language.
> > > > > >> >
> > > > > >> >
> > > > > >> > MetaData:
> > > > > >> >
> > > > > >> > There are some fields in the metadata whose semantics is
> > unclear.
> > > > Can
> > > > > >> you
> > > > > >> > clarify this:
> > > > > >> >
> > > > > >> > ColumnChunk.file_offset:
> > > > > >> >
> > > > > >> > * The thrift definition says that this is "Byte offset in
> > > file_path
> > > > to
> > > > > >> the
> > > > > >> > ColumnMetaData" but this doesn't make too much sense, as the
> > > > > >> ColumnMetaData
> > > > > >> > is contained inside the ColumnChunk itself (and therefore, you
> > > > cannot
> > > > > >> even
> > > > > >> > know its offset when writing the column chunk).
> > > > > >> > parquet-mr seems to just write the offset of the first
> data/dict
> > > > page
> > > > > >> into
> > > > > >> > this field, which doesn't seem to comply with what the spec
> says
> > > > (but
> > > > > >> is at
> > > > > >> > least possible). What is this field supposed to be used for?
> My
> > > > reader
> > > > > >> just
> > > > > >> > ignores it, but my writer should make sure to write the most
> > > > sensible
> > > > > >> value
> > > > > >> > in here, in case some reader relies on it.
> > > > > >> >
> > > > > >> > RowGroup.ordinal:
> > > > > >> >
> > > > > >> > * This seems to be just the ordinal of the row group. However,
> > > this
> > > > > >> > information seems redundant, as the row-groups meta data is
> > > > already
> > > > > >> stored
> > > > > >> > in a specific order in the footer. Should the ordinal just be
> > > equal
> > > > to
> > > > > >> that
> > > > > >> > order, or can it differ?
> > > > > >> > * Will any reader rely on this, since it's optional?
> > > > > >> > * This is defined to be int16_t, which can quickly overflow
> for
> > > > very
> > > > > >> large
> > > > > >> > parquet files or smaller row groups. What should I do if I
> > > > anticipate
> > > > > >> that
> > > > > >> > my library will be used to write files where this will
> overflow?
> > > > Just
> > > > > >> use
> > > > > >> > it for the first 2^15 row groups and then leave it out? Or
> don't
> > > > write
> > > > > >> it
> > > > > >> > at all for any row group?
> > > > > >> >
> > > > > >> > Compatibility:
> > > > > >> >
> > > > > >> >
> > > > > >> >   1.  For writing: What compatibility versions are there? The
> > spec
> > > > > >> talks a
> > > > > >> > lot about compatibility and features that not all readers can
> > > read,
> > > > but
> > > > > >> it
> > > > > >> > never specifies things like "this encoding is version X.Y and
> > > > upwards".
> > > > > >> > So, when writing a parquet file, I have some problems in
> > choosing
> > > > > >> features.
> > > > > >> > E.g.,
> > > > > >> > * should I write DataPageV1 or DataPageV2?
> > > > > >> > * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
> > > > > >> > * Should I use BYTE_STREAM_SPLIT?
> > > > > >> >
> > > > > >> >   2.  For reading: I have implemented all encodings except
> > > > BIT_PACKED,
> > > > > >> > which seems deprecated for a long time (and would require all
> > the
> > > > > >> > bitunpack-on-big-endian logic, which would be a lot of work).
> > How
> > > > safe
> > > > > >> can
> > > > > >> > I be that this encoding is no longer used? When was it last
> > used?
> > > > Since
> > > > > >> > when is it deprecated?
> > > > > >> >
> > > > > >> >
> > > > > >> > IMHO, shouldn't the spec mention - quite precisely - what
> > versions
> > > > exist
> > > > > >> > and what features can be used in which version, so an
> > > > implementation can
> > > > > >> > say "yes, I can fully write this versions" or "no, I can't"
> > > instead
> > > > of
> > > > > >> > having a fuzzy set of features where some are described to
> "not
> > > > work on
> > > > > >> > most readers". Even if such a precise mapping isn't defined
> > today,
> > > > > >> > shouldn't it be defined at least in retrospect, so that
> > > > implementations
> > > > > >> can
> > > > > >> > start checking and documenting, what versions they can write?
> > > > > >> >
> > > > > >> > To give you an example where I already encountered this:
> > > parquet-cpp
> > > > > >> > doesn't seem to be able to read DELTA_BINARY_PACKED yet, so
> any
> > > > software
> > > > > >> > built on top of it (e.g., pyarrow) cannot read such files. I
> > > > created
> > > > > >> such a
> > > > > >> > file with parquet-mr and was very surprised when pyarrow
> > couldn't
> > > > read
> > > > > >> it.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > Source of truth:
> > > > > >> >
> > > > > >> > In general, what is the agreed-upon source of truth for
> parquet?
> > > Is
> > > > it
> > > > > >> the
> > > > > >> > documents parquet-format, or is it the implementation in
> > > > parquet-mr?
> > > > > >> These
> > > > > >> > differ sometimes, so which one should I adhere to if they do?
> > > > > >> >
> > > > > >> >
> > > > > >> > Thanks in advance for any answers.
> > > > > >> > Cheers,
> > > > > >> > Jan
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Re: Parquet File Meta Data & Compatibility

Reply via email to