Re: Parquet File Meta Data & Compatibility

Tim Armstrong Mon, 07 Dec 2020 12:00:25 -0800

> Introducing new logical types as "experimental" is a bit tricky.
Maybe experimental is a bad term. I think mostly new features in the format
do need to be backwards compatible and not buggy because data lasts a long
time once it's written. Maybe "incubating" or "preview" is a better term. I
guess it's important that a) they are not automatically enabled so users
don't accidentally write out data other tools can't read and b) that the
format is clear that readers implementing version x don't need to be able
to read data that uses incubating features from version x in order to be
compliant.


>One more thing about the core features. If we finally agree on a set of
> these how do we put new features in after a while? I mean how do we
> "version" the list of core features? Semantic versioning can be an obvious
> choice and we can even use the version of parquet-format. In this case we
> enforce the implementers to implement all the new core features if they
> want to keep track of the versions. Is that what we want?
I think that would be an improvement from my point of view if we released a
new version of parquet-format with fewer core features but clear wording
that the core features *must* be implemented. That could add a new stick
for implementors - they are not in compliance with the standard if they
can't read all core feature or if they write non-compliant data. But
there's also a carrot - you can claim to be in compliance with the standard
and you don't need to deal with as many edge cases from other writers.

E.g. for Impala it would be great for us to know exactly what our gaps are
in reading, instead of having to add new support when it comes up. But I
think we'd have to do some things to be better Parquet citizens, e.g.
writing int64 instead of int96 timestamps by default.


> Could you expand on this?
I was probably overstating this. I took another look and it seems like it's
probably still a defensible decision, but I'm not sure that it's the ideal
choice. It gets decent compression, it's fast to encode/decode with SIMD
and it's simpler than some alternatives. It was made 8 years ago and I'm
not sure of the exact rationale. -
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5.


The paper it's taken from actually shows that some other encodings
that handle exceptional values via patching (their PFOR schemes) achieve
better compression on realistic data.
https://people.csail.mit.edu/jshun/6886-s19/lectures/lecture19-1.pdf has a
more accessible summary of this. I thought there had been some more
improvements in the literature since 2012 but I wasn't able to find much so
may have been misremembering. I did come across this paper about
implementing the Parquet delta decoding more efficiently, but that didn't
require changes to the format -
https://newtraell.cs.uchicago.edu/files/ms_paper/hajiang.pdf.

On Mon, Dec 7, 2020 at 7:57 AM Gabor Szadovszky <[email protected]> wrote:

> I agree on separating the non widely used features to make the life of the
> implementers easier and to improve compatibility between these
> implementations.
> Meanwhile, it is not always clear how to define the core features. For
> example, the encryption feature will be released soon in parquet-mr and I
> guess parquet-cpp as well (if not released yet). Would it be part of the
> core features already?
>
> Another thing is the compression. Which one would be part of the core
> features and which one would we like to drop (moving to experimental)?
>
> We can skip features like column indexes from the core ones as they are not
> required to read the data written in the file. Meanwhile, if they are never
> part of the core features we might generate a lot of files without the
> related metadata hurting the read performance of the implementations where
> the feature is implemented. (BTW for column indexes we already have the two
> cross-tested implementations.)
>
> Introducing new logical types as "experimental" is a bit tricky. In
> parquet-mr we do not give too much help to our consumers for logical types.
> For example for a UUID logical type we still give the user a Binary object
> instead of a java UUID. So waiting for other implementations does not
> really make sense. I also don't think it would be a good idea to keep these
> logical types "experimental" until some of our consumers start using them.
> Nobody wants to start using and invest in experimental features.
>
> I think encodings are an easier topic as they aren't visible for the
> outside (like the file schema). But we still have to list the ones we would
> like to have in the "core features". From the parquet-mr point of view the
> "V1" encodings are used widely while the "V2" ones are not, so I would move
> the latter (RLE = 3, DELTA_BYTE_ARRAY = 7, DELTA_BINARY_PACKED = 5,
> RLE_DICTIONARY = 8) to the experimental category.
> (NOTE: For the parquet-mr implementation I would suggest to support the
> experimental encodings by adding a property where the user can set the
> encodings by column accepting the risk of being incompatible with other
> implementations.)
>
> One more thing about the core features. If we finally agree on a set of
> these how do we put new features in after a while? I mean how do we
> "version" the list of core features? Semantic versioning can be an obvious
> choice and we can even use the version of parquet-format. In this case we
> enforce the implementers to implement all the new core features if they
> want to keep track of the versions. Is that what we want?
>
>
>
> On Mon, Dec 7, 2020 at 6:16 AM Micah Kornfield <[email protected]>
> wrote:
>
> > I tried to get some level of clarification in a PR [1].  It kind of
> stalled
> > because we had further conversation on a sync call a while ago that I
> have
> > not had a chance to follow-up on.  I'm happy to revise if we can come up
> > with some sort of consensus for experimental/non-experimental.
> >
> > The things that were decided on the sync call were:
> > 1.  Does it make sense to deprecate data page v2.  It sounded like this
> > isn't widely used and the main benefits are either unused [uncompressed
> > rep/def level], or are already required from other features (e.g. record
> > level alignment within a data page is required with indices IIUC).
> > 2.  Potentially accepting that all encodings in V1 and V2 are acceptable
> > for V1 data pages.
> >
> > I think there is a lot more conversation here.  But I agree with
> everything
> > said on this thread we more clarity on what is required/not required for
> > implementations.
> >
> >  e.g. I don't think the delta encoding reflects current best practices in
> > > the literature.
> >
> >
> > Could you expand on this?
> >
> >
> > [1] https://github.com/apache/parquet-format/pull/163
> >
> > On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <[email protected]>
> wrote:
> >
> > > On Fri, 4 Dec 2020 11:21:58 -0800
> > > Tim Armstrong
> > > <[email protected]> wrote:
> > > > I probably didn't say it very clearly, but my opinion as a consumer
> of
> > > the
> > > > Parquet spec is that the format needs a reset where encodings,
> logical
> > > > types and other metadata that are not widely adopted are removed from
> > the
> > > > core spec and put in an "experimental" category from which we can
> later
> > > > promote things based on community consensus.
> > >
> > > I'm not a participant in this community but I'm one of the people who
> > > maintain parquet-cpp as part the Arrow C++ codebase, and your proposal
> > > sounds like a very good idea to me.
> > >
> > > Even the core parts of Parquet can be tricky to implement efficiently,
> > > adding more and more features and encodings is IMHO very dangerous for
> > > the format's portability.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > >
> > > > On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <
> [email protected]
> > >
> > > > wrote:
> > > >
> > > > > I think it would be good for the project to define a core set of
> > > features
> > > > > that a Parquet implementation must support to be able to correctly
> > read
> > > > > files all written by another compliant writer with the same
> version.
> > > > >
> > > > > There are then additional extensions like page indices that are not
> > > > > required to actually be able to read/write the files correctly.
> > > > >
> > > > > My understanding of how we got here is that the Parquet 2.0 spec
> > > defined a
> > > > > number of new features and encodings that were only implemented in
> > > > > parquet-mr. Many of these have not gotten widely implemented, and
> I'm
> > > also
> > > > > not sure that the encodings are particularly well designed, e.g. I
> > > don't
> > > > > think the delta encoding reflects current best practices in the
> > > literature.
> > > > >
> > > > > E.g. in Impala we did not implement many of the new Parquet 2.0
> > > encodings
> > > > > because it was a significant amount of work and we had a lot of
> > > competing
> > > > > priorities. It's now a chicken and egg problem because there's
> isn't
> > > much
> > > > > data encoded in that way out there and thus not much motivation to
> > > invest
> > > > > in the work. We've had some similar headaches with logic types and
> > > > > encodings, e.g. where the standard allows several degrees of
> freedom
> > > in how
> > > > > to encode DECIMALs, including ones that don't make much sense.
> We've
> > > wasted
> > > > > significant time adding support for DECIMALs that were written in
> > > strange
> > > > > ways by another writer.
> > > > >
> > > > > In the meantime we did actually need to add other new features to
> > > Parquet
> > > > > like page indices, so moved forward with those, which is part of
> how
> > > it's
> > > > > turned into a bit of a matrix.
> > > > >
> > > > > My suggestion is to go through the encodings and features and flag
> > the
> > > > > ones that are not universally implemented as "EXPERIMENTAL" or
> > > something
> > > > > similar, and make it clear that they are optional and may be
> > > deprecated if
> > > > > they don't get adopted. I think that would give implementations a
> > > cleaner
> > > > > baseline to compare against, then we can start adding core features
> > on
> > > top
> > > > > of that.
> > > > >
> > > > > IMO we shouldn't promote features to being part of the core spec
> > until
> > > > > there are at least 2 reference implementations that work with each
> > > other.
> > > > > That way we can use the reference implementations to clarify any
> bugs
> > > in
> > > > > the spec.
> > > > >
> > > > > This is all kinda hard and a little thankless so I don't blame
> anyone
> > > for
> > > > > not taking it on yet!
> > > > >
> > > > > > * The thrift definition says that this is "Byte offset in
> file_path
> > > to
> > > > > the ColumnMetaData" but this doesn't make too much sense, as the
> > > > > ColumnMetaData is contained inside the ColumnChunk itself (and
> > > therefore,
> > > > > you cannot even know its offset when writing the column chunk).
> > > > > It probably makes sense to update the spec to reflect what actual
> > > > > implementations do.
> > > > >
> > > > > > * This is defined to be int16_t, which can quickly overflow for
> > > very
> > > > > large parquet files or smaller row groups. What should I do if I
> > > anticipate
> > > > > that my library will be used to write files where this will
> overflow?
> > > Just
> > > > > use it for the first 2^15 row groups and then leave it out? Or
> don't
> > > write
> > > > > it at all for any row group?
> > > > >
> > > > > 16k+ row groups in a file seems goofy to me. I can't think of a way
> > > that
> > > > > you could hit that limit with a reasonable file layout. Row groups
> > are
> > > > > meant to be big enough that you amortise the metadata overhead and
> > get
> > > good
> > > > > compression over a column chunk. I'd expect row groups to be 10s of
> > MB
> > > at
> > > > > least typically. Maybe this would be something that should be
> > > clarified in
> > > > > the spec or other docs.
> > > > >
> > > > >
> > > > > On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <
> > [email protected]
> > > >
> > > > > wrote:
> > > > >
> > > > >> >
> > > > >> > IMHO, shouldn't the spec mention - quite precisely - what
> versions
> > > exist
> > > > >> > and what features can be used in which version, so an
> > > implementation can
> > > > >> > say "yes, I can fully write this versions" or "no, I can't"
> > instead
> > > of
> > > > >> > having a fuzzy set of features where some are described to "not
> > > work on
> > > > >> > most readers". Even if such a precise mapping isn't defined
> today,
> > > > >> > shouldn't it be defined at least in retrospect, so that
> > > implementations
> > > > >> can
> > > > >> > start checking and documenting, what versions they can write?
> > > > >>
> > > > >>
> > > > >> Just an FYI I've already raised these concerns recently [1].  But
> I
> > > > >> completely agree this is needed, I'd offer to update the docs,
> but I
> > > don't
> > > > >> really have the historical context to do it.
> > > > >>
> > > > >> [1]
> > > > >>
> > > > >>
> > >
> >
> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E
> > > > >>
> > > > >> On Fri, Oct 16, 2020 at 3:27 PM Jan Finis
> > <[email protected]
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hey folks,
> > > > >> >
> > > > >> > First of all, thanks for this great project!
> > > > >> >
> > > > >> > I am currently writing a library for reading/writing parquet,
> and
> > I
> > > am a
> > > > >> > bit confused by some points, which I would like to discuss
> here. I
> > > think
> > > > >> > they will be relevant to anyone wanting to write own parquet
> > > > >> > reading/writing logic in a new language.
> > > > >> >
> > > > >> >
> > > > >> > MetaData:
> > > > >> >
> > > > >> > There are some fields in the metadata whose semantics is
> unclear.
> > > Can
> > > > >> you
> > > > >> > clarify this:
> > > > >> >
> > > > >> > ColumnChunk.file_offset:
> > > > >> >
> > > > >> > * The thrift definition says that this is "Byte offset in
> > file_path
> > > to
> > > > >> the
> > > > >> > ColumnMetaData" but this doesn't make too much sense, as the
> > > > >> ColumnMetaData
> > > > >> > is contained inside the ColumnChunk itself (and therefore, you
> > > cannot
> > > > >> even
> > > > >> > know its offset when writing the column chunk).
> > > > >> > parquet-mr seems to just write the offset of the first data/dict
> > > page
> > > > >> into
> > > > >> > this field, which doesn't seem to comply with what the spec says
> > > (but
> > > > >> is at
> > > > >> > least possible). What is this field supposed to be used for? My
> > > reader
> > > > >> just
> > > > >> > ignores it, but my writer should make sure to write the most
> > > sensible
> > > > >> value
> > > > >> > in here, in case some reader relies on it.
> > > > >> >
> > > > >> > RowGroup.ordinal:
> > > > >> >
> > > > >> > * This seems to be just the ordinal of the row group. However,
> > this
> > > > >> > information seems redundant, as the row-groups meta data is
> > > already
> > > > >> stored
> > > > >> > in a specific order in the footer. Should the ordinal just be
> > equal
> > > to
> > > > >> that
> > > > >> > order, or can it differ?
> > > > >> > * Will any reader rely on this, since it's optional?
> > > > >> > * This is defined to be int16_t, which can quickly overflow for
> > > very
> > > > >> large
> > > > >> > parquet files or smaller row groups. What should I do if I
> > > anticipate
> > > > >> that
> > > > >> > my library will be used to write files where this will overflow?
> > > Just
> > > > >> use
> > > > >> > it for the first 2^15 row groups and then leave it out? Or don't
> > > write
> > > > >> it
> > > > >> > at all for any row group?
> > > > >> >
> > > > >> > Compatibility:
> > > > >> >
> > > > >> >
> > > > >> >   1.  For writing: What compatibility versions are there? The
> spec
> > > > >> talks a
> > > > >> > lot about compatibility and features that not all readers can
> > read,
> > > but
> > > > >> it
> > > > >> > never specifies things like "this encoding is version X.Y and
> > > upwards".
> > > > >> > So, when writing a parquet file, I have some problems in
> choosing
> > > > >> features.
> > > > >> > E.g.,
> > > > >> > * should I write DataPageV1 or DataPageV2?
> > > > >> > * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
> > > > >> > * Should I use BYTE_STREAM_SPLIT?
> > > > >> >
> > > > >> >   2.  For reading: I have implemented all encodings except
> > > BIT_PACKED,
> > > > >> > which seems deprecated for a long time (and would require all
> the
> > > > >> > bitunpack-on-big-endian logic, which would be a lot of work).
> How
> > > safe
> > > > >> can
> > > > >> > I be that this encoding is no longer used? When was it last
> used?
> > > Since
> > > > >> > when is it deprecated?
> > > > >> >
> > > > >> >
> > > > >> > IMHO, shouldn't the spec mention - quite precisely - what
> versions
> > > exist
> > > > >> > and what features can be used in which version, so an
> > > implementation can
> > > > >> > say "yes, I can fully write this versions" or "no, I can't"
> > instead
> > > of
> > > > >> > having a fuzzy set of features where some are described to "not
> > > work on
> > > > >> > most readers". Even if such a precise mapping isn't defined
> today,
> > > > >> > shouldn't it be defined at least in retrospect, so that
> > > implementations
> > > > >> can
> > > > >> > start checking and documenting, what versions they can write?
> > > > >> >
> > > > >> > To give you an example where I already encountered this:
> > parquet-cpp
> > > > >> > doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any
> > > software
> > > > >> > built on top of it (e.g., pyarrow) cannot read such files. I
> > > created
> > > > >> such a
> > > > >> > file with parquet-mr and was very surprised when pyarrow
> couldn't
> > > read
> > > > >> it.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > Source of truth:
> > > > >> >
> > > > >> > In general, what is the agreed-upon source of truth for parquet?
> > Is
> > > it
> > > > >> the
> > > > >> > documents parquet-format, or is it the implementation in
> > > parquet-mr?
> > > > >> These
> > > > >> > differ sometimes, so which one should I adhere to if they do?
> > > > >> >
> > > > >> >
> > > > >> > Thanks in advance for any answers.
> > > > >> > Cheers,
> > > > >> > Jan
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>

Re: Parquet File Meta Data & Compatibility

Reply via email to