Re: Current status of Data Page V2?

Gabor Szadovszky Mon, 26 Oct 2020 01:49:02 -0700

Hi Micah,

V2 pages are not only about the new encodings but a couple of other things
the community thought would be better than V1.
One of these improvements was to break pages at row boundaries. This one is
outdated because during the development of column-indexes we had to
implement the same for V1 pages.
Another improvement is to store the repetition and definition levels
separately so they can be read without decompressing data. I don't know if
we still think it has significant benefits but we never implemented
anything that makes use of it.
The default encodings also differ between V1 and V2 pages but they are
not properly specified. Also, the encodings are not tied to V1 or V2 by
design. It is only a matter of specification.
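
To illustrate the point about levels being stored separately, here is a
minimal Python sketch of a V2-style page layout. It is only a toy model, not
how parquet-mr implements it: the header fields are named after the
DataPageHeaderV2 struct in parquet.thrift, while build_v2_page, read_levels
and read_values are hypothetical helpers. The levels are treated as opaque
bytes (no RLE encoding) and zlib stands in for whatever codec the column uses.

```python
import zlib
from dataclasses import dataclass


@dataclass
class PageHeaderV2:
    # Subset of fields, named after DataPageHeaderV2 in parquet.thrift.
    num_values: int
    repetition_levels_byte_length: int
    definition_levels_byte_length: int
    is_compressed: bool


def build_v2_page(rep_levels: bytes, def_levels: bytes, values: bytes, num_values: int):
    """Lay out a toy V2-style page: uncompressed levels first, then compressed values."""
    header = PageHeaderV2(
        num_values=num_values,
        repetition_levels_byte_length=len(rep_levels),
        definition_levels_byte_length=len(def_levels),
        is_compressed=True,
    )
    body = rep_levels + def_levels + zlib.compress(values)
    return header, body


def read_levels(header: PageHeaderV2, body: bytes):
    """Slice out the level bytes without touching the compressed values."""
    rep_end = header.repetition_levels_byte_length
    def_end = rep_end + header.definition_levels_byte_length
    return body[:rep_end], body[rep_end:def_end]


def read_values(header: PageHeaderV2, body: bytes) -> bytes:
    """Decompress only the value section that follows the levels."""
    start = header.repetition_levels_byte_length + header.definition_levels_byte_length
    payload = body[start:]
    return zlib.decompress(payload) if header.is_compressed else payload
```

A reader that only needs the levels (for example to count nulls) can call
read_levels and never pay for decompression. In a V1 page the levels sit
inside the compressed block, so the whole page has to be decompressed first.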


I don't think that anything is blocking V2 pages from being introduced in
production (I believe there are a couple of users who already use V2 pages
in production) but I think it isn't worth the effort. As far as I know
parquet-mr is the only implementation that supports V2 pages.

I agree it is complicated to introduce new encodings and keep backward
compatibility but this is how compression codecs work currently. Supporting
compression codecs is not only about supporting a specific parquet-format
release or having a specific parquet-mr release but you also have to have
the codec implementation available. We might handle the encodings similarly:
there would be default ones that are guaranteed to work and others that
might not be supported by older readers.
But this is only an example. We might want to introduce some version levels
or compatibility groups. What I want to say is that there is no reason to tie
the new codecs to V2 pages. We should handle the introduction of the new codecs
(and handling their compatibility) and the V2 pages separately.
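
As a rough illustration of what such compatibility groups could look like,
here is a Python sketch. The split into BASELINE_ENCODINGS and
OPTIONAL_ENCODINGS is purely hypothetical (the spec defines no such groups
today), and unsupported_encodings is a made-up helper. Only the encoding
names themselves come from the Encoding enum in parquet.thrift.

```python
# Hypothetical grouping for illustration; the Parquet spec defines no such split.
BASELINE_ENCODINGS = frozenset({"PLAIN", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED"})
OPTIONAL_ENCODINGS = frozenset(
    {"DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY"}
)


def unsupported_encodings(column_encodings, reader_supported):
    """Return, per column, the encodings that this reader cannot decode."""
    problems = {}
    for column, encodings in column_encodings.items():
        missing = set(encodings) - reader_supported
        if missing:
            problems[column] = sorted(missing)
    return problems


# Encodings a file might declare per column chunk, independent of page version.
file_encodings = {
    "id": ["PLAIN", "RLE"],
    "name": ["DELTA_LENGTH_BYTE_ARRAY", "RLE"],
}

# An older reader that only knows the baseline group cannot read "name".
old_reader_problems = unsupported_encodings(file_encodings, BASELINE_ENCODINGS)

# A newer reader that also supports the optional group reads everything.
new_reader_problems = unsupported_encodings(
    file_encodings, BASELINE_ENCODINGS | OPTIONAL_ENCODINGS
)
```

The check looks only at the encodings, not at whether the pages are V1 or
V2, which is the separation I mean above.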

It would be nice to hear other opinions on this topic.

Regards,
Gabor

On Thu, Oct 22, 2020 at 5:54 PM Micah Kornfield <[email protected]>
wrote:

> Hi Gabor,
>
> > It is still not clear to me if we want to recommend V2 for production use
> > at all
>
> Again, I'm missing context here, but what is blocking V2 for production
> use?  Is it specification finalization, implementation finalization?
> Something else?
>
> or simply introduce the new encodings for V1.
>
>
> This doesn't seem simple to me.  It will break V1 readers that don't
> support these encodings. It also sounds like this is taking the frame of
> reference of parquet-mr, and not Parquet as a specification.  My main goal
> in this thread is to try to capture information, so we can get better
> specification docs (and to potentially help drive consensus on V2).
>
> There was another recent e-mail thread [1] that also pointed out our gap in
> explaining how versioning works and what is expected to be included in
> compliant reader/writers (if there are docs like this I apologize if I
> missed them and would appreciate pointers).
>
>  I would suggest
> > discussing this topic on the parquet sync next Tuesday.
>
>
> I will try to make it but I think I might have a conflict.  In this case I
> think continuing the email thread would be useful, so we have a good
> historical snapshot of the discussion (a lot of context gets lost in sync
> notes in my experience).
>
> [1]
>
> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CMWHPR03MB31842468BE51022392B7F2D1CE020%40MWHPR03MB3184.namprd03.prod.outlook.com%3E
>
>
> On Thursday, October 22, 2020, Gabor Szadovszky <[email protected]> wrote:
>
> > It is still not clear to me if we want to recommend V2 for production use
> > at all or simply introduce the new encodings for V1. I would suggest
> > discussing this topic on the parquet sync next Tuesday.
> >
> > On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > I've created https://github.com/apache/parquet-format/pull/163 to try
> to
> > > document these (note I really don't have historical context here so
> > please
> > > review carefully).
> > >
> > > I would appreciate it if someone could point me to a reference on what
> > the
> > > current status of V2 is?  What is left unsettled? When can we start
> > > recommending it for production use?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield <[email protected]
> >
> > > wrote:
> > >
> > > > I am not sure 2.0 means the v2 pages here. I think there was/is a bit
> > of
> > > >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe
> > the
> > > >> parquet-format releases are also part of it.
> > > >
> > > >
> > > > +1 to the confusion part.  The reason why I originally started this
> > > thread
> > > > is that none of this is entirely clear to me from existing
> > documentation.
> > > >
> > > > In particular it is confusing to me to say that the V2 Spec is not
> yet
> > > > finished when it looks like there have been multiple V2 Format
> > releases.
> > > >
> > > > It would be extremely useful to have documentation relating features
> > to:
> > > > 1.  The version of the spec they are part of
> > > > > 2.  Their current status in reference implementations
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > >
> > > > On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky <[email protected]>
> > > wrote:
> > > >
> > > >> I am not sure 2.0 means the v2 pages here. I think there was/is a
> bit
> > of
> > > >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe
> > the
> > > >> parquet-format releases are also part of it.
> > > >> In this table many features are not related to the pages so I don't
> > > think
> > > >> the "Expected release" meant the v1/v2 pages. I guess there was an
> > > earlier
> > > >> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages
> > were
> > > >> released in a 1.x release while 2.0 is not planned yet. (I was not
> in
> > > the
> > > >> community that time so I'm only guessing.)
> > > >>
> > > >> Also worth to mention that it seems to be not related to the
> > > >> parquet-format
> > > >> releases which means that based on the spec the implementations
> > were/are
> > > >> not limited by this table.
> > > >>
> > > >>
> > > >> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue <[email protected]
> >
> > > >> wrote:
> > > >>
> > > >> > I remembered that there used to be a table. Looks like it was
> > removed:
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> > > >> >
> > > >> > The table used to list delta as a 2.0 feature.
> > > >> >
> > > >> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <
> [email protected]>
> > > >> wrote:
> > > >> >
> > > >> > > That answer I wrote to the other thread was based on the current
> > > >> code. So
> > > >> > > that is how parquet-mr is working now. It does not mean though
> how
> > > >> shall
> > > >> > it
> > > >> > > work or how it works in other implementations. Unfortunately,
> the
> > > spec
> > > >> > does
> > > >> > > not say anything about v1 and v2 in the context of encodings.
> > > >> > > Meanwhile, enabling the "new" encodings in v1 may generate
> > > >> compatibility
> > > >> > > issues with other implementations. (I am not sure how would the
> > > >> existing
> > > >> > > releases of parquet-mr behave if they have to read v1 pages with
> > > these
> > > >> > > encodings but I believe it would work fine.)
> > > >> > >
> > > >> > > I think, it would be a good idea to keep the existing default
> > > >> behavior as
> > > >> > > is but introduce some new flags where the user may set/suggest
> > > >> encodings
> > > >> > > for the different columns. This way the user can hold the risk
> of
> > > >> being
> > > >> > > potentially incompatible with other implementations (for the
> time
> > > >> being)
> > > >> > > and also can fine tune the encodings for the data. This way we
> can
> > > >> also
> > > >> > > introduce some new encodings that are better in some cases (e.g.
> > > lossy
> > > >> > > compression for floating point numbers).
> > > >> > >
> > > > >> > > What do you guys think?
> > > > >> > > (I would be happy to help anyone who would like to contribute on this
> > > > >> > > topic.)
> > > >> > >
> > > >> > > Cheers,
> > > >> > > Gabor
> > > >> > >
> > > >> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <
> > [email protected]>
> > > >> > wrote:
> > > >> > >
> > > >> > > > Gabor seems to agree that delta is V2 only.
> > > >> > > >
> > > >> > > > To summarize, no delta encodings are used for V1 pages. They
> are
> > > >> > > available
> > > >> > > > > for V2 only.
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > https://www.mail-archive.com/[email protected]/msg11826.html
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <
> > [email protected]
> > > >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > > Good point. I had mentally categorized this as V2, not based
> > on
> > > >> the
> > > >> > > docs?
> > > >> > > > >
> > > >> > > > > I don't think most tools write this but I can't see anywhere
> > > that
> > > >> it
> > > >> > > says
> > > >> > > > > it is limited to v2 readers/writers. I'm not sure how many
> > tools
> > > >> > > > vectorize
> > > >> > > > > read it versus delegate to the legacy mr path (at least in
> > > java),
> > > >> > > either.
> > > >> > > > >
> > > >> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <
> > > >> > [email protected]>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > >> The big win in v2 pages (if I remember correctly) is that
> the
> > > >> > variable
> > > >> > > > >>> length encoding is no longer interleaved. That would
> > provide a
> > > >> big
> > > >> > > > >>> performance lift when pulling into arrow vectors (and
> > variable
> > > >> > length
> > > >> > > > >>> decoding typically dominates total read processing time,
> on
> > > >> average
> > > >> > > > I've
> > > >> > > > >>> seen 5-10x per cell cpu cost increase for variable reads
> > over
> > > >> > scalar
> > > >> > > > >>> reads). AFAIK, there is still no option for that in V1.
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> For Delta-length byte the documentation [1]   states "This
> > > >> encoding
> > > >> > is
> > > >> > > > >> always preferred over PLAIN for byte array columns." makes
> it
> > > >> sound
> > > >> > > like
> > > >> > > > >> this could be part of V1 or V2.   The encoding enum in the
> > > Thrift
> > > >> > file
> > > >> > > > [2]
> > > >> > > > >> doesn't seem to document this either.
> > > >> > > > >>
> > > >> > > > >> Is there clearer documentation for what encodings are
> > > considered
> > > >> > part
> > > >> > > of
> > > >> > > > >> v1 vs v2?
> > > >> > > > >>
> > > >> > > > >> Thanks,
> > > >> > > > >> Micah
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> [1]
> > > >> > > > >>
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > >> > > > >> [2]
> > > >> > > > >>
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> > > >> > > > >>
> > > >> > > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <
> > > >> [email protected]>
> > > >> > > > >> wrote:
> > > >> > > > >>
> > > >> > > > >>> The big win in v2 pages (if I remember correctly) is that
> > the
> > > >> > > variable
> > > >> > > > >>> length encoding is no longer interleaved. That would
> > provide a
> > > >> big
> > > >> > > > >>> performance lift when pulling into arrow vectors (and
> > variable
> > > >> > length
> > > >> > > > >>> decoding typically dominates total read processing time,
> on
> > > >> average
> > > >> > > > I've
> > > >> > > > >>> seen 5-10x per cell cpu cost increase for variable reads
> > over
> > > >> > scalar
> > > >> > > > >>> reads). AFAIK, there is still no option for that in V1.
> > > >> > > > >>>
> > > >> > > > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <
> > > >> > > [email protected]
> > > >> > > > >
> > > >> > > > >>> wrote:
> > > >> > > > >>>
> > > >> > > > >>>> Thanks for the quick reply Ryan.
> > > >> > > > >>>>
> > > >> > > > >>>>
> > > >> > > > >>>> > We only use v1 and it still works well. That said, I'd
> > love
> > > >> to
> > > >> > > make
> > > >> > > > >>>> some
> > > >> > > > >>>> > progress on better encodings and finalizing v2 so we
> can
> > > use
> > > >> > them!
> > > >> > > > >>>>
> > > >> > > > >>>>
> > > >> > > > >>>> Are there JIRAs or other documentation that is tracking
> > this
> > > >> work?
> > > >> > > > >>>>
> > > >> > > > >>>> Thanks,
> > > >> > > > >>>> Micah
> > > >> > > > >>>>
> > > >> > > > >>>>
> > > >> > > > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <
> > [email protected]
> > > >
> > > >> > > wrote:
> > > >> > > > >>>>
> > > >> > > > >>>> > While there isn't anything wrong with it, the same
> > > challenges
> > > >> > have
> > > >> > > > >>>> been
> > > >> > > > >>>> > solved in different ways with v1 pages. The main
> > difference
> > > >> is
> > > >> > > that
> > > >> > > > v2
> > > >> > > > >>>> > pages are broken at record boundaries, and v1 pages
> > weren't
> > > >> > > > >>>> guaranteed to
> > > >> > > > >>>> > be. But, in order to write page indexes near the
> footer,
> > > >> > breaking
> > > >> > > > >>>> pages at
> > > >> > > > >>>> > record boundaries is required. So you know if you have
> > page
> > > >> > > indexes,
> > > >> > > > >>>> you
> > > >> > > > >>>> > can actually use them to skip through pages safely.
> That
> > > >> removes
> > > >> > > > much
> > > >> > > > >>>> of
> > > >> > > > >>>> > the need for v2 pages.
> > > >> > > > >>>> >
> > > >> > > > >>>> > The main drawback to using v2 pages is that the v2 spec
> > is
> > > >> > > > >>>> unfinished, and
> > > >> > > > >>>> > I don't think there is a way to use just the new pages.
> > So
> > > >> you'd
> > > >> > > > >>>> possibly
> > > >> > > > >>>> > end up pulling in other beta features that probably
> > > >> shouldn't be
> > > >> > > > used
> > > >> > > > >>>> if
> > > >> > > > >>>> > you want to stick with what is required for
> compatibility
> > > >> across
> > > >> > > > >>>> > implementations.
> > > >> > > > >>>> >
> > > >> > > > >>>> > We only use v1 and it still works well. That said, I'd
> > love
> > > >> to
> > > >> > > make
> > > >> > > > >>>> some
> > > >> > > > >>>> > progress on better encodings and finalizing v2 so we
> can
> > > use
> > > >> > them!
> > > >> > > > >>>> >
> > > >> > > > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <
> > > >> > > > >>>> [email protected]>
> > > >> > > > >>>> > wrote:
> > > >> > > > >>>> >
> > > >> > > > >>>> >> What is the current status of support for Data Page
> V2?
> > > Is
> > > >> it
> > > >> > > > >>>> recommended
> > > >> > > > >>>> >> for production workloads?
> > > >> > > > >>>> >>
> > > >> > > > >>>> >> Thanks,
> > > >> > > > >>>> >> Micah
> > > >> > > > >>>> >>
> > > >> > > > >>>> >
> > > >> > > > >>>> >
> > > >> > > > >>>> > --
> > > >> > > > >>>> > Ryan Blue
> > > >> > > > >>>> > Software Engineer
> > > >> > > > >>>> > Netflix
> > > >> > > > >>>> >
> > > >> > > > >>>>
> > > >> > > > >>>
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Ryan Blue
> > > >> > Software Engineer
> > > >> > Netflix
> > > >> >
> > > >>
> > > >
> > >
> >
>
