Re: Current status of Data Page V2?

Gabor Szadovszky Thu, 22 Oct 2020 01:31:58 -0700

It is still not clear to me if we want to recommend V2 for production use
at all or simply introduce the new encodings for V1. I would suggest
discussing this topic on the parquet sync next Tuesday.


On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield <[email protected]>
wrote:

> I've created https://github.com/apache/parquet-format/pull/163 to try to
> document these (note I really don't have historical context here so please
> review carefully).
>
> I would appreciate it if someone could point me to a reference on what the
> current status of V2 is?  What is left unsettled? When can we start
> recommending it for production use?
>
> Thanks,
> Micah
>
> On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield <[email protected]>
> wrote:
>
> >> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> >> parquet-format releases are also part of it.
> >
> >
> > +1 to the confusion part.  The reason why I originally started this thread
> > is that none of this is entirely clear to me from existing documentation.
> >
> > In particular it is confusing to me to say that the V2 Spec is not yet
> > finished when it looks like there have been multiple V2 Format releases.
> >
> > It would be extremely useful to have documentation relating features to:
> > 1.  The version of the spec they are part of
> > 2.  Their current status in reference implementations
> >
> > Thanks,
> > Micah
> >
> >
> > On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky <[email protected]>
> wrote:
> >
> >> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> >> parquet-format releases are also part of it.
> >> In this table many features are not related to the pages, so I don't think
> >> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
> >> plan to release parquet-mr 2.0 with the v2 pages, but then v2 pages were
> >> released in a 1.x release while 2.0 is not planned yet. (I was not in the
> >> community at that time, so I'm only guessing.)
> >>
> >> It is also worth mentioning that the table does not seem to be related to
> >> the parquet-format releases, which means that based on the spec the
> >> implementations were/are not limited by this table.
> >>
> >>
> >> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue <[email protected]>
> >> wrote:
> >>
> >> > I remembered that there used to be a table. Looks like it was removed:
> >> >
> >> > https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> >> >
> >> > The table used to list delta as a 2.0 feature.
> >> >
> >> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <[email protected]>
> >> wrote:
> >> >
> >> > > That answer I wrote to the other thread was based on the current code, so
> >> > > that is how parquet-mr is working now. It does not say how it should
> >> > > work or how it works in other implementations. Unfortunately, the spec
> >> > > does not say anything about v1 and v2 in the context of encodings.
> >> > > Meanwhile, enabling the "new" encodings in v1 may cause compatibility
> >> > > issues with other implementations. (I am not sure how the existing
> >> > > releases of parquet-mr would behave if they had to read v1 pages with
> >> > > these encodings, but I believe it would work fine.)
> >> > >
> >> > > I think it would be a good idea to keep the existing default behavior as
> >> > > is but introduce some new flags where the user may set/suggest encodings
> >> > > for the different columns. This way the user takes on the risk of being
> >> > > potentially incompatible with other implementations (for the time being)
> >> > > and can also fine-tune the encodings for the data. This way we can also
> >> > > introduce some new encodings that are better in some cases (e.g. lossy
> >> > > compression for floating point numbers).
> >> > >
> >> > > What do you guys think?
> >> > > (I would be happy to help anyone who would like to contribute on this
> >> > > topic.)
> >> > >
> >> > > Cheers,
> >> > > Gabor
> >> > >
> >> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <[email protected]>
> >> > wrote:
> >> > >
> >> > > > Gabor seems to agree that delta is V2 only.
> >> > > >
> >> > > > > To summarize, no delta encodings are used for V1 pages. They are
> >> > > > > available for V2 only.
> >> > > >
> >> > > >
> >> > > > https://www.mail-archive.com/[email protected]/msg11826.html
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:
> >> > > >
> >> > > > > Good point. I had mentally categorized this as V2, not based on the docs?
> >> > > > >
> >> > > > > I don't think most tools write this, but I can't see anywhere that it
> >> > > > > says it is limited to v2 readers/writers. I'm not sure how many tools
> >> > > > > do a vectorized read of it versus delegating to the legacy mr path (at
> >> > > > > least in Java), either.
> >> > > > >
> >> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]>
> >> > > > > wrote:
> >> > > > >
> >> > > > >>> The big win in v2 pages (if I remember correctly) is that the
> >> > > > >>> variable length encoding is no longer interleaved. That would provide
> >> > > > >>> a big performance lift when pulling into arrow vectors (and variable
> >> > > > >>> length decoding typically dominates total read processing time, on
> >> > > > >>> average I've seen 5-10x per cell cpu cost increase for variable reads
> >> > > > >>> over scalar reads). AFAIK, there is still no option for that in V1.
> >> > > > >>
> >> > > > >>
> >> > > > >> For delta-length byte array, the documentation [1] states "This
> >> > > > >> encoding is always preferred over PLAIN for byte array columns", which
> >> > > > >> makes it sound like this could be part of V1 or V2. The encoding enum
> >> > > > >> in the Thrift file [2] doesn't seem to document this either.
> >> > > > >>
> >> > > > >> Is there clearer documentation for what encodings are considered part
> >> > > > >> of v1 vs v2?
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >> Micah
> >> > > > >>
> >> > > > >>
> >> > > > >> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> >> > > > >> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> >> > > > >>
> >> > > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]>
> >> > > > >> wrote:
> >> > > > >>
> >> > > > >>> The big win in v2 pages (if I remember correctly) is that the
> >> > > > >>> variable length encoding is no longer interleaved. That would provide
> >> > > > >>> a big performance lift when pulling into arrow vectors (and variable
> >> > > > >>> length decoding typically dominates total read processing time, on
> >> > > > >>> average I've seen 5-10x per cell cpu cost increase for variable reads
> >> > > > >>> over scalar reads). AFAIK, there is still no option for that in V1.
> >> > > > >>>
> >> > > > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]>
> >> > > > >>> wrote:
> >> > > > >>>
> >> > > > >>>> Thanks for the quick reply Ryan.
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>> > We only use v1 and it still works well. That said, I'd love to make
> >> > > > >>>> > some progress on better encodings and finalizing v2 so we can use
> >> > > > >>>> > them!
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>> Are there JIRAs or other documentation that is tracking this
> >> work?
> >> > > > >>>>
> >> > > > >>>> Thanks,
> >> > > > >>>> Micah
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
> >> > > > >>>>
> >> > > > >>>> > While there isn't anything wrong with it, the same challenges have
> >> > > > >>>> > been solved in different ways with v1 pages. The main difference is
> >> > > > >>>> > that v2 pages are broken at record boundaries, and v1 pages weren't
> >> > > > >>>> > guaranteed to be. But, in order to write page indexes near the
> >> > > > >>>> > footer, breaking pages at record boundaries is required. So you know
> >> > > > >>>> > if you have page indexes, you can actually use them to skip through
> >> > > > >>>> > pages safely. That removes much of the need for v2 pages.
> >> > > > >>>> >
> >> > > > >>>> > The main drawback to using v2 pages is that the v2 spec is
> >> > > > >>>> > unfinished, and I don't think there is a way to use just the new
> >> > > > >>>> > pages. So you'd possibly end up pulling in other beta features that
> >> > > > >>>> > probably shouldn't be used if you want to stick with what is
> >> > > > >>>> > required for compatibility across implementations.
> >> > > > >>>> >
> >> > > > >>>> > We only use v1 and it still works well. That said, I'd love to make
> >> > > > >>>> > some progress on better encodings and finalizing v2 so we can use
> >> > > > >>>> > them!
> >> > > > >>>> >
> >> > > > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]>
> >> > > > >>>> > wrote:
> >> > > > >>>> >
> >> > > > >>>> >> What is the current status of support for Data Page V2? Is it
> >> > > > >>>> >> recommended for production workloads?
> >> > > > >>>> >>
> >> > > > >>>> >> Thanks,
> >> > > > >>>> >> Micah
> >> > > > >>>> >>
> >> > > > >>>> >
> >> > > > >>>> >
> >> > > > >>>> > --
> >> > > > >>>> > Ryan Blue
> >> > > > >>>> > Software Engineer
> >> > > > >>>> > Netflix
> >> > > > >>>> >
> >> > > > >>>>
> >> > > > >>>
> >> > > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Ryan Blue
> >> > Software Engineer
> >> > Netflix
> >> >
> >>
> >
>
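
For readers following the "no longer interleaved" point in the thread, here is a minimal sketch of the difference between PLAIN byte arrays (each value carries its own 4-byte little-endian length prefix) and the layout used by DELTA_LENGTH_BYTE_ARRAY (all lengths first, then the concatenated bytes). The function names are hypothetical, and as a simplifying assumption the lengths are kept as a plain Python list; the real encoding additionally delta-bit-packs them. It is an illustration only, not parquet-mr code.

```python
import struct


def encode_plain(values):
    """PLAIN BYTE_ARRAY: length prefix and bytes are interleaved per value."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # 4-byte little-endian length
        out += v
    return bytes(out)


def decode_plain(buf, count):
    """Each length must be read before the next value can be located."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_lengths_then_data(values):
    """Simplified DELTA_LENGTH_BYTE_ARRAY layout: lengths, then all bytes.

    Assumption: lengths stay a plain list instead of being delta-bit-packed.
    """
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def decode_lengths_then_data(lengths, data):
    """Offsets follow from the lengths alone; the data is one contiguous buffer."""
    values, pos = [], 0
    for n in lengths:
        values.append(data[pos:pos + n])
        pos += n
    return values
```

With the second layout, a reader can turn the lengths into an offsets array and take the data buffer as is, which is close to how variable-length columns are laid out in Arrow.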
