Re: Current status of Data Page V2?

Ryan Blue Mon, 12 Oct 2020 09:53:52 -0700

I remembered that there used to be a table. Looks like it was removed:
https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8


The table used to list delta as a 2.0 feature.

On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <[email protected]> wrote:

> The answer I wrote to the other thread was based on the current code, so
> that is how parquet-mr works now. It does not say, though, how it should
> work or how it works in other implementations. Unfortunately, the spec does
> not say anything about v1 and v2 in the context of encodings.
> Meanwhile, enabling the "new" encodings in v1 may generate compatibility
> issues with other implementations. (I am not sure how the existing
> releases of parquet-mr would behave if they had to read v1 pages with these
> encodings, but I believe it would work fine.)
>
> I think it would be a good idea to keep the existing default behavior as
> is but introduce some new flags where the user may set/suggest encodings
> for the different columns. This way the user accepts the risk of being
> potentially incompatible with other implementations (for the time being)
> and can also fine-tune the encodings for the data. This way we can also
> introduce some new encodings that are better in some cases (e.g. lossy
> compression for floating point numbers).
>
> What do you guys think?
> (I would be happy to help anyone who would like to contribute on this topic.)
>
> Cheers,
> Gabor
>
> On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <[email protected]> wrote:
>
> > Gabor seems to agree that delta is V2 only.
> >
> > > To summarize, no delta encodings are used for V1 pages. They are
> > > available for V2 only.
> >
> >
> > https://www.mail-archive.com/[email protected]/msg11826.html
> >
> >
> >
> > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:
> >
> > > Good point. I had mentally categorized this as V2, not based on the docs?
> > >
> > > I don't think most tools write this but I can't see anywhere that it says
> > > it is limited to v2 readers/writers. I'm not sure how many tools vectorize
> > > read it versus delegate to the legacy mr path (at least in java), either.
> > >
> > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > >>> The big win in v2 pages (if I remember correctly) is that the variable
> > >>> length encoding is no longer interleaved. That would provide a big
> > >>> performance lift when pulling into arrow vectors (and variable length
> > >>> decoding typically dominates total read processing time, on average I've
> > >>> seen 5-10x per cell cpu cost increase for variable reads over scalar
> > >>> reads). AFAIK, there is still no option for that in V1.
> > >>
> > >>
> > >> For delta-length byte array, the documentation [1] states "This encoding
> > >> is always preferred over PLAIN for byte array columns," which makes it
> > >> sound like this could be part of V1 or V2. The encoding enum in the Thrift
> > >> file [2] doesn't seem to document this either.
> > >>
> > >> Is there clearer documentation for what encodings are considered part of
> > >> v1 vs v2?
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >>
> > >> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > >> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> > >>
> > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]>
> > >> wrote:
> > >>
> > >>> The big win in v2 pages (if I remember correctly) is that the variable
> > >>> length encoding is no longer interleaved. That would provide a big
> > >>> performance lift when pulling into arrow vectors (and variable length
> > >>> decoding typically dominates total read processing time, on average I've
> > >>> seen 5-10x per cell cpu cost increase for variable reads over scalar
> > >>> reads). AFAIK, there is still no option for that in V1.
> > >>>
> > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Thanks for the quick reply Ryan.
> > >>>>
> > >>>>
> > >>>> > We only use v1 and it still works well. That said, I'd love to make some
> > >>>> > progress on better encodings and finalizing v2 so we can use them!
> > >>>>
> > >>>>
> > >>>> Are there JIRAs or other documentation that is tracking this work?
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>>
> > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
> > >>>>
> > >>>> > While there isn't anything wrong with it, the same challenges have been
> > >>>> > solved in different ways with v1 pages. The main difference is that v2
> > >>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
> > >>>> > be. But, in order to write page indexes near the footer, breaking pages at
> > >>>> > record boundaries is required. So you know if you have page indexes, you
> > >>>> > can actually use them to skip through pages safely. That removes much of
> > >>>> > the need for v2 pages.
> > >>>> >
> > >>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
> > >>>> > I don't think there is a way to use just the new pages. So you'd possibly
> > >>>> > end up pulling in other beta features that probably shouldn't be used if
> > >>>> > you want to stick with what is required for compatibility across
> > >>>> > implementations.
> > >>>> >
> > >>>> > We only use v1 and it still works well. That said, I'd love to make some
> > >>>> > progress on better encodings and finalizing v2 so we can use them!
> > >>>> >
> > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]>
> > >>>> > wrote:
> > >>>> >
> > >>>> >> What is the current status of support for Data Page V2? Is it recommended
> > >>>> >> for production workloads?
> > >>>> >>
> > >>>> >> Thanks,
> > >>>> >> Micah
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>> > --
> > >>>> > Ryan Blue
> > >>>> > Software Engineer
> > >>>> > Netflix
> > >>>> >
> > >>>>
> > >>>
> >
>


-- 
Ryan Blue
Software Engineer
Netflix
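
Editorial sketch (not part of the original messages): the "interleaved" point raised in the thread can be illustrated with a small Python example. It contrasts the PLAIN layout for byte arrays, where each value is a 4-byte little-endian length followed by its bytes, with a lengths-first layout in the spirit of DELTA_LENGTH_BYTE_ARRAY. This is a simplification under stated assumptions: the real encoding stores the lengths with DELTA_BINARY_PACKED, whereas this sketch uses fixed 4-byte integers, and all function names here are hypothetical and do not come from parquet-mr or any other implementation.

```python
import struct


def encode_plain(values):
    """PLAIN for byte arrays: length and data are interleaved per value."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # 4-byte little-endian length
        out += v                          # followed by the value bytes
    return bytes(out)


def decode_plain(buf, count):
    """Decoding must walk the buffer value by value to find each length."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_lengths_first(values):
    """Simplified lengths-first layout: all lengths, then all bytes.

    Assumption: fixed-width lengths stand in for DELTA_BINARY_PACKED.
    """
    lengths = b"".join(struct.pack("<I", len(v)) for v in values)
    return lengths + b"".join(values)


def decode_lengths_first(buf, count):
    """Return (offsets, data), the shape an Arrow-style binary vector uses."""
    lengths = struct.unpack_from("<%dI" % count, buf, 0)
    data = buf[4 * count:]
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)  # running sum of lengths
    return offsets, data
```

With the lengths stored together, a reader can compute the offsets in one pass and take the data region as a single contiguous buffer, instead of copying each value out of an interleaved stream. That is one plausible reading of why the lengths-first layout helps when loading into Arrow vectors.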
