Re: Current status of Data Page V2?

Gabor Szadovszky Mon, 12 Oct 2020 01:38:50 -0700

The answer I wrote in the other thread was based on the current code, so
that is how parquet-mr works now. It does not say how it should work, or
how other implementations work. Unfortunately, the spec does not say
anything about v1 and v2 in the context of encodings.
Meanwhile, enabling the "new" encodings in v1 may cause compatibility
issues with other implementations. (I am not sure how the existing
releases of parquet-mr would behave if they had to read v1 pages with these
encodings, but I believe it would work fine.)


I think it would be a good idea to keep the existing default behavior as
is, but introduce some new flags that let the user set or suggest encodings
for individual columns. This way the user takes on the risk of being
potentially incompatible with other implementations (for the time being)
and can also fine-tune the encodings for the data. It would also let us
introduce some new encodings that are better in some cases (e.g. lossy
compression for floating point numbers).
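
To make the idea concrete, here is a minimal sketch of how such a per-column
flag could be resolved. Everything in it is hypothetical: the function name,
the override map, and the default table are made up for illustration and are
not existing parquet-mr configuration. The defaults only mirror the summary
from the other thread (no delta encodings for v1 pages, delta for v2).

```python
from typing import Dict, Optional

# Illustrative defaults only (an assumption, not the real parquet-mr tables):
# v1 pages never get a delta encoding, v2 pages do.
DEFAULT_ENCODING: Dict[tuple, str] = {
    ("v1", "INT64"): "PLAIN",
    ("v1", "BINARY"): "PLAIN",
    ("v2", "INT64"): "DELTA_BINARY_PACKED",
    ("v2", "BINARY"): "DELTA_BYTE_ARRAY",
}


def choose_encoding(
    page_version: str,
    physical_type: str,
    column: str,
    overrides: Optional[Dict[str, str]] = None,
) -> str:
    """Return the encoding for a column.

    A user-supplied override wins; otherwise the existing default
    for the page version and physical type is kept unchanged.
    """
    if overrides and column in overrides:
        return overrides[column]
    return DEFAULT_ENCODING[(page_version, physical_type)]


# Hypothetical usage: the user opts one column into a "new" encoding on
# v1 pages and accepts the compatibility risk for that column only.
user_overrides = {"name": "DELTA_LENGTH_BYTE_ARRAY"}
print(choose_encoding("v1", "BINARY", "name", user_overrides))
print(choose_encoding("v1", "BINARY", "city", user_overrides))
```

With no overrides the result is exactly the current default, so existing
users would see no change.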

What do you guys think?
(I would be happy to help anyone who would like to contribute to this topic.)

Cheers,
Gabor

On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <jacq...@apache.org> wrote:

> Gabor seems to agree that delta is V2 only.
>
> > To summarize, no delta encodings are used for V1 pages. They are available
> > for V2 only.
>
>
> https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
>
>
>
> On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > Good point. I had mentally categorized this as V2, not based on the docs?
> >
> > I don't think most tools write this but I can't see anywhere that it says
> > it is limited to v2 readers/writers. I'm not sure how many tools vectorize
> > read it versus delegate to the legacy mr path (at least in java), either.
> >
> > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >>> The big win in v2 pages (if I remember correctly) is that the variable
> >>> length encoding is no longer interleaved. That would provide a big
> >>> performance lift when pulling into arrow vectors (and variable length
> >>> decoding typically dominates total read processing time, on average I've
> >>> seen 5-10x per cell cpu cost increase for variable reads over scalar
> >>> reads). AFAIK, there is still no option for that in V1.
> >>
> >>
> >> For Delta-length byte array, the documentation [1] states "This encoding
> >> is always preferred over PLAIN for byte array columns.", which makes it
> >> sound like this could be part of V1 or V2. The encoding enum in the Thrift
> >> file [2] doesn't seem to document this either.
> >>
> >> Is there clearer documentation for what encodings are considered part of
> >> v1 vs v2?
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >> [1]
> >> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> >> [2]
> >> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> >>
> >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >>
> >>> The big win in v2 pages (if I remember correctly) is that the variable
> >>> length encoding is no longer interleaved. That would provide a big
> >>> performance lift when pulling into arrow vectors (and variable length
> >>> decoding typically dominates total read processing time, on average I've
> >>> seen 5-10x per cell cpu cost increase for variable reads over scalar
> >>> reads). AFAIK, there is still no option for that in V1.
> >>>
> >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>
> >>>> Thanks for the quick reply Ryan.
> >>>>
> >>>>
> >>>> > We only use v1 and it still works well. That said, I'd love to make some
> >>>> > progress on better encodings and finalizing v2 so we can use them!
> >>>>
> >>>>
> >>>> Are there JIRAs or other documentation that is tracking this work?
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>>
> >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <rb...@netflix.com> wrote:
> >>>>
> >>>> > While there isn't anything wrong with it, the same challenges have been
> >>>> > solved in different ways with v1 pages. The main difference is that v2
> >>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
> >>>> > be. But, in order to write page indexes near the footer, breaking pages at
> >>>> > record boundaries is required. So you know if you have page indexes, you
> >>>> > can actually use them to skip through pages safely. That removes much of
> >>>> > the need for v2 pages.
> >>>> >
> >>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
> >>>> > I don't think there is a way to use just the new pages. So you'd possibly
> >>>> > end up pulling in other beta features that probably shouldn't be used if
> >>>> > you want to stick with what is required for compatibility across
> >>>> > implementations.
> >>>> >
> >>>> > We only use v1 and it still works well. That said, I'd love to make some
> >>>> > progress on better encodings and finalizing v2 so we can use them!
> >>>> >
> >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <emkornfi...@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> >> What is the current status of support for Data Page V2?  Is it recommended
> >>>> >> for production workloads?
> >>>> >>
> >>>> >> Thanks,
> >>>> >> Micah
> >>>> >>
> >>>> >
> >>>> >
> >>>> > --
> >>>> > Ryan Blue
> >>>> > Software Engineer
> >>>> > Netflix
> >>>> >
> >>>>
> >>>
>
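
On the point quoted above about variable-length encoding no longer being
interleaved, a rough sketch of the two layouts may help. PLAIN stores each
byte array as a 4-byte little-endian length followed by its bytes, so lengths
and data alternate. DELTA_LENGTH_BYTE_ARRAY stores all lengths first and then
the concatenated bytes. As a simplification, the sketch below writes the
lengths as fixed-width int32 values instead of delta binary packing them, and
the function names are made up for illustration.

```python
import struct
from typing import List


def encode_plain(values: List[bytes]) -> bytes:
    """PLAIN layout: length, bytes, length, bytes, ... (interleaved)."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # 4-byte little-endian length
        out += v
    return bytes(out)


def encode_lengths_first(values: List[bytes]) -> bytes:
    """Lengths-first layout: all lengths, then all bytes back to back.

    Simplification: lengths are plain int32 here, whereas the real
    DELTA_LENGTH_BYTE_ARRAY encoding delta binary packs them.
    """
    lengths = b"".join(struct.pack("<I", len(v)) for v in values)
    return lengths + b"".join(values)


def decode_lengths_first(buf: bytes, count: int) -> List[bytes]:
    """Read all lengths in one pass, then slice the contiguous data."""
    lengths = struct.unpack_from(f"<{count}I", buf, 0)
    data = buf[4 * count:]
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)  # running sum gives the offsets
    return [data[offsets[i]:offsets[i + 1]] for i in range(count)]


values = [b"ab", b"", b"cde"]
assert decode_lengths_first(encode_lengths_first(values), len(values)) == values
```

The offsets plus one contiguous data buffer are close to what an Arrow
variable-length vector holds, which is why the lengths-first layout is
cheaper to load than walking interleaved length prefixes one value at a time.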
