>
> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> parquet-format releases are also part of it.


+1 to the confusion part.  The reason why I originally started this thread
is that none of this is entirely clear to me from existing documentation.

In particular it is confusing to me to say that the V2 Spec is not yet
finished when it looks like there have been multiple V2 Format releases.

It would be extremely useful to have documentation relating features to:
1.  The version of the spec they are part of
2.  Their current status in reference implementations

Thanks,
Micah


On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky <ga...@apache.org> wrote:

> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> parquet-format releases are also part of it.
> In this table many features are not related to the pages so I don't think
> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
> released in a 1.x release while 2.0 is not planned yet. (I was not in the
> community at that time, so I'm only guessing.)
>
> It is also worth mentioning that this does not seem to be related to the
> parquet-format releases, which means that, based on the spec, the
> implementations were/are not limited by this table.
>
>
> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > I remembered that there used to be a table. Looks like it was removed:
> >
> >
> > https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> >
> > The table used to list delta as a 2.0 feature.
> >
> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> >
> > > That answer I wrote to the other thread was based on the current
> > > code, so that is how parquet-mr works now. It does not say how it
> > > should work or how it works in other implementations. Unfortunately,
> > > the spec does not say anything about v1 and v2 in the context of
> > > encodings.
> > > Meanwhile, enabling the "new" encodings in v1 may generate
> > > compatibility issues with other implementations. (I am not sure how
> > > the existing releases of parquet-mr would behave if they had to read
> > > v1 pages with these encodings, but I believe it would work fine.)
> > >
> > > I think it would be a good idea to keep the existing default behavior
> > > as is but introduce some new flags where the user may set/suggest
> > > encodings for the different columns. This way the user can accept the
> > > risk of being potentially incompatible with other implementations (for
> > > the time being) and can also fine-tune the encodings for the data.
> > > This way we can also introduce some new encodings that are better in
> > > some cases (e.g. lossy compression for floating point numbers).
> > >
> > > What do you guys think?
> > > (I would be happy to help anyone who would like to contribute on this
> > > topic.)
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <jacq...@apache.org>
> > > wrote:
> > >
> > > > Gabor seems to agree that delta is V2 only.
> > > >
> > > > > To summarize, no delta encodings are used for V1 pages. They are
> > > > > available for V2 only.
> > > >
> > > >
> > > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
> > > >
> > > >
> > > >
> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <jacq...@apache.org>
> > > > wrote:
> > > >
> > > > > Good point. I had mentally categorized this as V2, not based on the
> > > > > docs?
> > > > >
> > > > > I don't think most tools write this, but I can't see anywhere that
> > > > > it says it is limited to v2 readers/writers. I'm not sure how many
> > > > > tools vectorize reading it versus delegating to the legacy mr path
> > > > > (at least in Java), either.
> > > > >
> > > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield
> > > > > > <emkornfi...@gmail.com> wrote:
> > > > >
> > > > > >> > The big win in v2 pages (if I remember correctly) is that the
> > > > > >> > variable length encoding is no longer interleaved. That would
> > > > > >> > provide a big performance lift when pulling into arrow vectors
> > > > > >> > (and variable length decoding typically dominates total read
> > > > > >> > processing time, on average I've seen 5-10x per cell cpu cost
> > > > > >> > increase for variable reads over scalar reads). AFAIK, there is
> > > > > >> > still no option for that in V1.
> > > > >>
> > > > >>
> > > > > >> For delta-length byte array, the documentation [1] states "This
> > > > > >> encoding is always preferred over PLAIN for byte array columns.",
> > > > > >> which makes it sound like this could be part of V1 or V2. The
> > > > > >> encoding enum in the Thrift file [2] doesn't seem to document
> > > > > >> this either.
> > > > > >>
> > > > > >> Is there clearer documentation for what encodings are considered
> > > > > >> part of v1 vs v2?
> > > > >>
> > > > >> Thanks,
> > > > >> Micah
> > > > >>
> > > > >>
> > > > >> [1]
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > > >> [2]
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> > > > >>
> > > > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau
> > > > > >> <jacq...@apache.org> wrote:
> > > > >>
> > > > > >>> The big win in v2 pages (if I remember correctly) is that the
> > > > > >>> variable length encoding is no longer interleaved. That would
> > > > > >>> provide a big performance lift when pulling into arrow vectors
> > > > > >>> (and variable length decoding typically dominates total read
> > > > > >>> processing time, on average I've seen 5-10x per cell cpu cost
> > > > > >>> increase for variable reads over scalar reads). AFAIK, there is
> > > > > >>> still no option for that in V1.
> > > > >>>
> > > > > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield
> > > > > >>> <emkornfi...@gmail.com> wrote:
> > > > >>>
> > > > >>>> Thanks for the quick reply Ryan.
> > > > >>>>
> > > > >>>>
> > > > > >>>> > We only use v1 and it still works well. That said, I'd love to
> > > > > >>>> > make some progress on better encodings and finalizing v2 so we
> > > > > >>>> > can use them!
> > > > >>>>
> > > > >>>>
> > > > > >>>> Are there JIRAs or other documentation tracking this work?
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Micah
> > > > >>>>
> > > > >>>>
> > > > > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <rb...@netflix.com>
> > > > > >>>> wrote:
> > > > >>>>
> > > > > >>>> > While there isn't anything wrong with it, the same challenges
> > > > > >>>> > have been solved in different ways with v1 pages. The main
> > > > > >>>> > difference is that v2 pages are broken at record boundaries,
> > > > > >>>> > and v1 pages weren't guaranteed to be. But, in order to write
> > > > > >>>> > page indexes near the footer, breaking pages at record
> > > > > >>>> > boundaries is required. So you know if you have page indexes,
> > > > > >>>> > you can actually use them to skip through pages safely. That
> > > > > >>>> > removes much of the need for v2 pages.
> > > > > >>>> >
> > > > > >>>> > The main drawback to using v2 pages is that the v2 spec is
> > > > > >>>> > unfinished, and I don't think there is a way to use just the
> > > > > >>>> > new pages. So you'd possibly end up pulling in other beta
> > > > > >>>> > features that probably shouldn't be used if you want to stick
> > > > > >>>> > with what is required for compatibility across
> > > > > >>>> > implementations.
> > > > > >>>> >
> > > > > >>>> > We only use v1 and it still works well. That said, I'd love to
> > > > > >>>> > make some progress on better encodings and finalizing v2 so we
> > > > > >>>> > can use them!
> > > > >>>> >
> > > > > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield
> > > > > >>>> > <emkornfi...@gmail.com> wrote:
> > > > >>>> >
> > > > > >>>> >> What is the current status of support for Data Page V2? Is it
> > > > > >>>> >> recommended for production workloads?
> > > > >>>> >>
> > > > >>>> >> Thanks,
> > > > >>>> >> Micah
> > > > >>>> >>
> > > > >>>> >
> > > > >>>> >
> > > > >>>> > --
> > > > >>>> > Ryan Blue
> > > > >>>> > Software Engineer
> > > > >>>> > Netflix
> > > > >>>> >
> > > > >>>>
> > > > >>>
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
