Re: Current status of Data Page V2?

Micah Kornfield Wed, 21 Oct 2020 21:04:50 -0700

I've created https://github.com/apache/parquet-format/pull/163 to try to
document these (note I really don't have historical context here so please
review carefully).


I would appreciate it if someone could point me to a reference on the
current status of V2. What is left unsettled? When can we start
recommending it for production use?

Thanks,
Micah

On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield <[email protected]>
wrote:

>> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
>> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
>> parquet-format releases are also part of it.
>
>
> +1 to the confusion part.  The reason why I originally started this thread
> is that none of this is entirely clear to me from existing documentation.
>
> In particular it is confusing to me to say that the V2 Spec is not yet
> finished when it looks like there have been multiple V2 Format releases.
>
> It would be extremely useful to have documentation relating features to:
> 1.  The version of the spec they are part of
> 2.  Their current status in reference implementations
>
> Thanks,
> Micah
>
>
> On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky <[email protected]> wrote:
>
>> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
>> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
>> parquet-format releases are also part of it.
>> In this table many features are not related to the pages so I don't think
>> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
>> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
>> released in a 1.x release while 2.0 is not planned yet. (I was not in the
>> community at that time, so I'm only guessing.)
>>
>> It is also worth mentioning that this does not seem to be related to the
>> parquet-format releases, which means that, based on the spec, the
>> implementations were/are not limited by this table.
>>
>>
>> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue <[email protected]>
>> wrote:
>>
>> > I remembered that there used to be a table. Looks like it was removed:
>> >
>> >
>> > https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
>> >
>> > The table used to list delta as a 2.0 feature.
>> >
>> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <[email protected]> wrote:
>> >
>> > > The answer I wrote in the other thread was based on the current code, so
>> > > that is how parquet-mr works now. It does not say how it should work
>> > > or how it works in other implementations. Unfortunately, the spec does
>> > > not say anything about v1 and v2 in the context of encodings.
>> > > Meanwhile, enabling the "new" encodings in v1 may cause compatibility
>> > > issues with other implementations. (I am not sure how the existing
>> > > releases of parquet-mr would behave if they had to read v1 pages with
>> > > these encodings, but I believe it would work fine.)
>> > >
>> > > I think it would be a good idea to keep the existing default behavior as
>> > > is but introduce some new flags with which the user may set/suggest
>> > > encodings for the different columns. This way the user accepts the risk
>> > > of being potentially incompatible with other implementations (for the
>> > > time being) and can also fine-tune the encodings for the data. This
>> > > would also let us introduce some new encodings that are better in some
>> > > cases (e.g. lossy compression for floating point numbers).
>> > >
>> > > What do you guys think?
>> > > (I would be happy to help anyone who would like to contribute to this
>> > > topic.)
>> > >
>> > > Cheers,
>> > > Gabor
>> > >
>> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <[email protected]> wrote:
>> > >
>> > > > Gabor seems to agree that delta is V2 only.
>> > > >
>> > > > > To summarize, no delta encodings are used for V1 pages. They are
>> > > > > available for V2 only.
>> > > >
>> > > >
>> > > > https://www.mail-archive.com/[email protected]/msg11826.html
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:
>> > > >
>> > > > > Good point. I had mentally categorized this as V2, not based on the
>> > > > > docs?
>> > > > >
>> > > > > I don't think most tools write this, but I can't see anywhere that it
>> > > > > says it is limited to v2 readers/writers. I'm not sure how many tools
>> > > > > use a vectorized read path for it versus delegating to the legacy mr
>> > > > > path (at least in Java), either.
>> > > > >
>> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]> wrote:
>> > > > >
>> > > > >>> The big win in v2 pages (if I remember correctly) is that the variable
>> > > > >>> length encoding is no longer interleaved. That would provide a big
>> > > > >>> performance lift when pulling into arrow vectors (and variable length
>> > > > >>> decoding typically dominates total read processing time, on average
>> > > > >>> I've seen 5-10x per cell cpu cost increase for variable reads over
>> > > > >>> scalar reads). AFAIK, there is still no option for that in V1.
>> > > > >>
>> > > > >>
>> > > > >> For delta-length byte array, the documentation [1] states "This encoding
>> > > > >> is always preferred over PLAIN for byte array columns," which makes it
>> > > > >> sound like this could be part of V1 or V2. The encoding enum in the
>> > > > >> Thrift file [2] doesn't seem to document this either.
>> > > > >>
>> > > > >> Is there clearer documentation for what encodings are considered part of
>> > > > >> v1 vs v2?
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Micah
>> > > > >>
>> > > > >>
>> > > > >> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
>> > > > >> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
>> > > > >>
>> > > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]> wrote:
>> > > > >>
>> > > > >>> The big win in v2 pages (if I remember correctly) is that the variable
>> > > > >>> length encoding is no longer interleaved. That would provide a big
>> > > > >>> performance lift when pulling into arrow vectors (and variable length
>> > > > >>> decoding typically dominates total read processing time, on average
>> > > > >>> I've seen 5-10x per cell cpu cost increase for variable reads over
>> > > > >>> scalar reads). AFAIK, there is still no option for that in V1.
>> > > > >>>
>> > > > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]> wrote:
>> > > > >>>
>> > > > >>>> Thanks for the quick reply Ryan.
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> > We only use v1 and it still works well. That said, I'd love to make some
>> > > > >>>> > progress on better encodings and finalizing v2 so we can use them!
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> Are there JIRAs or other documentation that is tracking this work?
>> > > > >>>>
>> > > > >>>> Thanks,
>> > > > >>>> Micah
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
>> > > > >>>>
>> > > > >>>> > While there isn't anything wrong with it, the same challenges have been
>> > > > >>>> > solved in different ways with v1 pages. The main difference is that v2
>> > > > >>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
>> > > > >>>> > be. But, in order to write page indexes near the footer, breaking pages at
>> > > > >>>> > record boundaries is required. So you know if you have page indexes, you
>> > > > >>>> > can actually use them to skip through pages safely. That removes much of
>> > > > >>>> > the need for v2 pages.
>> > > > >>>> >
>> > > > >>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
>> > > > >>>> > I don't think there is a way to use just the new pages. So you'd possibly
>> > > > >>>> > end up pulling in other beta features that probably shouldn't be used if
>> > > > >>>> > you want to stick with what is required for compatibility across
>> > > > >>>> > implementations.
>> > > > >>>> >
>> > > > >>>> > We only use v1 and it still works well. That said, I'd love to make some
>> > > > >>>> > progress on better encodings and finalizing v2 so we can use them!
>> > > > >>>> >
>> > > > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]> wrote:
>> > > > >>>> >
>> > > > >>>> >> What is the current status of support for Data Page V2?  Is it recommended
>> > > > >>>> >> for production workloads?
>> > > > >>>> >>
>> > > > >>>> >> Thanks,
>> > > > >>>> >> Micah
>> > > > >>>> >>
>> > > > >>>> >
>> > > > >>>> >
>> > > > >>>> > --
>> > > > >>>> > Ryan Blue
>> > > > >>>> > Software Engineer
>> > > > >>>> > Netflix
>> > > > >>>> >
>> > > > >>>>
>> > > > >>>
>> > > >
>> > >
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>> >
>>
>
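
For anyone reading this thread later, the "no longer interleaved" point about
variable length values can be illustrated with a rough sketch. It contrasts the
PLAIN layout for byte arrays (a 4-byte little-endian length before each value)
with a lengths-first layout in the spirit of DELTA_LENGTH_BYTE_ARRAY. This is
only an illustration under stated assumptions: the function names are made up
for this example, and the lengths are stored as plain int32 values here,
whereas the actual encoding stores them with DELTA_BINARY_PACKED.

```python
import struct


def encode_plain(values):
    """PLAIN BYTE_ARRAY layout: [len0][bytes0][len1][bytes1]... (interleaved)."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # 4-byte little-endian length
        out += v
    return bytes(out)


def decode_plain(buf, count):
    """Walk the buffer value by value; each length must be read before the next."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_lengths_first(values):
    """Simplified lengths-first layout: [len0][len1]...[bytes0][bytes1]...

    Assumption for brevity: lengths are plain int32, not DELTA_BINARY_PACKED.
    """
    lengths = b"".join(struct.pack("<I", len(v)) for v in values)
    return lengths + b"".join(values)


def decode_lengths_first(buf, count):
    """Return Arrow-style (offsets, data) without touching the value bytes."""
    lengths = struct.unpack_from("<%dI" % count, buf, 0)
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)  # running sum of lengths
    data = buf[4 * count:]  # contiguous value bytes, usable as-is
    return offsets, data


values = [b"parquet", b"", b"v2"]
plain_roundtrip = decode_plain(encode_plain(values), len(values))
offsets, data = decode_lengths_first(encode_lengths_first(values), len(values))
```

With the lengths stored together, the decoder can build an offsets array and
hand over the value bytes as one contiguous buffer, which is roughly the shape
a variable length Arrow vector expects. With the interleaved layout, every
value has to be visited in turn to find the next length.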
