Re: Current status of Data Page V2?

Jacques Nadeau Fri, 09 Oct 2020 17:07:35 -0700

Good point. I had mentally categorized this as V2, not based on the docs?

I don't think most tools write this but I can't see anywhere that it says
it is limited to v2 readers/writers. I'm not sure how many tools vectorize
read it versus delegate to the legacy mr path (at least in java), either.


On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]>
wrote:

>> The big win in v2 pages (if I remember correctly) is that the variable
>> length encoding is no longer interleaved. That would provide a big
>> performance lift when pulling into arrow vectors (and variable length
>> decoding typically dominates total read processing time, on average I've
>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>> reads). AFAIK, there is still no option for that in V1.
>
>
> For Delta-length byte array, the documentation [1] states "This encoding is
> always preferred over PLAIN for byte array columns.", which makes it sound
> like this could be part of V1 or V2. The encoding enum in the Thrift file [2]
> doesn't seem to document this either.
>
> Is there clearer documentation for what encodings are considered part of
> v1 vs v2?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> [2]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
>
> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]> wrote:
>
>> The big win in v2 pages (if I remember correctly) is that the variable
>> length encoding is no longer interleaved. That would provide a big
>> performance lift when pulling into arrow vectors (and variable length
>> decoding typically dominates total read processing time, on average I've
>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>> reads). AFAIK, there is still no option for that in V1.
>>
>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Thanks for the quick reply Ryan.
>>>
>>>
>>> > We only use v1 and it still works well. That said, I'd love to make some
>>> > progress on better encodings and finalizing v2 so we can use them!
>>>
>>>
>>> Are there JIRAs or other documentation that is tracking this work?
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
>>>
>>> > While there isn't anything wrong with it, the same challenges have been
>>> > solved in different ways with v1 pages. The main difference is that v2
>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
>>> > be. But, in order to write page indexes near the footer, breaking pages at
>>> > record boundaries is required. So you know if you have page indexes, you
>>> > can actually use them to skip through pages safely. That removes much of
>>> > the need for v2 pages.
>>> >
>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
>>> > I don't think there is a way to use just the new pages. So you'd possibly
>>> > end up pulling in other beta features that probably shouldn't be used if
>>> > you want to stick with what is required for compatibility across
>>> > implementations.
>>> >
>>> > We only use v1 and it still works well. That said, I'd love to make some
>>> > progress on better encodings and finalizing v2 so we can use them!
>>> >
>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]>
>>> > wrote:
>>> >
>>> >> What is the current status of support for Data Page V2?  Is it recommended
>>> >> for production workloads?
>>> >>
>>> >> Thanks,
>>> >> Micah
>>> >>
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>> >
>>>
>>
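
For readers following the "interleaved" point quoted above, here is a rough
sketch of the difference. It is not taken from any Parquet implementation.
The first layout follows PLAIN for BYTE_ARRAY (a 4-byte little-endian length
before each value). The second is a simplified stand-in for
DELTA_LENGTH_BYTE_ARRAY: it stores all lengths first and the concatenated
bytes after, but, as an assumption made for brevity, it writes the lengths as
plain int32 values instead of DELTA_BINARY_PACKED. All function names are
hypothetical.

```python
import struct


def encode_interleaved(values):
    """PLAIN-style layout: [len][bytes][len][bytes]..."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # 4-byte little-endian length
        out += v
    return bytes(out)


def decode_interleaved(buf, count):
    """Each length must be read before the next value can be located."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_split(values):
    """Simplified split layout: all lengths, then all bytes.

    Assumption: lengths are plain int32 here; the real encoding
    delta-packs them.
    """
    lengths = struct.pack(f"<{len(values)}I", *[len(v) for v in values])
    return lengths + b"".join(values)


def decode_split(buf, count):
    """Return (offsets, data), the shape of an Arrow variable-length array."""
    lengths = struct.unpack_from(f"<{count}I", buf, 0)
    data = buf[4 * count:]  # value bytes are already contiguous
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets, data


values = [b"parquet", b"", b"v2"]
offsets, data = decode_split(encode_split(values), len(values))
print(offsets, data)
```

With the split layout the value bytes are already one contiguous buffer, and
the offsets are a running sum of the lengths, so nothing has to be copied
value by value. The interleaved decoder has to walk the buffer one value at a
time.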
