Re: Current status of Data Page V2?

Jacques Nadeau Fri, 09 Oct 2020 17:18:34 -0700

Gabor seems to agree that delta is V2 only.

> To summarize, no delta encodings are used for V1 pages. They are available
> for V2 only.



https://www.mail-archive.com/[email protected]/msg11826.html



On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:

> Good point. I had mentally categorized this as V2, though not based on the
> docs.
>
> I don't think most tools write this, but I can't see anywhere that says it
> is limited to v2 readers/writers. I'm also not sure how many tools use a
> vectorized read for it versus delegating to the legacy MR path (at least in
> Java).
>
> On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]>
> wrote:
>
>>> The big win in v2 pages (if I remember correctly) is that the variable
>>> length encoding is no longer interleaved. That would provide a big
>>> performance lift when pulling into arrow vectors (and variable length
>>> decoding typically dominates total read processing time, on average I've
>>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>>> reads). AFAIK, there is still no option for that in V1.
>>
>>
>> For Delta-length byte array, the documentation [1] states "This encoding is
>> always preferred over PLAIN for byte array columns," which makes it sound
>> like this could be part of V1 or V2. The encoding enum in the Thrift file [2]
>> doesn't seem to document this either.
>>
>> Is there clearer documentation for what encodings are considered part of
>> v1 vs v2?
>>
>> Thanks,
>> Micah
>>
>>
>> [1]
>> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
>> [2]
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
>>
>> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]>
>> wrote:
>>
>>> The big win in v2 pages (if I remember correctly) is that the variable
>>> length encoding is no longer interleaved. That would provide a big
>>> performance lift when pulling into arrow vectors (and variable length
>>> decoding typically dominates total read processing time, on average I've
>>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>>> reads). AFAIK, there is still no option for that in V1.
>>>
>>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> Thanks for the quick reply Ryan.
>>>>
>>>>
>>>> > We only use v1 and it still works well. That said, I'd love to make some
>>>> > progress on better encodings and finalizing v2 so we can use them!
>>>>
>>>>
>>>> Are there JIRAs or other documentation that is tracking this work?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>> > While there isn't anything wrong with it, the same challenges have been
>>>> > solved in different ways with v1 pages. The main difference is that v2
>>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
>>>> > be. But, in order to write page indexes near the footer, breaking pages at
>>>> > record boundaries is required. So you know if you have page indexes, you
>>>> > can actually use them to skip through pages safely. That removes much of
>>>> > the need for v2 pages.
>>>> >
>>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
>>>> > I don't think there is a way to use just the new pages. So you'd possibly
>>>> > end up pulling in other beta features that probably shouldn't be used if
>>>> > you want to stick with what is required for compatibility across
>>>> > implementations.
>>>> >
>>>> > We only use v1 and it still works well. That said, I'd love to make some
>>>> > progress on better encodings and finalizing v2 so we can use them!
>>>> >
>>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]>
>>>> > wrote:
>>>> >
>>>> >> What is the current status of support for Data Page V2?  Is it recommended
>>>> >> for production workloads?
>>>> >>
>>>> >> Thanks,
>>>> >> Micah
>>>> >>
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>> >
>>>>
>>>
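
Illustration (not part of the original messages): the "interleaved" point made in the thread can be sketched roughly as follows. In PLAIN, each byte array value is stored as a 4-byte little-endian length followed by its bytes, so lengths and data alternate. DELTA_LENGTH_BYTE_ARRAY stores all lengths first and then the concatenated bytes. This sketch is a simplification under stated assumptions: the lengths are written as plain 32-bit integers rather than with the real DELTA_BINARY_PACKED encoding, and the function names are hypothetical, not taken from any Parquet library.

```python
import struct


def encode_plain(values):
    """PLAIN BYTE_ARRAY layout: 4-byte little-endian length, then the bytes, per value."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)


def decode_plain(buf, count):
    """Walk the buffer value by value; each length must be read before the next value is found."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_lengths_first(values):
    """Simplified stand-in for DELTA_LENGTH_BYTE_ARRAY: all lengths, then all bytes.

    Assumption: lengths are plain uint32 here, not DELTA_BINARY_PACKED.
    """
    lengths = struct.pack(f"<{len(values)}I", *(len(v) for v in values))
    return lengths + b"".join(values)


def decode_lengths_first(buf, count):
    """Return (offsets, data), the shape of an Arrow-style variable-length column."""
    lengths = struct.unpack_from(f"<{count}I", buf, 0)
    data = buf[4 * count:]          # one contiguous slice, no per-value copying
    offsets = [0]
    for n in lengths:               # running sum of lengths gives the offsets
        offsets.append(offsets[-1] + n)
    return offsets, data


values = [b"parquet", b"", b"v2"]
plain = encode_plain(values)
separated = encode_lengths_first(values)
offsets, data = decode_lengths_first(separated, len(values))
```

With the lengths separated, the data bytes can be handed over as a single buffer and the offsets computed in one pass, which is roughly the shape a variable-length Arrow vector wants. The interleaved layout forces a per-value walk through the page.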
