Gabor seems to agree that delta is V2 only.

> To summarize, no delta encodings are used for V1 pages. They are available
> for V2 only.
https://www.mail-archive.com/[email protected]/msg11826.html

On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:

> Good point. I had mentally categorized this as V2, not based on the docs?
>
> I don't think most tools write this but I can't see anywhere that it says
> it is limited to v2 readers/writers. I'm not sure how many tools vectorize
> read it versus delegate to the legacy mr path (at least in java), either.
>
> On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]>
> wrote:
>
>>> The big win in v2 pages (if I remember correctly) is that the variable
>>> length encoding is no longer interleaved. That would provide a big
>>> performance lift when pulling into arrow vectors (and variable length
>>> decoding typically dominates total read processing time, on average I've
>>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>>> reads). AFAIK, there is still no option for that in V1.
>>
>>
>> For Delta-length byte the documentation [1] states "This encoding is
>> always preferred over PLAIN for byte array columns." makes it sound like
>> this could be part of V1 or V2. The encoding enum in the Thrift file [2]
>> doesn't seem to document this either.
>>
>> Is there clearer documentation for what encodings are considered part of
>> v1 vs v2?
>>
>> Thanks,
>> Micah
>>
>>
>> [1]
>> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
>> [2]
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
>>
>> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]>
>> wrote:
>>
>>> The big win in v2 pages (if I remember correctly) is that the variable
>>> length encoding is no longer interleaved. That would provide a big
>>> performance lift when pulling into arrow vectors (and variable length
>>> decoding typically dominates total read processing time, on average I've
>>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>>> reads). AFAIK, there is still no option for that in V1.
>>>
>>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> Thanks for the quick reply Ryan.
>>>>
>>>>
>>>> > We only use v1 and it still works well. That said, I'd love to make some
>>>> > progress on better encodings and finalizing v2 so we can use them!
>>>>
>>>>
>>>> Are there JIRAs or other documentation that is tracking this work?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>> > While there isn't anything wrong with it, the same challenges have been
>>>> > solved in different ways with v1 pages. The main difference is that v2
>>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
>>>> > be. But, in order to write page indexes near the footer, breaking pages at
>>>> > record boundaries is required. So you know if you have page indexes, you
>>>> > can actually use them to skip through pages safely. That removes much of
>>>> > the need for v2 pages.
>>>> >
>>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
>>>> > I don't think there is a way to use just the new pages. So you'd possibly
>>>> > end up pulling in other beta features that probably shouldn't be used if
>>>> > you want to stick with what is required for compatibility across
>>>> > implementations.
>>>> >
>>>> > We only use v1 and it still works well. That said, I'd love to make some
>>>> > progress on better encodings and finalizing v2 so we can use them!
>>>> >
>>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]>
>>>> > wrote:
>>>> >
>>>> >> What is the current status of support for Data Page V2? Is it recommended
>>>> >> for production workloads?
>>>> >>
>>>> >> Thanks,
>>>> >> Micah
>>>> >>
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>> >
>>>>
>>>

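For anyone reading the thread later, here is a rough sketch of what "no longer interleaved" could mean in practice, assuming the remark refers to DELTA_LENGTH_BYTE_ARRAY as Micah's reply reads it. PLAIN stores each byte array as a 4-byte little-endian length followed by its bytes, so lengths and data alternate. DELTA_LENGTH_BYTE_ARRAY stores all lengths first and then the concatenated bytes. The function names below are made up for illustration, and the lengths are written as plain uint32 values rather than the DELTA_BINARY_PACKED form the real encoding uses, so this shows the layout only and is not a spec-compliant encoder.

```python
import struct


def encode_plain(values):
    """PLAIN BYTE_ARRAY layout: [len][bytes][len][bytes]... (interleaved)."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # 4-byte little-endian length
        out += v
    return bytes(out)


def decode_plain(buf, count):
    """Each value's offset depends on every length read before it."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_lengths_first(values):
    """Simplified stand-in for DELTA_LENGTH_BYTE_ARRAY: [all lengths][all bytes].

    Assumption: lengths are plain uint32 here; the real encoding packs them
    with DELTA_BINARY_PACKED.
    """
    lengths = struct.pack(f"<{len(values)}I", *(len(v) for v in values))
    return lengths + b"".join(values)


def decode_lengths_first(buf, count):
    """Lengths come out in one contiguous read; the data is one flat buffer."""
    lengths = struct.unpack_from(f"<{count}I", buf, 0)
    data = buf[4 * count:]
    values, pos = [], 0
    for n in lengths:
        values.append(data[pos:pos + n])
        pos += n
    return values


values = [b"parquet", b"", b"v2"]
plain = encode_plain(values)
split = encode_lengths_first(values)
```

In the second layout the lengths block and the data block map fairly directly onto the offsets and values buffers of a variable-length Arrow vector, which may be the performance lift Jacques describes; that link is an inference from the thread, not something measured here.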