Good point. I had mentally filed this encoding under V2, but that wasn't based on the docs. I don't think most tools write it, but I can't find anything that says it is limited to v2 readers/writers. I'm also not sure how many tools do a vectorized read of it versus delegating to the legacy MR path (at least in Java).
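
To make the interleaving point concrete, here is a rough Python sketch (not taken from any Parquet implementation; all function names are made up for illustration). PLAIN byte arrays store a 4-byte little-endian length in front of every value, so a reader has to walk the buffer value by value. DELTA_LENGTH_BYTE_ARRAY keeps all the lengths together and then the concatenated bytes. The real encoding stores the lengths with DELTA_BINARY_PACKED; the sketch uses a plain Python list instead, which is a simplification.

```python
import struct


def encode_plain(values):
    """PLAIN BYTE_ARRAY layout: 4-byte little-endian length, then the bytes, per value."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)


def decode_plain(buf, count):
    """Lengths and data are interleaved, so each value must be located in turn."""
    values, pos = [], 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def split_lengths_and_data(values):
    """Non-interleaved layout in the spirit of DELTA_LENGTH_BYTE_ARRAY.

    Simplification: lengths are returned as a list rather than
    DELTA_BINARY_PACKED bytes.
    """
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def lengths_to_offsets(lengths):
    """Prefix sum of lengths, i.e. the offsets buffer of an Arrow-style binary array."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


values = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
lengths, data = split_lengths_and_data(values)
offsets = lengths_to_offsets(lengths)
```

With the second layout the data buffer can be handed over as-is and the offsets come from one prefix sum over the lengths, which is roughly the shape an Arrow variable-length vector wants.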

On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]> wrote:

>> The big win in v2 pages (if I remember correctly) is that the variable
>> length encoding is no longer interleaved. That would provide a big
>> performance lift when pulling into arrow vectors (and variable length
>> decoding typically dominates total read processing time, on average I've
>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>> reads). AFAIK, there is still no option for that in V1.
>
>
> For Delta-length byte the documentation [1] states "This encoding is
> always preferred over PLAIN for byte array columns." makes it sound like
> this could be part of V1 or V2. The encoding enum in the Thrift file [2]
> doesn't seem to document this either.
>
> Is there clearer documentation for what encodings are considered part of
> v1 vs v2?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> [2]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
>
> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]> wrote:
>
>> The big win in v2 pages (if I remember correctly) is that the variable
>> length encoding is no longer interleaved. That would provide a big
>> performance lift when pulling into arrow vectors (and variable length
>> decoding typically dominates total read processing time, on average I've
>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>> reads). AFAIK, there is still no option for that in V1.
>>
>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]> wrote:
>>
>>> Thanks for the quick reply Ryan.
>>>
>>>
>>> > We only use v1 and it still works well. That said, I'd love to make some
>>> > progress on better encodings and finalizing v2 so we can use them!
>>>
>>>
>>> Are there JIRAs or other documentation that is tracking this work?
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
>>>
>>> > While there isn't anything wrong with it, the same challenges have been
>>> > solved in different ways with v1 pages. The main difference is that v2
>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
>>> > be. But, in order to write page indexes near the footer, breaking pages at
>>> > record boundaries is required. So you know if you have page indexes, you
>>> > can actually use them to skip through pages safely. That removes much of
>>> > the need for v2 pages.
>>> >
>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
>>> > I don't think there is a way to use just the new pages. So you'd possibly
>>> > end up pulling in other beta features that probably shouldn't be used if
>>> > you want to stick with what is required for compatibility across
>>> > implementations.
>>> >
>>> > We only use v1 and it still works well. That said, I'd love to make some
>>> > progress on better encodings and finalizing v2 so we can use them!
>>> >
>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]> wrote:
>>> >
>>> >> What is the current status of support for Data Page V2? Is it recommended
>>> >> for production workloads?
>>> >>
>>> >> Thanks,
>>> >> Micah
>>> >>
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>> >
>>>
>>
