I remembered that there used to be a table. Looks like it was removed: https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
The table used to list delta as a 2.0 feature.

On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <[email protected]> wrote:

> That answer I wrote to the other thread was based on the current code. So
> that is how parquet-mr is working now. It does not mean though how shall it
> work or how it works in other implementations. Unfortunately, the spec does
> not say anything about v1 and v2 in the context of encodings.
> Meanwhile, enabling the "new" encodings in v1 may generate compatibility
> issues with other implementations. (I am not sure how would the existing
> releases of parquet-mr behave if they have to read v1 pages with these
> encodings but I believe it would work fine.)
>
> I think, it would be a good idea to keep the existing default behavior as
> is but introduce some new flags where the user may set/suggest encodings
> for the different columns. This way the user can hold the risk of being
> potentially incompatible with other implementations (for the time being)
> and also can fine tune the encodings for the data. This way we can also
> introduce some new encodings that are better in some cases (e.g. lossy
> compression for floating point numbers).
>
> What do you guys think?
> (I would be happy to help anyone who would like to contribute in this topic.)
>
> Cheers,
> Gabor
>
> On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <[email protected]> wrote:
>
> > Gabor seems to agree that delta is V2 only.
> >
> > > To summarize, no delta encodings are used for V1 pages. They are
> > > available for V2 only.
> > >
> > > https://www.mail-archive.com/[email protected]/msg11826.html
> >
> > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:
> >
> > > Good point. I had mentally categorized this as V2, not based on the docs?
> > >
> > > I don't think most tools write this but I can't see anywhere that it says
> > > it is limited to v2 readers/writers.
> > > I'm not sure how many tools vectorize
> > > read it versus delegate to the legacy mr path (at least in java), either.
> > >
> > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > >>> The big win in v2 pages (if I remember correctly) is that the variable
> > >>> length encoding is no longer interleaved. That would provide a big
> > >>> performance lift when pulling into arrow vectors (and variable length
> > >>> decoding typically dominates total read processing time, on average I've
> > >>> seen 5-10x per cell cpu cost increase for variable reads over scalar
> > >>> reads). AFAIK, there is still no option for that in V1.
> > >>
> > >> For Delta-length byte the documentation [1] states "This encoding is
> > >> always preferred over PLAIN for byte array columns." makes it sound like
> > >> this could be part of V1 or V2. The encoding enum in the Thrift file [2]
> > >> doesn't seem to document this either.
> > >>
> > >> Is there clearer documentation for what encodings are considered part of
> > >> v1 vs v2?
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1]
> > >> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > >> [2]
> > >> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> > >>
> > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]>
> > >> wrote:
> > >>
> > >>> The big win in v2 pages (if I remember correctly) is that the variable
> > >>> length encoding is no longer interleaved. That would provide a big
> > >>> performance lift when pulling into arrow vectors (and variable length
> > >>> decoding typically dominates total read processing time, on average I've
> > >>> seen 5-10x per cell cpu cost increase for variable reads over scalar
> > >>> reads). AFAIK, there is still no option for that in V1.
> > >>>
> > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> Thanks for the quick reply Ryan.
> > >>>>
> > >>>> > We only use v1 and it still works well. That said, I'd love to make some
> > >>>> > progress on better encodings and finalizing v2 so we can use them!
> > >>>>
> > >>>> Are there JIRAs or other documentation that is tracking this work?
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
> > >>>>
> > >>>> > While there isn't anything wrong with it, the same challenges have been
> > >>>> > solved in different ways with v1 pages. The main difference is that v2
> > >>>> > pages are broken at record boundaries, and v1 pages weren't guaranteed to
> > >>>> > be. But, in order to write page indexes near the footer, breaking pages at
> > >>>> > record boundaries is required. So you know if you have page indexes, you
> > >>>> > can actually use them to skip through pages safely. That removes much of
> > >>>> > the need for v2 pages.
> > >>>> >
> > >>>> > The main drawback to using v2 pages is that the v2 spec is unfinished, and
> > >>>> > I don't think there is a way to use just the new pages. So you'd possibly
> > >>>> > end up pulling in other beta features that probably shouldn't be used if
> > >>>> > you want to stick with what is required for compatibility across
> > >>>> > implementations.
> > >>>> >
> > >>>> > We only use v1 and it still works well. That said, I'd love to make some
> > >>>> > progress on better encodings and finalizing v2 so we can use them!
> > >>>> >
> > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]>
> > >>>> > wrote:
> > >>>> >
> > >>>> >> What is the current status of support for Data Page V2?
> > >>>> >> Is it recommended
> > >>>> >> for production workloads?
> > >>>> >>
> > >>>> >> Thanks,
> > >>>> >> Micah
> > >>>> >>
> > >>>> >
> > >>>> > --
> > >>>> > Ryan Blue
> > >>>> > Software Engineer
> > >>>> > Netflix
> > >>>> >
> > >>>>
> > >>>
> >
>

--
Ryan Blue
Software Engineer
Netflix
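
For anyone following the "no longer interleaved" point quoted above, here is a minimal sketch of the two layouts for variable-length values. It is an illustration only, not parquet-mr code: the function names are hypothetical, and the split layout stores the value count and lengths as fixed-width integers, whereas DELTA_LENGTH_BYTE_ARRAY in the format spec encodes the lengths with DELTA_BINARY_PACKED.

```python
import struct


def encode_interleaved(values):
    """PLAIN-style layout: a 4-byte little-endian length before each value."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)


def decode_interleaved(buf):
    """Each length must be read before the next value can be located."""
    values, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def encode_split(values):
    """Delta-length-style layout: all lengths first, then the concatenated data.

    Simplification (assumption): the count and lengths are fixed-width ints
    here, not DELTA_BINARY_PACKED as in the real encoding.
    """
    lengths = [len(v) for v in values]
    header = struct.pack("<I", len(values))
    header += struct.pack(f"<{len(lengths)}I", *lengths)
    return header + b"".join(values)


def decode_split(buf):
    """Lengths are read in one pass; the data stays one contiguous buffer."""
    (count,) = struct.unpack_from("<I", buf, 0)
    lengths = struct.unpack_from(f"<{count}I", buf, 4)
    data = memoryview(buf)[4 + 4 * count:]
    # Running sum of lengths gives the start/end offset of every value.
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return [bytes(data[offsets[i]:offsets[i + 1]]) for i in range(count)]
```

In the split layout the decoder ends up with an offsets list plus one contiguous data buffer, which is the same shape Arrow uses for variable-length binary columns, so the copy into a vector does not need a per-value length read in the middle of the data.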
