It is still not clear to me whether we want to recommend V2 for production use at all or simply introduce the new encodings for V1. I would suggest discussing this topic at the parquet sync next Tuesday.

On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield <[email protected]> wrote:

> I've created https://github.com/apache/parquet-format/pull/163 to try to
> document these (note I really don't have historical context here so please
> review carefully).
>
> I would appreciate it if someone could point me to a reference on what the
> current status of V2 is. What is left unsettled? When can we start
> recommending it for production use?
>
> Thanks,
> Micah
>
> On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield <[email protected]> wrote:
>
> >> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> >> parquet-format releases are also part of it.
> >
> > +1 to the confusion part. The reason why I originally started this thread
> > is that none of this is entirely clear to me from existing documentation.
> >
> > In particular it is confusing to me to say that the V2 Spec is not yet
> > finished when it looks like there have been multiple V2 Format releases.
> >
> > It would be extremely useful to have documentation relating features to:
> > 1. The version of the spec they are part of
> > 2. Their current status in reference implementations
> >
> > Thanks,
> > Micah
> >
> > On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky <[email protected]> wrote:
> >
> >> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> >> parquet-format releases are also part of it.
> >> In this table many features are not related to the pages so I don't think
> >> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
> >> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
> >> released in a 1.x release while 2.0 is not planned yet. (I was not in the
> >> community at that time so I'm only guessing.)
> >>
> >> It is also worth mentioning that this does not seem to be related to the
> >> parquet-format releases, which means that, based on the spec, the
> >> implementations were/are not limited by this table.
> >>
> >> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue <[email protected]> wrote:
> >>
> >> > I remembered that there used to be a table. Looks like it was removed:
> >> >
> >> > https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> >> >
> >> > The table used to list delta as a 2.0 feature.
> >> >
> >> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <[email protected]> wrote:
> >> >
> >> > > That answer I wrote to the other thread was based on the current
> >> > > code. So that is how parquet-mr is working now. It does not
> >> > > necessarily mean that this is how it should work, or how it works
> >> > > in other implementations. Unfortunately, the spec does not say
> >> > > anything about v1 and v2 in the context of encodings.
> >> > > Meanwhile, enabling the "new" encodings in v1 may generate
> >> > > compatibility issues with other implementations. (I am not sure how
> >> > > the existing releases of parquet-mr would behave if they had to read
> >> > > v1 pages with these encodings but I believe it would work fine.)
> >> > >
> >> > > I think it would be a good idea to keep the existing default
> >> > > behavior as is but introduce some new flags where the user may
> >> > > set/suggest encodings for the different columns. This way the user
> >> > > can accept the risk of being potentially incompatible with other
> >> > > implementations (for the time being) and can also fine tune the
> >> > > encodings for the data. This way we can also introduce some new
> >> > > encodings that are better in some cases (e.g. lossy compression for
> >> > > floating point numbers).
> >> > >
> >> > > What do you guys think?
> >> > > (I would be happy to help anyone who would like to contribute on
> >> > > this topic.)
> >> > >
> >> > > Cheers,
> >> > > Gabor
> >> > >
> >> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <[email protected]> wrote:
> >> > >
> >> > > > Gabor seems to agree that delta is V2 only.
> >> > > >
> >> > > > > To summarize, no delta encodings are used for V1 pages. They are
> >> > > > > available for V2 only.
> >> > > >
> >> > > > https://www.mail-archive.com/[email protected]/msg11826.html
> >> > > >
> >> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <[email protected]> wrote:
> >> > > >
> >> > > > > Good point. I had mentally categorized this as V2, not based on
> >> > > > > the docs?
> >> > > > >
> >> > > > > I don't think most tools write this but I can't see anywhere
> >> > > > > that it says it is limited to v2 readers/writers. I'm not sure
> >> > > > > how many tools vectorize reading it versus delegating to the
> >> > > > > legacy mr path (at least in java), either.
> >> > > > >
> >> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <[email protected]> wrote:
> >> > > > >
> >> > > > >>> The big win in v2 pages (if I remember correctly) is that the
> >> > > > >>> variable length encoding is no longer interleaved. That would
> >> > > > >>> provide a big performance lift when pulling into arrow vectors
> >> > > > >>> (and variable length decoding typically dominates total read
> >> > > > >>> processing time, on average I've seen 5-10x per cell cpu cost
> >> > > > >>> increase for variable reads over scalar reads). AFAIK, there
> >> > > > >>> is still no option for that in V1.
> >> > > > >>
> >> > > > >> For Delta-length byte the documentation [1] states "This
> >> > > > >> encoding is always preferred over PLAIN for byte array columns."
> >> > > > >> which makes it sound like this could be part of V1 or V2. The
> >> > > > >> encoding enum in the Thrift file [2] doesn't seem to document
> >> > > > >> this either.
> >> > > > >>
> >> > > > >> Is there clearer documentation for what encodings are considered
> >> > > > >> part of v1 vs v2?
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >> Micah
> >> > > > >>
> >> > > > >> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> >> > > > >> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> >> > > > >>
> >> > > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <[email protected]> wrote:
> >> > > > >>
> >> > > > >>> The big win in v2 pages (if I remember correctly) is that the
> >> > > > >>> variable length encoding is no longer interleaved. That would
> >> > > > >>> provide a big performance lift when pulling into arrow vectors
> >> > > > >>> (and variable length decoding typically dominates total read
> >> > > > >>> processing time, on average I've seen 5-10x per cell cpu cost
> >> > > > >>> increase for variable reads over scalar reads). AFAIK, there
> >> > > > >>> is still no option for that in V1.
> >> > > > >>>
> >> > > > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <[email protected]> wrote:
> >> > > > >>>
> >> > > > >>>> Thanks for the quick reply Ryan.
> >> > > > >>>>
> >> > > > >>>> > We only use v1 and it still works well. That said, I'd love
> >> > > > >>>> > to make some progress on better encodings and finalizing v2
> >> > > > >>>> > so we can use them!
> >> > > > >>>>
> >> > > > >>>> Are there JIRAs or other documentation that is tracking this
> >> > > > >>>> work?
> >> > > > >>>>
> >> > > > >>>> Thanks,
> >> > > > >>>> Micah
> >> > > > >>>>
> >> > > > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <[email protected]> wrote:
> >> > > > >>>>
> >> > > > >>>> > While there isn't anything wrong with it, the same
> >> > > > >>>> > challenges have been solved in different ways with v1 pages.
> >> > > > >>>> > The main difference is that v2 pages are broken at record
> >> > > > >>>> > boundaries, and v1 pages weren't guaranteed to be. But, in
> >> > > > >>>> > order to write page indexes near the footer, breaking pages
> >> > > > >>>> > at record boundaries is required. So you know if you have
> >> > > > >>>> > page indexes, you can actually use them to skip through
> >> > > > >>>> > pages safely. That removes much of the need for v2 pages.
> >> > > > >>>> >
> >> > > > >>>> > The main drawback to using v2 pages is that the v2 spec is
> >> > > > >>>> > unfinished, and I don't think there is a way to use just the
> >> > > > >>>> > new pages. So you'd possibly end up pulling in other beta
> >> > > > >>>> > features that probably shouldn't be used if you want to
> >> > > > >>>> > stick with what is required for compatibility across
> >> > > > >>>> > implementations.
> >> > > > >>>> >
> >> > > > >>>> > We only use v1 and it still works well. That said, I'd love
> >> > > > >>>> > to make some progress on better encodings and finalizing v2
> >> > > > >>>> > so we can use them!
> >> > > > >>>> >
> >> > > > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <[email protected]> wrote:
> >> > > > >>>> >
> >> > > > >>>> >> What is the current status of support for Data Page V2? Is
> >> > > > >>>> >> it recommended for production workloads?
> >> > > > >>>> >>
> >> > > > >>>> >> Thanks,
> >> > > > >>>> >> Micah
> >> > > > >>>> >
> >> > > > >>>> > --
> >> > > > >>>> > Ryan Blue
> >> > > > >>>> > Software Engineer
> >> > > > >>>> > Netflix
> >> >
> >> > --
> >> > Ryan Blue
> >> > Software Engineer
> >> > Netflix
