Re: Current status of Data Page V2?

2020-10-26 Thread Gabor Szadovszky
Hi Micah,

V2 pages are not only about the new encodings but a couple of other things
the community thought would be better than V1.
One of these improvements was to break pages at row boundaries. This one is
outdated because during the development of column-indexes we had to
implement the same for V1 pages.
Another improvement is to store the repetition and definition levels
separately so they can be read without decompressing data. I don't know if
we still think it has significant benefits but we never implemented
anything that makes use of it.
The default encodings also differ between V1 and V2 pages, but this is not
properly specified. Also, the encodings are not tied to V1 or V2 by design;
it is only a matter of specification.
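
As a rough illustration of the level/data separation, here is a toy sketch
in Python. It is not the real Parquet page layout: the function names and the
dict-based "page" are made up for this example, and zlib only stands in for
whatever codec is configured. The point is just that in the V2-style layout
the levels can be sliced out without touching the compressed values.

```python
import zlib


def build_v1_page(rep_levels, def_levels, values):
    # V1-style (simplified): levels and values are compressed as one blob,
    # so reading the levels requires decompressing everything.
    return zlib.compress(rep_levels + def_levels + values)


def build_v2_page(rep_levels, def_levels, values):
    # V2-style (simplified): the levels stay uncompressed and their byte
    # lengths are recorded; only the values section is compressed.
    return {
        "rep_len": len(rep_levels),
        "def_len": len(def_levels),
        "body": rep_levels + def_levels + zlib.compress(values),
    }


def read_v2_levels(page):
    # No decompression needed: slice the levels using the recorded lengths.
    r, d = page["rep_len"], page["def_len"]
    body = page["body"]
    return body[:r], body[r:r + d]


def read_v2_values(page):
    # Only the values section has to be decompressed.
    offset = page["rep_len"] + page["def_len"]
    return zlib.decompress(page["body"][offset:])
```

A reader that only needs the levels (e.g. to count rows or nulls) would call
read_v2_levels() and skip read_v2_values() entirely.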

I don't think that anything is blocking V2 pages from being introduced in
production (I believe there are a couple of users who already use V2 pages
in production) but I think it isn't worth the effort. As far as I know
parquet-mr is the only implementation that supports V2 pages.

I agree it is complicated to introduce new encodings and keep backward
compatibility but this is how compression codecs work currently. Supporting
a compression codec is not only about supporting a specific parquet-format
release or having a specific parquet-mr release; you also have to have the
codec implementation available. We might handle the encodings similarly:
there are default ones that are guaranteed to work and there are others that
might not be supported by older readers.
But this is only an example. We might want to introduce some version levels
or compatibility groups. What I want to say is that there is no reason to tie
the new codecs to V2 pages. We should handle the introduction of the new codecs
(and handling their compatibility) and the V2 pages separately.
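
To make the "default ones plus optional ones" idea a bit more concrete, here
is a minimal sketch in Python. The encoding names are the ones from the
format, but the split into a guaranteed baseline set is purely hypothetical
(nothing like this is specified today), and unsupported_encodings() is a
made-up helper, not an existing API.

```python
# Hypothetical baseline: encodings every reader would be required to support.
BASELINE_ENCODINGS = frozenset({"PLAIN", "RLE", "PLAIN_DICTIONARY"})


def unsupported_encodings(file_encodings, reader_optional=frozenset()):
    """Return the encodings used by a file that this reader cannot decode.

    file_encodings: names of the encodings found in the file's column chunks.
    reader_optional: optional encodings this particular reader implements
    on top of the baseline.
    """
    supported = BASELINE_ENCODINGS | frozenset(reader_optional)
    return sorted(set(file_encodings) - supported)
```

An older reader would get a non-empty list for a file that uses e.g.
DELTA_BINARY_PACKED and could fail with a clear message, the same way it
would for a compression codec it does not have available.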

It would be nice to hear other opinions on this topic.

Regards,
Gabor

On Thu, Oct 22, 2020 at 5:54 PM Micah Kornfield 
wrote:

> Hi Gabor,
>
> > It is still not clear to me if we want to recommend V2 for production use
> > at all
>
> Again, I'm missing context here, but what is blocking V2 for production
> use?  Is it specification finalization, implementation finalization?
> Something else?
>
> or simply introduce the new encodings for V1.
>
>
> This doesn't seem simple to me.  It will break V1 readers that don't
> support these encodings. It also sounds like this is taking the frame or
> reference of parquet-mr, and not Parquet as a specification.  My main goal
> in this thread is to try to capture information, so we can get better
> specification docs (and to potentially help drive consensus on V2).
>
> There was another recent e-mail thread [1] that also pointed out our gap in
> explaining how versioning works and what is expected to be included in
> compliant reader/writers (if there are docs like this I apologize if I
> missed them and would appreciate pointers).
>
>  I would suggest
> > discussing this topic on the parquet sync next Tuesday.
>
>
> I will try to make it but I think I might have a conflict.  In this case I
> think continuing the email thread would be useful, so we have a good
> historical snapshot of the discussion (a lot of context gets lost in sync
> notes in my experience).
>
> [1]
>
> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CMWHPR03MB31842468BE51022392B7F2D1CE020%40MWHPR03MB3184.namprd03.prod.outlook.com%3E
>
>
> On Thursday, October 22, 2020, Gabor Szadovszky  wrote:
>
> > It is still not clear to me if we want to recommend V2 for production use
> > at all or simply introduce the new encodings for V1. I would suggest
> > discussing this topic on the parquet sync next Tuesday.
> >
> > On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield 
> > wrote:
> >
> > > I've created https://github.com/apache/parquet-format/pull/163 to try
> to
> > > document these (note I really don't have historical context here so
> > please
> > > review carefully).
> > >
> > > I would appreciate it if someone could point me to a reference on what
> > the
> > > current status of V2 is?  What is left unsettled? When can we start
> > > recommending it for production use?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield  >
> > > wrote:
> > >
> > > > I am not sure 2.0 means the v2 pages here. I think there was/is a bit
> > of
> > > >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe
> > the
> > > >> parquet-format releases are also part of it.
> > > >
> > > >
> > > > +1 to the confusion part.  The reason why I originally started this
> > > thread
> > > > is that none of this is entirely clear to me from existing
> > documentation.
> > > >
> > > > In particular it is confusing to me to say that the V2 Spec is not
> yet
> > > > finished when it looks like there have been multiple V2 Format
> > releases.
> > > >
> > > > It would be extremely useful to have documentation relating features
> > to:
> > > > 1.  The version of the spec they are part of
> > > > 2.  There current 

Re: Current status of Data Page V2?

2020-10-22 Thread Micah Kornfield
Hi Gabor,

> It is still not clear to me if we want to recommend V2 for production use
> at all

Again, I'm missing context here, but what is blocking V2 for production
use?  Is it specification finalization, implementation finalization?
Something else?

> or simply introduce the new encodings for V1.


This doesn't seem simple to me.  It will break V1 readers that don't
support these encodings. It also sounds like this is taking the frame of
reference of parquet-mr, and not Parquet as a specification.  My main goal
in this thread is to try to capture information, so we can get better
specification docs (and to potentially help drive consensus on V2).

There was another recent e-mail thread [1] that also pointed out our gap in
explaining how versioning works and what is expected to be included in
compliant reader/writers (if there are docs like this I apologize if I
missed them and would appreciate pointers).

> I would suggest
> discussing this topic on the parquet sync next Tuesday.


I will try to make it but I think I might have a conflict.  In this case I
think continuing the email thread would be useful, so we have a good
historical snapshot of the discussion (a lot of context gets lost in sync
notes in my experience).

[1]
https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CMWHPR03MB31842468BE51022392B7F2D1CE020%40MWHPR03MB3184.namprd03.prod.outlook.com%3E


On Thursday, October 22, 2020, Gabor Szadovszky  wrote:

> It is still not clear to me if we want to recommend V2 for production use
> at all or simply introduce the new encodings for V1. I would suggest
> discussing this topic on the parquet sync next Tuesday.
>
> On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield 
> wrote:
>
> > I've created https://github.com/apache/parquet-format/pull/163 to try to
> > document these (note I really don't have historical context here so
> please
> > review carefully).
> >
> > I would appreciate it if someone could point me to a reference on what
> the
> > current status of V2 is?  What is left unsettled? When can we start
> > recommending it for production use?
> >
> > Thanks,
> > Micah
> >
> > On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield 
> > wrote:
> >
> > > I am not sure 2.0 means the v2 pages here. I think there was/is a bit
> of
> > >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe
> the
> > >> parquet-format releases are also part of it.
> > >
> > >
> > > +1 to the confusion part.  The reason why I originally started this
> > thread
> > > is that none of this is entirely clear to me from existing
> documentation.
> > >
> > > In particular it is confusing to me to say that the V2 Spec is not yet
> > > finished when it looks like there have been multiple V2 Format
> releases.
> > >
> > > It would be extremely useful to have documentation relating features
> to:
> > > 1.  The version of the spec they are part of
> > > 2.  There current status in reference implementations
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky 
> > wrote:
> > >
> > >> I am not sure 2.0 means the v2 pages here. I think there was/is a bit
> of
> > >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe
> the
> > >> parquet-format releases are also part of it.
> > >> In this table many features are not related to the pages so I don't
> > think
> > >> the "Expected release" meant the v1/v2 pages. I guess there was an
> > earlier
> > >> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages
> were
> > >> released in a 1.x release while 2.0 is not planned yet. (I was not in
> > the
> > >> community that time so I'm only guessing.)
> > >>
> > >> Also worth to mention that it seems to be not related to the
> > >> parquet-format
> > >> releases which means that based on the spec the implementations
> were/are
> > >> not limited by this table.
> > >>
> > >>
> > >> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue 
> > >> wrote:
> > >>
> > >> > I remembered that there used to be a table. Looks like it was
> removed:
> > >> >
> > >> >
> > >>
> >
> https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> > >> >
> > >> > The table used to list delta as a 2.0 feature.
> > >> >
> > >> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky 
> > >> wrote:
> > >> >
> > >> > > That answer I wrote to the other thread was based on the current
> > >> code. So
> > >> > > that is how parquet-mr is working now. It does not mean though how
> > >> shall
> > >> > it
> > >> > > work or how it works in other implementations. Unfortunately, the
> > spec
> > >> > does
> > >> > > not say anything about v1 and v2 in the context of encodings.
> > >> > > Meanwhile, enabling the "new" encodings in v1 may generate
> > >> compatibility
> > >> > > issues with other implementations. (I am not sure how would the
> > >> existing
> > >> > > releases of parquet-mr behave if they have to 

Re: Current status of Data Page V2?

2020-10-22 Thread Gabor Szadovszky
It is still not clear to me if we want to recommend V2 for production use
at all or simply introduce the new encodings for V1. I would suggest
discussing this topic on the parquet sync next Tuesday.

On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield 
wrote:

> I've created https://github.com/apache/parquet-format/pull/163 to try to
> document these (note I really don't have historical context here so please
> review carefully).
>
> I would appreciate it if someone could point me to a reference on what the
> current status of V2 is?  What is left unsettled? When can we start
> recommending it for production use?
>
> Thanks,
> Micah
>
> On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield 
> wrote:
>
> > I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> >> parquet-format releases are also part of it.
> >
> >
> > +1 to the confusion part.  The reason why I originally started this
> thread
> > is that none of this is entirely clear to me from existing documentation.
> >
> > In particular it is confusing to me to say that the V2 Spec is not yet
> > finished when it looks like there have been multiple V2 Format releases.
> >
> > It would be extremely useful to have documentation relating features to:
> > 1.  The version of the spec they are part of
> > 2.  There current status in reference implementations
> >
> > Thanks,
> > Micah
> >
> >
> > On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky 
> wrote:
> >
> >> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> >> parquet-format releases are also part of it.
> >> In this table many features are not related to the pages so I don't
> think
> >> the "Expected release" meant the v1/v2 pages. I guess there was an
> earlier
> >> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
> >> released in a 1.x release while 2.0 is not planned yet. (I was not in
> the
> >> community that time so I'm only guessing.)
> >>
> >> Also worth to mention that it seems to be not related to the
> >> parquet-format
> >> releases which means that based on the spec the implementations were/are
> >> not limited by this table.
> >>
> >>
> >> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue 
> >> wrote:
> >>
> >> > I remembered that there used to be a table. Looks like it was removed:
> >> >
> >> >
> >>
> https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> >> >
> >> > The table used to list delta as a 2.0 feature.
> >> >
> >> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky 
> >> wrote:
> >> >
> >> > > That answer I wrote to the other thread was based on the current
> >> code. So
> >> > > that is how parquet-mr is working now. It does not mean though how
> >> shall
> >> > it
> >> > > work or how it works in other implementations. Unfortunately, the
> spec
> >> > does
> >> > > not say anything about v1 and v2 in the context of encodings.
> >> > > Meanwhile, enabling the "new" encodings in v1 may generate
> >> compatibility
> >> > > issues with other implementations. (I am not sure how would the
> >> existing
> >> > > releases of parquet-mr behave if they have to read v1 pages with
> these
> >> > > encodings but I believe it would work fine.)
> >> > >
> >> > > I think, it would be a good idea to keep the existing default
> >> behavior as
> >> > > is but introduce some new flags where the user may set/suggest
> >> encodings
> >> > > for the different columns. This way the user can hold the risk of
> >> being
> >> > > potentially incompatible with other implementations (for the time
> >> being)
> >> > > and also can fine tune the encodings for the data. This way we can
> >> also
> >> > > introduce some new encodings that are better in some cases (e.g.
> lossy
> >> > > compression for floating point numbers).
> >> > >
> >> > > What do you guys thing?
> >> > > (I would be happy to help anyone would like to contribute in this
> >> topic.)
> >> > >
> >> > > Cheers,
> >> > > Gabor
> >> > >
> >> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau 
> >> > wrote:
> >> > >
> >> > > > Gabor seems to agree that delta is V2 only.
> >> > > >
> >> > > > To summarize, no delta encodings are used for V1 pages. They are
> >> > > available
> >> > > > > for V2 only.
> >> > > >
> >> > > >
> >> > > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau  >
> >> > > wrote:
> >> > > >
> >> > > > > Good point. I had mentally categorized this as V2, not based on
> >> the
> >> > > docs?
> >> > > > >
> >> > > > > I don't think most tools write this but I can't see anywhere
> that
> >> it
> >> > > says
> >> > > > > it is limited to v2 readers/writers. I'm not sure how many tools
> >> > > > vectorize
> >> > > > > read it versus 

Re: Current status of Data Page V2?

2020-10-21 Thread Micah Kornfield
I've created https://github.com/apache/parquet-format/pull/163 to try to
document these (note I really don't have historical context here so please
review carefully).

I would appreciate it if someone could point me to a reference on what the
current status of V2 is?  What is left unsettled? When can we start
recommending it for production use?

Thanks,
Micah

On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield 
wrote:

> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
>> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
>> parquet-format releases are also part of it.
>
>
> +1 to the confusion part.  The reason why I originally started this thread
> is that none of this is entirely clear to me from existing documentation.
>
> In particular it is confusing to me to say that the V2 Spec is not yet
> finished when it looks like there have been multiple V2 Format releases.
>
> It would be extremely useful to have documentation relating features to:
> 1.  The version of the spec they are part of
> 2.  There current status in reference implementations
>
> Thanks,
> Micah
>
>
> On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky  wrote:
>
>> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
>> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
>> parquet-format releases are also part of it.
>> In this table many features are not related to the pages so I don't think
>> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
>> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
>> released in a 1.x release while 2.0 is not planned yet. (I was not in the
>> community that time so I'm only guessing.)
>>
>> Also worth to mention that it seems to be not related to the
>> parquet-format
>> releases which means that based on the spec the implementations were/are
>> not limited by this table.
>>
>>
>> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue 
>> wrote:
>>
>> > I remembered that there used to be a table. Looks like it was removed:
>> >
>> >
>> https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
>> >
>> > The table used to list delta as a 2.0 feature.
>> >
>> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky 
>> wrote:
>> >
>> > > That answer I wrote to the other thread was based on the current
>> code. So
>> > > that is how parquet-mr is working now. It does not mean though how
>> shall
>> > it
>> > > work or how it works in other implementations. Unfortunately, the spec
>> > does
>> > > not say anything about v1 and v2 in the context of encodings.
>> > > Meanwhile, enabling the "new" encodings in v1 may generate
>> compatibility
>> > > issues with other implementations. (I am not sure how would the
>> existing
>> > > releases of parquet-mr behave if they have to read v1 pages with these
>> > > encodings but I believe it would work fine.)
>> > >
>> > > I think, it would be a good idea to keep the existing default
>> behavior as
>> > > is but introduce some new flags where the user may set/suggest
>> encodings
>> > > for the different columns. This way the user can hold the risk of
>> being
>> > > potentially incompatible with other implementations (for the time
>> being)
>> > > and also can fine tune the encodings for the data. This way we can
>> also
>> > > introduce some new encodings that are better in some cases (e.g. lossy
>> > > compression for floating point numbers).
>> > >
>> > > What do you guys thing?
>> > > (I would be happy to help anyone would like to contribute in this
>> topic.)
>> > >
>> > > Cheers,
>> > > Gabor
>> > >
>> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau 
>> > wrote:
>> > >
>> > > > Gabor seems to agree that delta is V2 only.
>> > > >
>> > > > To summarize, no delta encodings are used for V1 pages. They are
>> > > available
>> > > > > for V2 only.
>> > > >
>> > > >
>> > > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau 
>> > > wrote:
>> > > >
>> > > > > Good point. I had mentally categorized this as V2, not based on
>> the
>> > > docs?
>> > > > >
>> > > > > I don't think most tools write this but I can't see anywhere that
>> it
>> > > says
>> > > > > it is limited to v2 readers/writers. I'm not sure how many tools
>> > > > vectorize
>> > > > > read it versus delegate to the legacy mr path (at least in java),
>> > > either.
>> > > > >
>> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <
>> > emkornfi...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > >> The big win in v2 pages (if I remember correctly) is that the
>> > variable
>> > > > >>> length encoding is no longer interleaved. That would provide a
>> big
>> > > > >>> performance lift when pulling into arrow vectors (and variable
>> > length
>> > > > >>> decoding typically dominates total read processing time, 

Re: Current status of Data Page V2?

2020-10-13 Thread Micah Kornfield
>
> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> parquet-format releases are also part of it.


+1 to the confusion part.  The reason why I originally started this thread
is that none of this is entirely clear to me from existing documentation.

In particular it is confusing to me to say that the V2 Spec is not yet
finished when it looks like there have been multiple V2 Format releases.

It would be extremely useful to have documentation relating features to:
1.  The version of the spec they are part of
2.  Their current status in reference implementations

Thanks,
Micah


On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky  wrote:

> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
> parquet-format releases are also part of it.
> In this table many features are not related to the pages so I don't think
> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
> released in a 1.x release while 2.0 is not planned yet. (I was not in the
> community that time so I'm only guessing.)
>
> Also worth to mention that it seems to be not related to the parquet-format
> releases which means that based on the spec the implementations were/are
> not limited by this table.
>
>
> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue 
> wrote:
>
> > I remembered that there used to be a table. Looks like it was removed:
> >
> >
> https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
> >
> > The table used to list delta as a 2.0 feature.
> >
> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky 
> wrote:
> >
> > > That answer I wrote to the other thread was based on the current code.
> So
> > > that is how parquet-mr is working now. It does not mean though how
> shall
> > it
> > > work or how it works in other implementations. Unfortunately, the spec
> > does
> > > not say anything about v1 and v2 in the context of encodings.
> > > Meanwhile, enabling the "new" encodings in v1 may generate
> compatibility
> > > issues with other implementations. (I am not sure how would the
> existing
> > > releases of parquet-mr behave if they have to read v1 pages with these
> > > encodings but I believe it would work fine.)
> > >
> > > I think, it would be a good idea to keep the existing default behavior
> as
> > > is but introduce some new flags where the user may set/suggest
> encodings
> > > for the different columns. This way the user can hold the risk of being
> > > potentially incompatible with other implementations (for the time
> being)
> > > and also can fine tune the encodings for the data. This way we can also
> > > introduce some new encodings that are better in some cases (e.g. lossy
> > > compression for floating point numbers).
> > >
> > > What do you guys thing?
> > > (I would be happy to help anyone would like to contribute in this
> topic.)
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau 
> > wrote:
> > >
> > > > Gabor seems to agree that delta is V2 only.
> > > >
> > > > To summarize, no delta encodings are used for V1 pages. They are
> > > available
> > > > > for V2 only.
> > > >
> > > >
> > > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
> > > >
> > > >
> > > >
> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau 
> > > wrote:
> > > >
> > > > > Good point. I had mentally categorized this as V2, not based on the
> > > docs?
> > > > >
> > > > > I don't think most tools write this but I can't see anywhere that
> it
> > > says
> > > > > it is limited to v2 readers/writers. I'm not sure how many tools
> > > > vectorize
> > > > > read it versus delegate to the legacy mr path (at least in java),
> > > either.
> > > > >
> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> The big win in v2 pages (if I remember correctly) is that the
> > variable
> > > > >>> length encoding is no longer interleaved. That would provide a
> big
> > > > >>> performance lift when pulling into arrow vectors (and variable
> > length
> > > > >>> decoding typically dominates total read processing time, on
> average
> > > > I've
> > > > >>> seen 5-10x per cell cpu cost increase for variable reads over
> > scalar
> > > > >>> reads). AFAIK, there is still no option for that in V1.
> > > > >>
> > > > >>
> > > > >> For Delta-length byte the documentation [1]   states "This
> encoding
> > is
> > > > >> always preferred over PLAIN for byte array columns." makes it
> sound
> > > like
> > > > >> this could be part of V1 or V2.   The encoding enum in the Thrift
> > file
> > > > [2]
> > > > >> doesn't seem to document this either.
> > > > >>
> > > > >> Is there 

Re: Current status of Data Page V2?

2020-10-13 Thread Gabor Szadovszky
I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
parquet-format releases are also part of it.
In this table many features are not related to the pages so I don't think
the "Expected release" meant the v1/v2 pages. I guess there was an earlier
plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
released in a 1.x release while 2.0 is not planned yet. (I was not in the
community at that time so I'm only guessing.)

It is also worth mentioning that the table does not seem to be related to the
parquet-format releases, which means that, based on the spec, the
implementations were/are not limited by this table.


On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue  wrote:

> I remembered that there used to be a table. Looks like it was removed:
>
> https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
>
> The table used to list delta as a 2.0 feature.
>
> On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky  wrote:
>
> > That answer I wrote to the other thread was based on the current code. So
> > that is how parquet-mr is working now. It does not mean though how shall
> it
> > work or how it works in other implementations. Unfortunately, the spec
> does
> > not say anything about v1 and v2 in the context of encodings.
> > Meanwhile, enabling the "new" encodings in v1 may generate compatibility
> > issues with other implementations. (I am not sure how would the existing
> > releases of parquet-mr behave if they have to read v1 pages with these
> > encodings but I believe it would work fine.)
> >
> > I think, it would be a good idea to keep the existing default behavior as
> > is but introduce some new flags where the user may set/suggest encodings
> > for the different columns. This way the user can hold the risk of being
> > potentially incompatible with other implementations (for the time being)
> > and also can fine tune the encodings for the data. This way we can also
> > introduce some new encodings that are better in some cases (e.g. lossy
> > compression for floating point numbers).
> >
> > What do you guys thing?
> > (I would be happy to help anyone would like to contribute in this topic.)
> >
> > Cheers,
> > Gabor
> >
> > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau 
> wrote:
> >
> > > Gabor seems to agree that delta is V2 only.
> > >
> > > To summarize, no delta encodings are used for V1 pages. They are
> > available
> > > > for V2 only.
> > >
> > >
> > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
> > >
> > >
> > >
> > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau 
> > wrote:
> > >
> > > > Good point. I had mentally categorized this as V2, not based on the
> > docs?
> > > >
> > > > I don't think most tools write this but I can't see anywhere that it
> > says
> > > > it is limited to v2 readers/writers. I'm not sure how many tools
> > > vectorize
> > > > read it versus delegate to the legacy mr path (at least in java),
> > either.
> > > >
> > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > >
> > > >> The big win in v2 pages (if I remember correctly) is that the
> variable
> > > >>> length encoding is no longer interleaved. That would provide a big
> > > >>> performance lift when pulling into arrow vectors (and variable
> length
> > > >>> decoding typically dominates total read processing time, on average
> > > I've
> > > >>> seen 5-10x per cell cpu cost increase for variable reads over
> scalar
> > > >>> reads). AFAIK, there is still no option for that in V1.
> > > >>
> > > >>
> > > >> For Delta-length byte the documentation [1]   states "This encoding
> is
> > > >> always preferred over PLAIN for byte array columns." makes it sound
> > like
> > > >> this could be part of V1 or V2.   The encoding enum in the Thrift
> file
> > > [2]
> > > >> doesn't seem to document this either.
> > > >>
> > > >> Is there clearer documentation for what encodings are considered
> part
> > of
> > > >> v1 vs v2?
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >>
> > > >> [1]
> > > >>
> > >
> >
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > >> [2]
> > > >>
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
> > > >>
> > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau 
> > > >> wrote:
> > > >>
> > > >>> The big win in v2 pages (if I remember correctly) is that the
> > variable
> > > >>> length encoding is no longer interleaved. That would provide a big
> > > >>> performance lift when pulling into arrow vectors (and variable
> length
> > > >>> decoding typically dominates total read processing time, on average
> > > I've
> > > >>> seen 5-10x per cell cpu cost increase for variable reads over
> scalar
> > > >>> reads). AFAIK, there is still no option 

Re: Current status of Data Page V2?

2020-10-12 Thread Ryan Blue
I remembered that there used to be a table. Looks like it was removed:
https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8

The table used to list delta as a 2.0 feature.

On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky  wrote:

> That answer I wrote to the other thread was based on the current code. So
> that is how parquet-mr is working now. It does not mean though how shall it
> work or how it works in other implementations. Unfortunately, the spec does
> not say anything about v1 and v2 in the context of encodings.
> Meanwhile, enabling the "new" encodings in v1 may generate compatibility
> issues with other implementations. (I am not sure how would the existing
> releases of parquet-mr behave if they have to read v1 pages with these
> encodings but I believe it would work fine.)
>
> I think, it would be a good idea to keep the existing default behavior as
> is but introduce some new flags where the user may set/suggest encodings
> for the different columns. This way the user can hold the risk of being
> potentially incompatible with other implementations (for the time being)
> and also can fine tune the encodings for the data. This way we can also
> introduce some new encodings that are better in some cases (e.g. lossy
> compression for floating point numbers).
>
> What do you guys thing?
> (I would be happy to help anyone would like to contribute in this topic.)
>
> Cheers,
> Gabor

Re: Current status of Data Page V2?

2020-10-12 Thread Gabor Szadovszky
The answer I wrote in the other thread was based on the current code, so
that is how parquet-mr works now. It does not say, though, how it should
work or how it works in other implementations. Unfortunately, the spec does
not say anything about v1 and v2 in the context of encodings.
Meanwhile, enabling the "new" encodings in v1 may cause compatibility
issues with other implementations. (I am not sure how the existing
releases of parquet-mr would behave if they had to read v1 pages with these
encodings, but I believe it would work fine.)

I think it would be a good idea to keep the existing default behavior as
is but introduce some new flags where the user may set/suggest encodings
for the different columns. This way the user takes on the risk of being
potentially incompatible with other implementations (for the time being)
and can also fine-tune the encodings for the data. This way we can also
introduce some new encodings that are better in some cases (e.g. lossy
compression for floating point numbers).
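
To make the idea concrete, here is a rough sketch of per-column encoding
hints, written in Python for brevity. Everything in it is hypothetical:
EncodingHints, BASELINE_ENCODINGS and the method names are made up for
illustration and are not existing parquet-mr options, and the set of
"baseline" encodings is only an assumption about what older readers handle.

```python
from dataclasses import dataclass, field

# Assumption for illustration: encodings every reader is expected to support.
BASELINE_ENCODINGS = {"PLAIN", "PLAIN_DICTIONARY", "RLE"}


@dataclass
class EncodingHints:
    """Hypothetical writer option holding per-column encoding suggestions."""

    per_column: dict = field(default_factory=dict)
    default: str = "PLAIN"

    def encoding_for(self, column: str) -> str:
        # Keep the existing default behavior when the user gave no hint.
        return self.per_column.get(column, self.default)

    def risky_columns(self) -> list:
        # Columns whose requested encoding is outside the baseline set,
        # i.e. where the user accepts possible incompatibility.
        return sorted(
            col for col, enc in self.per_column.items()
            if enc not in BASELINE_ENCODINGS
        )


hints = EncodingHints(
    per_column={"name": "DELTA_LENGTH_BYTE_ARRAY", "id": "PLAIN"}
)
print(hints.encoding_for("name"))
print(hints.encoding_for("price"))
print(hints.risky_columns())
```

A writer could log or reject the columns returned by risky_columns()
depending on how strict the user wants to be about compatibility.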

What do you guys think?
(I would be happy to help anyone who would like to contribute to this topic.)

Cheers,
Gabor

On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau  wrote:

> Gabor seems to agree that delta is V2 only.
>
> > To summarize, no delta encodings are used for V1 pages. They are
> > available for V2 only.
>
>
> https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html

Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
Gabor seems to agree that delta is V2 only.

> To summarize, no delta encodings are used for V1 pages. They are available
> for V2 only.


https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html



On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau  wrote:

> Good point. I had mentally categorized this as V2, not based on the docs?
>
> I don't think most tools write this, but I can't see anywhere that says
> it is limited to v2 readers/writers. I'm not sure how many tools use a
> vectorized read for it versus delegating to the legacy mr path (at least in
> Java), either.


Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
Good point. I had mentally categorized this as V2, not based on the docs?

I don't think most tools write this, but I can't see anywhere that says
it is limited to v2 readers/writers. I'm not sure how many tools use a
vectorized read for it versus delegating to the legacy mr path (at least in
Java), either.

On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield 
wrote:

>> The big win in v2 pages (if I remember correctly) is that the variable
>> length encoding is no longer interleaved. That would provide a big
>> performance lift when pulling into arrow vectors (and variable length
>> decoding typically dominates total read processing time, on average I've
>> seen 5-10x per cell cpu cost increase for variable reads over scalar
>> reads). AFAIK, there is still no option for that in V1.
>
>
> For delta-length byte array, the documentation [1] states "This encoding is
> always preferred over PLAIN for byte array columns," which makes it sound
> like this could be part of V1 or V2. The encoding enum in the Thrift file
> [2] doesn't seem to document this either.
>
> Is there clearer documentation for what encodings are considered part of
> v1 vs v2?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> [2]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407


Re: Current status of Data Page V2?

2020-10-09 Thread Micah Kornfield
>
> The big win in v2 pages (if I remember correctly) is that the variable
> length encoding is no longer interleaved. That would provide a big
> performance lift when pulling into arrow vectors (and variable length
> decoding typically dominates total read processing time, on average I've
> seen 5-10x per cell cpu cost increase for variable reads over scalar
> reads). AFAIK, there is still no option for that in V1.


For delta-length byte array, the documentation [1] states "This encoding is
always preferred over PLAIN for byte array columns," which makes it sound
like this could be part of V1 or V2. The encoding enum in the Thrift file
[2] doesn't seem to document this either.

Is there clearer documentation for what encodings are considered part of v1
vs v2?

Thanks,
Micah


[1]
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
[2]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407

On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau  wrote:

> The big win in v2 pages (if I remember correctly) is that the variable
> length encoding is no longer interleaved. That would provide a big
> performance lift when pulling into arrow vectors (and variable length
> decoding typically dominates total read processing time, on average I've
> seen 5-10x per cell cpu cost increase for variable reads over scalar
> reads). AFAIK, there is still no option for that in V1.


Re: Current status of Data Page V2?

2020-10-09 Thread Jacques Nadeau
The big win in v2 pages (if I remember correctly) is that the variable
length encoding is no longer interleaved. That would provide a big
performance lift when pulling into arrow vectors (and variable length
decoding typically dominates total read processing time, on average I've
seen 5-10x per cell cpu cost increase for variable reads over scalar
reads). AFAIK, there is still no option for that in V1.
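
To illustrate the difference, here is a small Python sketch, not code from
any Parquet implementation. decode_interleaved follows the PLAIN byte-array
layout (a 4-byte little-endian length in front of every value).
decode_separated assumes the lengths have already been decoded into a list;
in DELTA_LENGTH_BYTE_ARRAY they are stored up front with DELTA_BINARY_PACKED,
which this sketch skips. Function names are made up for the example.

```python
import struct
from itertools import accumulate


def decode_interleaved(buf: bytes) -> list:
    """PLAIN-style layout: each value is a 4-byte LE length, then its bytes."""
    values, pos = [], 0
    while pos < len(buf):
        # The next length is only known after walking past the previous value.
        (n,) = struct.unpack_from("<i", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n])
        pos += n
    return values


def decode_separated(lengths: list, data: bytes) -> tuple:
    """Lengths-first layout: one prefix sum yields the offsets, and the value
    bytes are already contiguous, matching an (offsets, data) buffer pair."""
    offsets = [0, *accumulate(lengths)]
    return offsets, data


values = [b"a", b"bcd", b""]
interleaved = b"".join(struct.pack("<i", len(v)) + v for v in values)
offsets, data = decode_separated([len(v) for v in values], b"".join(values))
print(decode_interleaved(interleaved))
print(offsets, data)
```

With the separated layout the data bytes can be handed over as-is, while the
interleaved layout forces a per-value loop that copies each value out.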

On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield 
wrote:

> Thanks for the quick reply Ryan.
>
>
> > We only use v1 and it still works well. That said, I'd love to make some
> > progress on better encodings and finalizing v2 so we can use them!
>
>
> Are there JIRAs or other documentation that is tracking this work?
>
> Thanks,
> Micah


Re: Current status of Data Page V2?

2020-10-08 Thread Micah Kornfield
Thanks for the quick reply Ryan.


> We only use v1 and it still works well. That said, I'd love to make some
> progress on better encodings and finalizing v2 so we can use them!


Are there JIRAs or other documentation that is tracking this work?

Thanks,
Micah


On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue  wrote:

> While there isn't anything wrong with it, the same challenges have been
> solved in different ways with v1 pages. The main difference is that v2
> pages are broken at record boundaries, and v1 pages weren't guaranteed to
> be. But, in order to write page indexes near the footer, breaking pages at
> record boundaries is required. So you know if you have page indexes, you
> can actually use them to skip through pages safely. That removes much of
> the need for v2 pages.
>
> The main drawback to using v2 pages is that the v2 spec is unfinished, and
> I don't think there is a way to use just the new pages. So you'd possibly
> end up pulling in other beta features that probably shouldn't be used if
> you want to stick with what is required for compatibility across
> implementations.
>
> We only use v1 and it still works well. That said, I'd love to make some
> progress on better encodings and finalizing v2 so we can use them!


Re: Current status of Data Page V2?

2020-10-08 Thread Ryan Blue
While there isn't anything wrong with it, the same challenges have been
solved in different ways with v1 pages. The main difference is that v2
pages are broken at record boundaries, and v1 pages weren't guaranteed to
be. But, in order to write page indexes near the footer, breaking pages at
record boundaries is required. So you know if you have page indexes, you
can actually use them to skip through pages safely. That removes much of
the need for v2 pages.
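
As a rough illustration of that skipping (a Python sketch, not parquet-mr
code): the list below stands in for the first_row_index values recorded per
page in the offset index, and pages_for_rows is a made-up helper. It is only
valid because every page starts at a record boundary.

```python
from bisect import bisect_right


def pages_for_rows(first_row_index: list, num_rows: int,
                   start: int, stop: int) -> range:
    """Return the indices of the pages overlapping rows [start, stop).

    first_row_index[i] is the row-group-relative index of the first row in
    page i, sorted ascending, with first_row_index[0] == 0.
    """
    if not 0 <= start < stop <= num_rows:
        raise ValueError("row range out of bounds")
    # The page containing a row is the last page starting at or before it.
    first = bisect_right(first_row_index, start) - 1
    last = bisect_right(first_row_index, stop - 1) - 1
    return range(first, last + 1)


page_starts = [0, 100, 250, 400]  # four pages in a 500-row row group
print(list(pages_for_rows(page_starts, 500, 120, 260)))
```

Pages outside the returned range never need to be read or decompressed.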

The main drawback to using v2 pages is that the v2 spec is unfinished, and
I don't think there is a way to use just the new pages. So you'd possibly
end up pulling in other beta features that probably shouldn't be used if
you want to stick with what is required for compatibility across
implementations.

We only use v1 and it still works well. That said, I'd love to make some
progress on better encodings and finalizing v2 so we can use them!

On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield 
wrote:

> What is the current status of support for Data Page V2?  Is it recommended
> for production workloads?
>
> Thanks,
> Micah
>


-- 
Ryan Blue
Software Engineer
Netflix