Yes -- what I mean is that I want to put the DataPageV2 read path
through the same unit testing rigor as the DataPageV1 path. I will
take care of it now that I understand what's wrong; I commented on
PARQUET-458.
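
For anyone following along, here is a rough sketch of the layout difference described in the quoted message below. It is not the parquet-cpp code: zlib stands in for Snappy so the snippet runs with only the standard library, the function names and page contents are made up for illustration, and rep_len / def_len are assumed to come from the repetition_levels_byte_length and definition_levels_byte_length fields of the V2 page header.

```python
import zlib  # stand-in codec (assumption); the file in this thread uses Snappy


def read_v1_body(body: bytes) -> bytes:
    """DataPageV1: levels and values are compressed together as one buffer."""
    return zlib.decompress(body)


def read_v2_body(body: bytes, rep_len: int, def_len: int,
                 is_compressed: bool = True):
    """DataPageV2: levels sit uncompressed ahead of the (compressed) values."""
    rep_levels = body[:rep_len]
    def_levels = body[rep_len:rep_len + def_len]
    values = body[rep_len + def_len:]
    if is_compressed:
        values = zlib.decompress(values)
    return rep_levels, def_levels, values


# Hypothetical page contents, for illustration only.
rep, dfn, vals = b"\x03\x00", b"\x03\x01", b"values" * 4
v1_body = zlib.compress(rep + dfn + vals)
v2_body = rep + dfn + zlib.compress(vals)

# Treating a V2 body like a V1 body hands raw level bytes to the decompressor.
try:
    read_v1_body(v2_body)
    v1_style_read_failed = False
except zlib.error:
    v1_style_read_failed = True
```

Running the V1-style read on the V2 body fails inside the decompressor, which would be consistent with the "invalid Snappy data" symptom reported at the start of the thread.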

On Tue, Jun 4, 2019 at 12:38 AM Ivan Sadikov <[email protected]> wrote:
>
> Hi Wes,
>
> I think it’s the file that I added for parquet-rs to test data page v2 back
> then - it is not used anywhere else.
>
>
> Cheers,
>
> Ivan
>
> On Mon, 3 Jun 2019 at 10:15 PM, Wes McKinney <[email protected]> wrote:
>
> > I took a quick look at this -- DataPageV2 has a slightly different
> > structure from DataPageV1, as indicated here
> >
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L555
> >
> > In DataPageV1, the encoded repetition/definition levels are compressed
> > together with the values in the data page. In DataPageV2, only the
> > values are compressed. I'll see if I can fashion a fix sufficient to
> > read the test data file, but more extensive testing will be required
> > to extend the other unit tests to test both reading and writing both
> > types of data pages.
> >
> > On Tue, Apr 30, 2019 at 8:56 AM Curt Hagenlocher <[email protected]>
> > wrote:
> > >
> > > Thanks! Either the documentation is a bit sparse for that level of detail,
> > > or I haven't been looking in the right place. The factoring of the Java
> > > implementation makes it hard for me to see what's going on there, but the
> > > Rust implementation is straightforward enough despite my utter lack of
> > > familiarity with the language.
> > >
> > > On Mon, Apr 29, 2019 at 10:41 AM Ivan Sadikov <[email protected]>
> > > wrote:
> > >
> > > > Not in V2, in V1 the whole page is encoded, but in V2 it is only values, if
> > > > I remember correctly. So we would have to extract repetition and definition
> > > > levels bytes and then decode values.
> > > >
> > > > You can check out code in parquet rust module!
> > > >
> > > > I am not sure about parquet-cpp, we can use that implementation as
> > > > reference there.
> > > >
> > > >
> > > > On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher <[email protected]> wrote:
> > > >
> > > > > Would that be covered by PARQUET-458 (
> > > > > https://issues.apache.org/jira/browse/PARQUET-458)?
> > > > >
> > > > > On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney <[email protected]> wrote:
> > > > >
> > > > > > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> > > > > >
> > > > > > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher <[email protected]> wrote:
> > > > > > >
> > > > > > > But the data page is decoded only after it is decompressed, so I
> > > > > > > wouldn’t expect an unsupported data page to cause a decompression
> > > > > > > failure.
> > > > > > >
> > > > > > > (I am playing with adding V2 support to Parquet.Net.)
> > > > > > >
> > > > > > > Sent from my iPhone
> > > > > > >
> > > > > > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov <[email protected]> wrote:
> > > > > > > >
> > > > > > > > If you are referring to the file in Apache/parquet-testing repository, it
> > > > > > > > is a valid Parquet file with data encoded into data page v2.
> > > > > > > >
> > > > > > > > You can easily test it with “cargo install parquet” and “parquet-read
> > > > > > > > filepath”.
> > > > > > > >
> > > > > > > > I am not sure what kind of code you have written, but the error you have
> > > > > > > > encountered could be related to the fact that parquet-cpp does not support
> > > > > > > > decoding of data page v2.
> > > > > > > >
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Ivan
> > > > > > > >
> > > > > > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <[email protected]> wrote:
> > > > > > > >
> > > > > > > > >> To the best of my ability to tell, there is invalid Snappy data in the file
> > > > > > > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
> > > > > > > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > > > > > >>
> > > > > > > >> Thanks!
> > > > > > > >> -Curt
> > > > > > > >>
> > > > > >
> > > > >
> > > >
> >
