Hi Wes,

I think it’s the file that I added for parquet-rs to test data page v2 back
then - it is not used anywhere else.


Cheers,

Ivan

On Mon, 3 Jun 2019 at 10:15 PM, Wes McKinney <[email protected]> wrote:

> I took a quick look at this -- DataPageV2 has a slightly different
> structure from DataPageV1, as indicated here
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L555
>
> In DataPageV1, the encoded repetition/definition levels are compressed
> together with the values in the data page. In DataPageV2, only the
> values are compressed. I'll see if I can fashion a fix sufficient to
> read the test data file, but more extensive testing will be required
> to extend the other unit tests to test both reading and writing both
> types of data pages.
>
> On Tue, Apr 30, 2019 at 8:56 AM Curt Hagenlocher <[email protected]>
> wrote:
> >
> > Thanks! Either the documentation is a bit sparse for that level of detail,
> > or I haven't been looking in the right place. The factoring of the Java
> > implementation makes it hard for me to see what's going on there, but the
> > Rust implementation is straightforward enough despite my utter lack of
> > familiarity with the language.
> >
> > On Mon, Apr 29, 2019 at 10:41 AM Ivan Sadikov <[email protected]>
> > wrote:
> >
> > > Not in V2. In V1 the whole page is compressed, but in V2 only the values
> > > are, if I remember correctly. So we would have to extract the repetition
> > > and definition level bytes and then decode the values.
> > >
> > > You can check out code in parquet rust module!
> > >
> > > I am not sure about parquet-cpp; we can use the Rust implementation as a
> > > reference there.
> > >
> > >
> > > On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher <[email protected]>
> > > wrote:
> > >
> > > > Would that be covered by PARQUET-458 (
> > > > https://issues.apache.org/jira/browse/PARQUET-458)?
> > > >
> > > > On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney <[email protected]>
> > > > wrote:
> > > >
> > > > > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> > > > >
> > > > > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > But the data page is decoded only after it is decompressed, so I
> > > > > > wouldn’t expect an unsupported data page to cause a decompression
> > > > > > failure.
> > > > > >
> > > > > > (I am playing with adding V2 support to Parquet.Net.)
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > If you are referring to the file in Apache/parquet-testing repository, it
> > > > > > > is a valid Parquet file with data encoded into data page v2.
> > > > > > >
> > > > > > > You can easily test it with “cargo install parquet” and “parquet-read
> > > > > > > filepath”.
> > > > > > >
> > > > > > > I am not sure what kind of code you have written, but the error you have
> > > > > > > encountered could be related to the fact that parquet-cpp does not support
> > > > > > > decoding of data page v2.
> > > > > > >
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Ivan
> > > > > > >
> > > > > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> To the best of my ability to tell, there is invalid Snappy data in the file
> > > > > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
> > > > > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > > > > >>
> > > > > > >> Thanks!
> > > > > > >> -Curt
> > > > > > >>
> > > > >
> > > >
> > >
>
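
The V1 versus V2 page layout discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not code from parquet-cpp, parquet-rs, or Parquet.Net. The function names and the demo page are hypothetical, and zlib stands in for Snappy only so that the example runs with the Python standard library. The rep_len, def_len and is_compressed parameters are meant to correspond to the repetition_levels_byte_length, definition_levels_byte_length and is_compressed fields of DataPageHeaderV2 in parquet.thrift.

```python
import zlib
from typing import Callable, Tuple

Codec = Callable[[bytes], bytes]


def decompress_v1_page(body: bytes, decompress: Codec) -> bytes:
    """V1: levels and values are compressed together, so the whole page
    body goes through the codec in one call."""
    return decompress(body)


def split_v2_page(
    body: bytes,
    rep_len: int,
    def_len: int,
    is_compressed: bool,
    decompress: Codec,
) -> Tuple[bytes, bytes, bytes]:
    """V2: the level bytes sit uncompressed at the front of the page;
    only the values that follow them are (optionally) compressed."""
    levels_len = rep_len + def_len
    if rep_len < 0 or def_len < 0 or levels_len > len(body):
        raise ValueError("level byte lengths do not fit in the page body")
    rep_levels = body[:rep_len]
    def_levels = body[rep_len:levels_len]
    values = body[levels_len:]
    if is_compressed:
        values = decompress(values)
    return rep_levels, def_levels, values


# Hypothetical demo page: made-up level bytes followed by compressed values.
rep_levels = b"\x00\x01"
def_levels = b"\x02\x03"
values = b"hello parquet" * 4
v2_body = rep_levels + def_levels + zlib.compress(values)

parts = split_v2_page(
    v2_body, len(rep_levels), len(def_levels), True, zlib.decompress
)

# Treating the V2 body as a V1 page hands the raw level bytes to the codec,
# which rejects them as invalid compressed data.
try:
    decompress_v1_page(v2_body, zlib.decompress)
    v1_style_read_failed = False
except zlib.error:
    v1_style_read_failed = True
```

Under these assumptions, a reader that decompresses the whole V2 page body in one call fails inside the codec, which would be consistent with the "invalid Snappy data" symptom reported at the start of the thread.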
