I took a quick look at this -- DataPageV2 has a slightly different
structure from DataPageV1, as indicated here

https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L555

In DataPageV1, the encoded repetition/definition levels are compressed
together with the values in the data page. In DataPageV2, only the
values are compressed.

I'll see if I can fashion a fix sufficient to read the test data file,
but more extensive testing will be required to extend the other unit
tests to test both reading and writing both types of data pages.

On Tue, Apr 30, 2019 at 8:56 AM Curt Hagenlocher <[email protected]> wrote:
>
> Thanks! Either the documentation is a bit sparse for that level of detail,
> or I haven't been looking in the right place. The factoring of the Java
> implementation makes it hard for me to see what's going on there, but the
> Rust implementation is straightforward enough despite my utter lack of
> familiarity with the language.
>
> On Mon, Apr 29, 2019 at 10:41 AM Ivan Sadikov <[email protected]>
> wrote:
>
> > Not in V2, in V1 the whole page is encoded, but in V2 it is only values, if
> > I remember correctly. So we would have to extract repetition and definition
> > levels bytes and then decode values.
> >
> > You can check out code in parquet rust module!
> >
> > I am not sure about parquet-cpp, we can use that implementation as
> > reference there.
> >
> >
> > On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher <[email protected]>
> > wrote:
> >
> > > Would that be covered by PARQUET-458 (
> > > https://issues.apache.org/jira/browse/PARQUET-458)?
> > >
> > > On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney <[email protected]>
> > wrote:
> > >
> > > > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> > > >
> > > > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher <[email protected]
> > > >
> > > wrote:
> > > > >
> > > > > But the data page is decoded only after it is decompressed, so I
> > > > wouldn’t expect an unsupported data page to cause a decompression
> > > failure.
> > > > >
> > > > > (I am playing with adding V2 support to Parquet.Net.)
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > If you are referring to the file in Apache/parquet-testing
> > > repository,
> > > > it
> > > > > > is a valid Parquet file with data encoded into data page v2.
> > > > > >
> > > > > > You can easily test it with “cargo install parquet” and
> > “parquet-read
> > > > > > filepath”.
> > > > > >
> > > > > > I am not sure what kind of code you have written, but the error you
> > > > have
> > > > > > encountered could be related to the fact that parquet-cpp does not
> > > > support
> > > > > > decoding of data page v2.
> > > > > >
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Ivan
> > > > > >
> > > > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <
> > > [email protected]
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> To the best of my ability to tell, there is invalid Snappy data in
> > > > the file
> > > > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither
> > read
> > > > it with
> > > > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > > > >>
> > > > > >> Thanks!
> > > > > >> -Curt
> > > > > >>
> > > > > >
> > >
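
To make the DataPageV1/DataPageV2 difference discussed in this thread concrete, here is a minimal Python sketch. It is not parquet-cpp or parquet-rs code: the function names are hypothetical, zlib stands in for Snappy so that the example runs with the standard library alone, and the level byte lengths are passed as plain arguments, whereas a real reader would take them from the DataPageHeaderV2 fields repetition_levels_byte_length and definition_levels_byte_length.

```python
import zlib


def read_page_v1(body: bytes) -> bytes:
    """V1: levels and values are compressed together, so decompress the whole body."""
    return zlib.decompress(body)


def read_page_v2(body: bytes, rep_len: int, def_len: int, is_compressed: bool = True):
    """V2: levels sit uncompressed at the front; only the values may be compressed."""
    rep_levels = body[:rep_len]
    def_levels = body[rep_len:rep_len + def_len]
    values = body[rep_len + def_len:]
    if is_compressed:
        values = zlib.decompress(values)
    return rep_levels, def_levels, values


# Build a toy V2 page body: raw level bytes followed by compressed values.
rep, dfn, vals = b"\x01\x02", b"\x03", b"parquet" * 8
v2_body = rep + dfn + zlib.compress(vals)

assert read_page_v2(v2_body, len(rep), len(dfn)) == (rep, dfn, vals)

# Treating the same bytes as a V1 page hands the uncompressed level bytes
# to the decompressor, which rejects them.
try:
    read_page_v1(v2_body)
    v1_path_failed = False
except zlib.error:
    v1_path_failed = True
```

In this toy setup the V1 path fails inside the decompressor rather than in the page decoder, which would be consistent with the "invalid Snappy data" symptom reported at the start of the thread.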
