[jira] [Resolved] (PARQUET-531) Can't read past first page in a column

Wes McKinney (JIRA) Wed, 24 Feb 2016 13:04:57 -0800

     [ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney resolved PARQUET-531.
----------------------------------
    Resolution: Fixed

This was fixed in https://github.com/apache/parquet-cpp/pull/62. I verified 
that invoking {{parquet_reader}} on the attached file now prints the contents 
without failing. Thank you!

> Can't read past first page in a column
> --------------------------------------
>
>                 Key: PARQUET-531
>                 URL: https://issues.apache.org/jira/browse/PARQUET-531
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>         Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>            Reporter: Spiro Michaylov
>            Assignee: Deepak Majeti
>         Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2015 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>      case parquet::CompressionCodec::GZIP:
>        decompressor_.reset(new GZipCodec());
>        break;
> {code}
> I try to run the parquet_reader example on the file I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with:
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it. However, decompression 
> itself seems to work just fine (congratulations: this is awesome!). Tracing 
> through the problem in the debugger, it looks to me like the buffering is 
> confused in general, possibly a mix-up between the buffering at the Scanner 
> and Reader levels. I can also reproduce the problem by reading through just 
> a single column. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
>     DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}
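
For readers following the report: the failure at exactly 128 rows points at the moment the scanner has used up its first batch and must ask the reader for more. The sketch below is only an illustration of that refill step under simplified assumptions; it is not the parquet-cpp implementation and not the change made in the pull request. ToyReader, ToyScanner, ScanAll and kBatchSize are hypothetical names invented here, and pages are modelled as plain vectors of non-null int64 values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column reader (not the parquet-cpp API).
// Each inner vector plays the role of one decoded data page.
class ToyReader {
 public:
  explicit ToyReader(std::vector<std::vector<int64_t>> pages)
      : pages_(std::move(pages)) {}

  // Copies up to max_values values into out and returns how many were
  // copied. When the current page is exhausted it moves on to the next
  // page instead of stopping, so one batch may span a page boundary.
  std::size_t ReadBatch(std::size_t max_values, int64_t* out) {
    std::size_t n = 0;
    while (n < max_values && page_ < pages_.size()) {
      const std::vector<int64_t>& page = pages_[page_];
      if (pos_ == page.size()) {  // current page consumed: advance
        ++page_;
        pos_ = 0;
        continue;
      }
      out[n++] = page[pos_++];
    }
    return n;
  }

 private:
  std::vector<std::vector<int64_t>> pages_;
  std::size_t page_ = 0;  // index of the current page
  std::size_t pos_ = 0;   // next unread value within the current page
};

// Hypothetical scanner that hands out one value at a time from a buffer
// of kBatchSize values (128, mirroring the constant quoted above).
class ToyScanner {
 public:
  static constexpr std::size_t kBatchSize = 128;

  explicit ToyScanner(ToyReader* reader)
      : reader_(reader), buffer_(kBatchSize) {
    assert(reader_ != nullptr);
  }

  // Returns false once the column is exhausted. The buffer is refilled
  // whenever every buffered value has been handed out.
  bool Next(int64_t* value) {
    if (consumed_ == buffered_) {
      buffered_ = reader_->ReadBatch(kBatchSize, buffer_.data());
      consumed_ = 0;
      if (buffered_ == 0) return false;  // nothing left in any page
    }
    *value = buffer_[consumed_++];
    return true;
  }

 private:
  ToyReader* reader_;
  std::vector<int64_t> buffer_;
  std::size_t buffered_ = 0;  // values currently held in buffer_
  std::size_t consumed_ = 0;  // values already handed out from buffer_
};

// Drains a whole toy column through the scanner.
std::vector<int64_t> ScanAll(std::vector<std::vector<int64_t>> pages) {
  ToyReader reader(std::move(pages));
  ToyScanner scanner(&reader);
  std::vector<int64_t> out;
  int64_t v = 0;
  while (scanner.Next(&v)) out.push_back(v);
  return out;
}
```

In this toy model the scanner tracks only its own buffer position and the reader tracks only its page position, so neither the 128-value batch boundary nor the page boundary needs special handling by the caller.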



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
