[ 
https://issues.apache.org/jira/browse/PARQUET-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086556#comment-15086556
 ] 

Jason Altekruse commented on PARQUET-400:
-----------------------------------------

[~dweeks] can you please try out the branch on the PR? I believe I have figured out 
both the original issue and the secondary issue that was coming up when I ran 
with a different Hadoop version. Both are explained by reads through the 
FSDataInputStream API not returning all of the requested data. Before the 
ByteBuffer changes, the readFully() method was used to avoid this problem; 
there is no equivalent in the 2.x API for reading into a ByteBuffer, and we had 
missed filling in this functional gap ourselves.
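
To illustrate the gap described above: Hadoop's read(ByteBuffer) contract allows a short read, so callers must loop until the buffer is filled, the way readFully() did for byte arrays. The sketch below is illustrative only (not the PR's actual code); the ByteBufferReadable interface here is a stand-in for the Hadoop API with the same return-count contract.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

public class ReadFullyDemo {
    // Stand-in for Hadoop's read(ByteBuffer): returns the number of bytes
    // read, possibly fewer than buf.remaining(), or -1 at end of stream.
    interface ByteBufferReadable {
        int read(ByteBuffer buf) throws IOException;
    }

    // A single read() is not enough: loop until the buffer is full, failing
    // loudly on a premature EOF, matching the guarantee readFully() gave
    // the old byte[] read path.
    static void readFully(ByteBufferReadable in, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            int n = in.read(buf);
            if (n < 0) {
                throw new IOException("EOF with " + buf.remaining() + " bytes left to read");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;

        // Simulate a stream that returns at most 7 bytes per call, the way
        // a short read from the filesystem might.
        final int[] pos = {0};
        ByteBufferReadable shortReads = buf -> {
            if (pos[0] >= data.length) return -1;
            int n = Math.min(7, Math.min(buf.remaining(), data.length - pos[0]));
            buf.put(data, pos[0], n);
            pos[0] += n;
            return n;
        };

        ByteBuffer dst = ByteBuffer.allocate(100);
        readFully(shortReads, dst);
        if (dst.hasRemaining()) throw new AssertionError("buffer not filled");
        System.out.println("readFully filled the buffer despite short reads");
    }
}
```

Without the loop, a single read() here would hand back only 7 bytes, leaving the rest of the buffer as zeros, which is exactly the kind of truncated data that confuses the downstream page-header parsing.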

The second error message was related to a different compatibility codepath for 
some old semi-corrupt files, which could be falsely triggered by an incomplete 
read. It would only appear before the error you had reported if a certain 
subset of the column chunk had been returned; if enough data came back, the 
code could avoid that condition and make it all the way to the error you 
reported, which explains the inconsistent behavior.

https://github.com/apache/parquet-mr/pull/306

> Error reading some files after PARQUET-77 bytebuffer read path
> --------------------------------------------------------------
>
>                 Key: PARQUET-400
>                 URL: https://issues.apache.org/jira/browse/PARQUET-400
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Jason Altekruse
>            Assignee: Jason Altekruse
>         Attachments: bytebyffer_read_fail.gz.parquet
>
>
> This issue is based on a discussion on the list started by [~dweeks]
> Full discussion:
> https://mail-archives.apache.org/mod_mbox/parquet-dev/201512.mbox/%3CCAMpYv7C_szTheua9N95bXvbd2ROmV63BFiJTK-K-aDNK6ZNBKA%40mail.gmail.com%3E
> From the thread (he later provided a small repro file that is attached here):
> Just wanted to see if you or anyone else has run into problems reading
> files after the ByteBuffer patch.  I've been running into issues and have
> narrowed it down to the ByteBuffer commit using a small repro file (written
> with 1.6.0, unfortunately can't share the data).
> It doesn't happen for every file, but those that fail give this error:
> can not read class org.apache.parquet.format.PageHeader: Required field
> 'uncompressed_page_size' was not found in serialized data! Struct:
> PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)