[ https://issues.apache.org/jira/browse/PARQUET-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074606#comment-15074606 ]

Jason Altekruse commented on PARQUET-400:
-----------------------------------------

[~dweeks] [~zhenxiao] Apologies for the delay on this. I finally got a chance 
to debug it further, and unfortunately I am seeing a different error when 
reading the file from HDFS.

I did not see any issue reading the file from a 2.6.0 CDH cluster through 
parquet-cat. However, when I created a test that used the example 
GroupReadSupport, I saw the following error [1]. I'm confused about why this 
wouldn't have been reported by parquet-tools as well.
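For reference, the failing test was essentially the standard 
ParquetReader/GroupReadSupport read loop; the sketch below is illustrative 
(the path and class name are placeholders, not the exact test code), and it 
assumes the parquet-hadoop and hadoop-common dependencies are on the 
classpath:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ByteBufferReadRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder path; in the real test this pointed at the
        // attached bytebyffer_read_fail.gz.parquet file on HDFS.
        Path path = new Path("hdfs:///tmp/bytebyffer_read_fail.gz.parquet");
        try (ParquetReader<Group> reader =
                ParquetReader.builder(new GroupReadSupport(), path).build()) {
            Group record;
            // The ParquetDecodingException below is thrown from read(),
            // while decoding the first row group.
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}
```

Reading through GroupReadSupport exercises the same 
InternalParquetRecordReader/ParquetFileReader path that appears in the stack 
trace below.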

I had some trouble trying to read the file from S3. Can you give me a little 
more info about your setup? Are you using jets3t? What version are you using 
to read the file? Are you using s3://, s3a://, or s3n://?

[1]
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file hdfs://h011.d.drem.io/tmp/bytebyffer_read_fail.gz.parquet
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:129)
        at 
org.apache.drill.TestExampleQueries.random(TestExampleQueries.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Caused by: org.apache.parquet.io.ParquetDecodingException: more than one 
dictionary page in column [value, bag, array] DOUBLE
        at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:597)
        at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:541)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
        at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)

> Error reading some files after PARQUET-77 bytebuffer read path
> --------------------------------------------------------------
>
>                 Key: PARQUET-400
>                 URL: https://issues.apache.org/jira/browse/PARQUET-400
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Jason Altekruse
>            Assignee: Jason Altekruse
>         Attachments: bytebyffer_read_fail.gz.parquet
>
>
> This issue is based on a discussion on the list started by [~dweeks]
> Full discussion:
> https://mail-archives.apache.org/mod_mbox/parquet-dev/201512.mbox/%3CCAMpYv7C_szTheua9N95bXvbd2ROmV63BFiJTK-K-aDNK6ZNBKA%40mail.gmail.com%3E
> From the thread (he later provided a small repro file that is attached here):
> Just wanted to see if you or anyone else has run into problems reading
> files after the ByteBuffer patch.  I've been running into issues and have
> narrowed it down to the ByteBuffer commit using a small repro file (written
> with 1.6.0, unfortunately can't share the data).
> It doesn't happen for every file, but those that fail give this error:
> can not read class org.apache.parquet.format.PageHeader: Required field
> 'uncompressed_page_size' was not found in serialized data! Struct:
> PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)