Hello all,

I'm using AvroParquetOutputFormat and AvroParquetInputFormat in a pair
of Hadoop applications -- one that writes Avro-Parquet and one that
reads it.  Strictly speaking, I'm using Pydoop (
https://github.com/crs4/pydoop), but the I/O itself goes through the
AvroParquet classes.
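
For reference, the Java side of the setup boils down to something like
the sketch below.  This is simplified and not my exact code (the job
names, path and surrounding driver are made up), but it shows which
AvroParquet entry points I'm going through:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.parquet.avro.AvroParquetInputFormat;
    import org.apache.parquet.avro.AvroParquetOutputFormat;
    import org.bdgenomics.formats.avro.AlignmentRecord;

    Configuration conf = new Configuration();
    Path data = new Path("alignments.parquet");

    // Writer job: serialize AlignmentRecord objects as Avro-Parquet.
    Job writeJob = Job.getInstance(conf, "write-alignments");
    writeJob.setOutputFormatClass(AvroParquetOutputFormat.class);
    // Schema of the records being written (from the generated Avro class).
    AvroParquetOutputFormat.setSchema(writeJob, AlignmentRecord.getClassSchema());
    FileOutputFormat.setOutputPath(writeJob, data);

    // Reader job: decode the same records back through the Avro converter.
    Job readJob = Job.getInstance(conf, "read-alignments");
    readJob.setInputFormatClass(AvroParquetInputFormat.class);
    FileInputFormat.addInputPath(readJob, data);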

The writer seems to succeed.  The reader, however, crashes with a
ParquetDecodingException when processing the writer's output.
Here's the syslog output with the stack trace:


2016-03-04 12:46:50,075 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: ParquetInputSplit{part: hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet start: 0 end: 16916 length: 16916 hosts: []}
2016-03-04 12:46:50,846 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
    at it.crs4.pydoop.mapreduce.pipes.PydoopAvroBridgeReaderBase.initialize(PydoopAvroBridgeReaderBase.java:66)
    at it.crs4.pydoop.mapreduce.pipes.PydoopAvroBridgeValueReader.initialize(PydoopAvroBridgeValueReader.java:38)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:545)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:783)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassCastException: org.bdgenomics.formats.avro.Contig cannot be cast to java.lang.Integer
    at org.bdgenomics.formats.avro.AlignmentRecord.put(AlignmentRecord.java:258)
    at org.apache.parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:168)
    at org.apache.parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:46)
    at org.apache.parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:95)
    at org.apache.parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:189)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
    ... 11 more


I'm using Parquet 1.8.1 and Avro 1.7.6.  I can read the file with
parquet-tools-1.8.1, so I'm inclined to think the file itself is
valid.

Contig is the first record type defined in my Avro schema, and it
maps to the first field in the file schema:

file schema:
org.bdgenomics.formats.avro.AlignmentRecord
--------------------------------------------------------------------------------
contig:                               OPTIONAL F:6
.contigName:                          OPTIONAL BINARY O:UTF8 R:0 D:2
.contigLength:                        OPTIONAL INT64 R:0 D:2
...and so on.
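
One hypothesis I'm entertaining: the ClassCastException comes out of
AlignmentRecord.put(), which (like any Avro-generated specific record)
sets fields by positional index.  So a mismatch between the schema the
file was written with and the AlignmentRecord class on the reader's
classpath -- say, a field added, removed or reordered between versions
-- could plausibly make the converter stuff a Contig into an int
field.  If that's the right track, would explicitly pinning the Avro
read schema help?  Something like this untested sketch (readJob as in
the setup above):

    // Force the Avro read support to decode into this exact schema,
    // rather than whatever schema it would otherwise pick up.
    AvroParquetInputFormat.setAvroReadSchema(readJob,
        AlignmentRecord.getClassSchema());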

Can anyone suggest what might be causing the problem on the read side?
Any help would be appreciated!

Thanks,

Luca
