Hello all,
I'm using AvroParquetOutputFormat and AvroParquetInputFormat in a
pair of Hadoop applications -- one that writes Avro-Parquet data and
one that reads it back. Strictly speaking, I'm using Pydoop
(https://github.com/crs4/pydoop), but the actual I/O is done through
the AvroParquet classes.
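For reference, the Java-side setup is roughly equivalent to the
following (a simplified sketch -- the real wiring happens inside the
Pydoop bridge, and the job names here are just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.avro.AvroParquetInputFormat;
    import org.apache.parquet.avro.AvroParquetOutputFormat;
    import org.bdgenomics.formats.avro.AlignmentRecord;

    Configuration conf = new Configuration();

    // Writer application: records leave through AvroParquetOutputFormat.
    Job writeJob = Job.getInstance(conf, "writer");
    writeJob.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(writeJob, AlignmentRecord.getClassSchema());

    // Reader application: the same generated schema is used on the way
    // back in (the explicit read schema is optional; without it the
    // schema stored in the file is used).
    Job readJob = Job.getInstance(conf, "reader");
    readJob.setInputFormatClass(AvroParquetInputFormat.class);
    AvroParquetInputFormat.setAvroReadSchema(readJob, AlignmentRecord.getClassSchema());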
The writer seems to succeed. The reader, however, crashes with a
ParquetDecodingException when processing the writer's output.
Here's the syslog output with the stack trace:
2016-03-04 12:46:50,075 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: ParquetInputSplit{part: hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet start: 0 end: 16916 length: 16916 hosts: []}
2016-03-04 12:46:50,846 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
    at it.crs4.pydoop.mapreduce.pipes.PydoopAvroBridgeReaderBase.initialize(PydoopAvroBridgeReaderBase.java:66)
    at it.crs4.pydoop.mapreduce.pipes.PydoopAvroBridgeValueReader.initialize(PydoopAvroBridgeValueReader.java:38)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:545)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:783)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassCastException: org.bdgenomics.formats.avro.Contig cannot be cast to java.lang.Integer
    at org.bdgenomics.formats.avro.AlignmentRecord.put(AlignmentRecord.java:258)
    at org.apache.parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:168)
    at org.apache.parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:46)
    at org.apache.parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:95)
    at org.apache.parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:189)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
    ... 11 more
I'm using Parquet 1.8.1 and Avro 1.7.6. I can read the Parquet file
with parquet-tools-1.8.1, so I'm inclined to think the file itself
is valid.
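For what it's worth, this is roughly the command I used, and it dumps
all the records without complaint:

    hadoop jar parquet-tools-1.8.1.jar cat \
        hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet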
Contig is the first class defined in my Avro schema. Here's the
corresponding start of the Parquet file schema, as printed by
parquet-tools:
file schema:
org.bdgenomics.formats.avro.AlignmentRecord
--------------------------------------------------------------------------------
contig: OPTIONAL F:6
.contigName: OPTIONAL BINARY O:UTF8 R:0 D:2
.contigLength: OPTIONAL INT64 R:0 D:2
...and so on.
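In case it helps narrow things down, I believe the same Avro
conversion path can be exercised outside of MapReduce with the plain
AvroParquetReader API, along these lines (an untested sketch; the
builder-based API is what I'd reach for in 1.8.1):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    // Read the file record by record, bypassing the MapReduce machinery.
    Path path = new Path("hdfs://localhost:9000/user/pireddu/"
        + "seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet");
    ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(path).build();
    try {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    } finally {
      reader.close();
    }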
Can someone suggest what might be causing the failure on the read
side? The ClassCastException is raised in AlignmentRecord.put(),
which sets fields by positional index, so it looks as if the
converter is handing the decoded Contig to the wrong field slot --
but I don't see why that would happen. Any help would be appreciated!
Thanks,
Luca