Yong Zhang created AVRO-1786:
--------------------------------
Summary: Strange IndexOutOfBoundsException in GenericDatumReader.readString
Key: AVRO-1786
URL: https://issues.apache.org/jira/browse/AVRO-1786
Project: Avro
Issue Type: Bug
Components: java
Affects Versions: 1.7.4
Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
Use IBM JVM:
IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References
20140515_199835 (JIT enabled, AOT enabled)
Reporter: Yong Zhang
We are using the Avro 1.7.4 that ships with IBM BigInsights V3.0.0.2, which also
includes Apache Hadoop 2.2.0.
There is a dataset stored in Avro format, along with daily delta changes to this
dataset ingested into HDFS in CSV format. The ETL transforms the CSV data into
Avro format in a first MR job, then merges the new Avro records with the existing
Avro records in a second MR job; the final output is a new merged dataset, stored
in Avro format in HDFS and ready to use.
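For context, the merge job is wired up roughly along these lines with the
avro-mapred new-API classes (a minimal sketch, not our exact code; the class
name and schema file name are hypothetical):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MergeJobSetup {
    public static Job configure(Configuration conf) throws Exception {
        // Parse the dataset schema (the .avsc content is shown at the end of this report)
        Schema schema = new Schema.Parser().parse(new File("Lists.avsc"));
        Job job = Job.getInstance(conf, "avro-merge");
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, schema);
        // Avro records also cross the shuffle; that is where the stack trace below fails
        AvroJob.setMapOutputKeySchema(job, schema);
        AvroJob.setOutputKeySchema(job, schema);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        return job;
    }
}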
This had worked for over a year, until suddenly we started getting a strange
exception from the second MR job. Here is the stack trace:
java.lang.IndexOutOfBoundsException
    at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
    at java.io.DataInputStream.read(DataInputStream.java:160)
    at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
    at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
    at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
    at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
    at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
    at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
    at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
    at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
    at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(AccessController.java:366)
    at javax.security.auth.Subject.doAs(Subject.java:572)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
The exception is thrown in the reducer. The input of this MR job contains two
parts: the new Avro records generated from the delta, and the existing latest
Avro records from before the delta. Our reducer code does not appear to have
been invoked yet; Avro fails while reading the records, and I am not 100% sure
which of the two parts it is complaining about.
When I first hit this error, I thought the new delta data contained something
that was causing the issue, but after testing different cases, it looks more
like Avro cannot read data that it generated itself. The stack trace suggests
the final Avro output is corrupted, but I don't know why or how.
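One way to check whether a merged output file is even readable on its own is a
small standalone reader like the following (a minimal sketch, assuming the file
has been copied out of HDFS; the class name is hypothetical):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFileCheck {
    public static void main(String[] args) throws Exception {
        // Point this at a suspect merged output file copied out of HDFS
        File input = new File(args[0]);
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(input, datumReader);
        long count = 0;
        while (reader.hasNext()) {
            reader.next(); // throws if a record's binary encoding is corrupt
            count++;
        }
        reader.close();
        System.out.println("Read " + count + " records successfully");
    }
}

If a file reads cleanly here but the job still fails, the problem would be on
the shuffle-side deserialization path (AvroDeserializer) rather than in the
stored file itself.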
My test cases are as follows:
1) Day 1 data + day 2 delta -> merged without issue, generating day 2 data
2) Day 2 data + day 3 delta -> got the exception in the MR job
3) Day 1 data + day 2 delta + day 3 delta -> merged without issue, generating
day 3 data; but after that, the result cannot be merged with any new incoming
delta.
I tried Avro 1.7.7, which appears to be the latest release so far, and I still
get this error. I believe some particular data triggers this bug, but the data
in the CSV files all looks fine.
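Since the failure happens while deserializing intermediate records rather than
in our reducer logic, another way to narrow it down is to round-trip a single
record through Avro's binary encoder and decoder, roughly mirroring what the
shuffle does (a minimal sketch; the field values are made up and the schema
file name is hypothetical):

import java.io.ByteArrayOutputStream;
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class RoundTripCheck {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("Lists.avsc"));
        GenericRecord record = new GenericData.Record(schema);
        record.put("account_id", 1L);
        record.put("list_id", "list-1");
        // The array field is required, so give it an empty array;
        // the remaining nullable union fields default to null.
        Schema arraySchema = schema.getField("list_memberships").schema();
        record.put("list_memberships", new GenericData.Array<GenericRecord>(0, arraySchema));

        // Encode to raw binary, as the MR shuffle does (no file container)
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // Decode it back; a failure here would reproduce the readString error
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord copy = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(copy);
    }
}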
Here is the Avro schema file we defined for this dataset:
{
  "namespace" : "xxxxxx",
  "type" : "record",
  "name" : "Lists",
  "fields" : [
    {"name" : "account_id", "type" : "long"},
    {"name" : "list_id", "type" : "string"},
    {"name" : "sequence_id", "type" : ["int", "null"]},
    {"name" : "name", "type" : ["string", "null"]},
    {"name" : "state", "type" : ["string", "null"]},
    {"name" : "description", "type" : ["string", "null"]},
    {"name" : "dynamic_filtered_list", "type" : ["int", "null"]},
    {"name" : "filter_criteria", "type" : ["string", "null"]},
    {"name" : "created_at", "type" : ["long", "null"]},
    {"name" : "updated_at", "type" : ["long", "null"]},
    {"name" : "deleted_at", "type" : ["long", "null"]},
    {"name" : "favorite", "type" : ["int", "null"]},
    {"name" : "delta", "type" : ["boolean", "null"]},
    {"name" : "list_memberships", "type" : {
      "type" : "array", "items" : {
        "name" : "ListMembership", "type" : "record",
        "fields" : [
          {"name" : "channel_id", "type" : "string"},
          {"name" : "created_at", "type" : ["long", "null"]},
          {"name" : "created_source", "type" : ["string", "null"]},
          {"name" : "deleted_at", "type" : ["long", "null"]},
          {"name" : "sequence_id", "type" : ["long", "null"]}
        ]
      }
    }}
  ]
}
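A quick sanity check that the schema itself parses (the schema file name is
hypothetical):

import java.io.File;
import org.apache.avro.Schema;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        // Schema.Parser throws if the .avsc above is malformed
        Schema schema = new Schema.Parser().parse(new File("Lists.avsc"));
        System.out.println(schema.getField("list_memberships").schema());
    }
}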