[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15112664#comment-15112664 ]

Yong Zhang commented on AVRO-1786:
----------------------------------

After spending a lot of time debugging this, I found out some interesting facts, 
and I want to know if anyone can provide more information related to this.

In our Avro schema, if the "list_id", which is a UUID, contains the following 
characters "3a3ffb10be8b11e3977ad4ae5284344f", that record will cause this 
exception.
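
Just to rule out the characters themselves, that exact string round-trips 
cleanly through Avro's binary encoding when written and read standalone. A 
minimal sketch (the class name is mine; it uses only the public 
EncoderFactory/DecoderFactory APIs):

import java.io.ByteArrayOutputStream;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class Utf8RoundTrip {
  public static void main(String[] args) throws Exception {
    String listId = "3a3ffb10be8b11e3977ad4ae5284344f";

    // Avro writes a zig-zag varint length followed by the UTF-8 bytes.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    encoder.writeString(listId);
    encoder.flush();

    // Reading it back succeeds, so the string alone does not break the decoder.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    System.out.println(decoder.readString(null));   // prints the UUID again
  }
}

So whatever goes wrong must involve the surrounding stream, not this value in 
isolation.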

In this case, check the following Avro class, BinaryDecoder.java, from here:

https://github.com/apache/avro/blob/release-1.7.4/lang/java/avro/src/main/java/org/apache/avro/io/BinaryDecoder.java

The above link points to release 1.7.4; the relevant code starts at line 259:

    int length = readInt();
    Utf8 result = (old != null ? old : new Utf8());
    result.setByteLength(length);
    if (0 != length) {                              // true for a negative length too
      doReadBytes(result.getBytes(), 0, length);    // -51 is passed through unchecked
    }
In this case, when the exception happens, the length is in fact "-51", as 
returned by readInt(). Since -51 != 0, it is passed straight into doReadBytes(), 
and the underlying ByteArrayInputStream.read() rejects the negative length with 
the IndexOutOfBoundsException shown in the stack trace below.
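
The same bounds check can be seen failing in isolation; this tiny sketch (mine, 
not Avro code) reproduces the top frame of the stack trace:

import java.io.ByteArrayInputStream;

public class NegativeLengthDemo {
  public static void main(String[] args) {
    ByteArrayInputStream in = new ByteArrayInputStream(new byte[16]);
    // A negative len fails the argument check inside read(byte[], int, int),
    // throwing java.lang.IndexOutOfBoundsException just like the trace below.
    in.read(new byte[16], 0, -51);
  }
}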

Based on AVRO-1198, a negative length returned here means the data is 
malformed, and that JIRA even patched in a message to say so in the exception.

But in my case:

1) The Avro data is NOT malformed. Why do I say that? Because I can query the 
whole ready-to-merge dataset in Hive without any issue, even the records being 
complained about.
2) The real runtime object is "DirectBinaryDecoder"; as you can see in the 
stack trace, the doReadBytes method indeed points to this class.
3) So the readInt() that returns "-51" must be the one in the 
DirectBinaryDecoder class, which I list here:

https://github.com/apache/avro/blob/release-1.7.4/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryDecoder.java

starting from line 97. But I really don't know why, in my case, it returns 
"-51" (see the sketch below for one possible explanation).
4) I tested with Avro 1.7.7 (the latest release of 1.7.x), and the same 
problem still happens.
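
One possible explanation for the "-51" (my guess, not something I have 
proven): Avro encodes string lengths as zig-zag varints, and the single byte 
0x65, which is ASCII 'e' and occurs in the UUID above, zig-zag decodes to 
exactly -51. So if the decoder ever loses alignment with the stream and reads 
a character of string data as the next length, readInt() would return a value 
like this. A self-contained sketch (readVarInt is my own helper, mirroring the 
varint logic in DirectBinaryDecoder.readInt()):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ZigZagDemo {

  // Mirrors the varint-plus-zig-zag decoding done by DirectBinaryDecoder.readInt().
  static int readVarInt(InputStream in) throws IOException {
    int n = 0;
    int shift = 0;
    do {
      int b = in.read();
      if (b < 0) {
        throw new IOException("EOF while reading a varint");
      }
      n |= (b & 0x7f) << shift;
      if ((b & 0x80) == 0) {
        return (n >>> 1) ^ -(n & 1);   // zig-zag decode
      }
      shift += 7;
    } while (shift < 32);
    throw new IOException("Invalid int encoding");
  }

  public static void main(String[] args) throws IOException {
    // 0x65 is ASCII 'e', a byte that occurs in "3a3ffb10be8b11e3977ad4ae5284344f".
    // Interpreted as a length varint, it decodes to -51.
    InputStream in = new ByteArrayInputStream(new byte[] { 0x65 });
    System.out.println(readVarInt(in));   // prints -51
  }
}

If that is what is happening, the records themselves are fine and the 
reducer's input stream is simply desynchronized, which would also be 
consistent with Hive reading the same data without any issue.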

Thanks

> Strange IndexOutofBoundException in GenericDatumReader.readString
> -----------------------------------------------------------------
>
>                 Key: AVRO-1786
>                 URL: https://issues.apache.org/jira/browse/AVRO-1786
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.4
>         Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM:
> IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT enabled)
>            Reporter: Yong Zhang
>
> We are using the Avro 1.7.4 that comes with IBM BigInsights V3.0.0.2, which 
> also includes Apache Hadoop 2.2.0.
> There is a dataset stored in Avro format, and a daily delta of this dataset 
> is ingested into HDFS in CSV format. The ETL transforms the CSV into Avro in 
> a 1st MR job, then merges the new Avro records with the existing Avro 
> records in a 2nd MR job; the final output is a new merged dataset, stored in 
> Avro format in HDFS and ready to use.
> This worked for over a year until, suddenly, a strange exception was thrown 
> from the 2nd MR job. Here is the stack trace:
> java.lang.IndexOutOfBoundsException
>       at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>       at java.io.DataInputStream.read(DataInputStream.java:160)
>       at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>       at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>       at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>       at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>       at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>       at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>       at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>       at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>       at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>       at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>       at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>       at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>       at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>       at java.security.AccessController.doPrivileged(AccessController.java:366)
>       at javax.security.auth.Subject.doAs(Subject.java:572)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
>       at org.apache.hadoop.mapred.Child.main(Child.java:249)
> It is thrown from the reducer. The input of the MR job in fact contains 2 
> parts: one is the new Avro records generated from the delta; the other is 
> the existing latest Avro records from before the delta. Our reducer code 
> does not appear to have been invoked yet; Avro fails while reading the 
> records, and I am not 100% sure which of the 2 parts it complains about.
> When I first faced this error, I thought the new delta contained some data 
> causing the issue, but after testing different cases, it looks more like 
> Avro cannot read data generated by itself. The stack trace suggests the 
> final Avro data is corrupted, but I don't know why or how.
> My test cases are as follows:
> 1) Day 1 data + day 2 delta -> merged without issue, and generated day 2 data
> 2) Day 2 data + day 3 delta -> got the exception in the MR job
> 3) Day 1 data + day 2 delta + day 3 delta -> merged without issue and 
> generated day 3 data, but after that, it cannot merge with any new incoming 
> delta.
> I tried Avro 1.7.7, which looks like the latest release so far, but I still 
> get this error. I believe some data may trigger this bug, but the data in 
> the CSV files all looks fine.
> Here is the Avro schema we defined for this dataset:
> {
>     "namespace" : "xxxxxx",
>     "type" : "record",
>     "name" : "Lists",
>     "fields" : [
>         {"name" : "account_id", "type" : "long"},
>         {"name" : "list_id", "type" : "string"},
>         {"name" : "sequence_id", "type" : ["int", "null"]} ,
>         {"name" : "name", "type" : ["string", "null"]},
>         {"name" : "state", "type" : ["string", "null"]},
>         {"name" : "description", "type" : ["string", "null"]},
>         {"name" : "dynamic_filtered_list", "type" : ["int", "null"]},
>         {"name" : "filter_criteria", "type" : ["string", "null"]},
>         {"name" : "created_at", "type" : ["long", "null"]},
>         {"name" : "updated_at", "type" : ["long", "null"]},
>         {"name" : "deleted_at", "type" : ["long", "null"]},
>         {"name" : "favorite", "type" : ["int", "null"]},
>         {"name" : "delta", "type" : ["boolean", "null"]},
>         {
>             "name" : "list_memberships", "type" : {
>                 "type" : "array", "items" : {
>                     "name" : "ListMembership", "type" : "record",
>                     "fields" : [
>                         {"name" : "channel_id", "type" : "string"},
>                         {"name" : "created_at", "type" : ["long", "null"]},
>                         {"name" : "created_source", "type" : ["string", "null"]},
>                         {"name" : "deleted_at", "type" : ["long", "null"]},
>                         {"name" : "sequence_id", "type" : ["long", "null"]}
>                     ]
>                 }
>             }
>         }
>     ]
> }


