[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-04 Thread BELUGA BEHR (JIRA)

[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114574#comment-16114574 ]

BELUGA BEHR edited comment on AVRO-1786 at 8/4/17 4:17 PM:
---

Regarding the original case, it is interesting to note that the Reducer 
failed while [deserializing the 
key|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/ReduceContextImpl.java#L142].
However, the Map output key class is set with 
{{mergeJob.setMapOutputKeyClass(Contact2MergePartitionKey.class)}}, and 
{{Contact2MergePartitionKey}} is not an {{AvroKey}}.  So I'm not sure why the 
Avro deserializer is in play here at all.
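
One way to check is to ask Hadoop directly which {{Serialization}} it 
resolves for that class. A hypothetical probe (the stand-in key class below 
is mine; the real {{Contact2MergePartitionKey}} source is not in this 
thread): {{SerializationFactory}} returns the first entry in 
{{io.serializations}} whose {{accept(Class)}} matches, and for example 
{{AvroReflectSerialization}} will claim plain Java classes whose packages 
appear in {{avro.reflect.pkgs}}, which would put an Avro reflect reader in 
the key path.

{code}
// Hypothetical probe (not from the reporter's job): ask Hadoop which
// Serialization it resolves for the map output key class.
// SerializationFactory walks the classes listed in "io.serializations" and
// returns the first one whose accept(Class) is true.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.Serialization;
import org.apache.hadoop.io.serializer.SerializationFactory;

public class SerializationProbe {
  // Stand-in for Contact2MergePartitionKey, whose source is not in this thread.
  static class Contact2MergePartitionKey {}

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    SerializationFactory factory = new SerializationFactory(conf);
    Serialization<Contact2MergePartitionKey> s =
        factory.getSerialization(Contact2MergePartitionKey.class);
    System.out.println(s == null ? "no serialization accepts this class"
                                 : s.getClass().getName());
  }
}
{code}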


> Strange IndexOutofBoundException in GenericDatumReader.readString
> -----------------------------------------------------------------
>
> Key: AVRO-1786
> URL: https://issues.apache.org/jira/browse/AVRO-1786
> Project: Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.7.4, 1.7.7
> Environment: CentOS 6.5 Linux x64, 2.6.32-358.14.1.el6.x86_64
> Use IBM JVM: IBM J9 VM (build 2.7, JRE 1.7.0 Linux amd64-64 Compressed References 20140515_199835 (JIT enabled, AOT enabled)
> Reporter: Yong Zhang
> Priority: Minor
>
> Our production cluster is CentOS 6.5 (2.6.32-358.14.1.el6.x86_64), running 
> IBM BigInsights V3.0.0.2. In Apache terms, that is Hadoop 2.2.0 with MRv1 (no 
> YARN), and it comes with Avro 1.7.4, running on the IBM J9 VM (build 2.7, JRE 
> 1.7.0 Linux amd64-64 Compressed References 20140515_199835, JIT enabled, AOT 
> enabled). Not sure if the JDK matters, but it is NOT the Oracle JVM.
> We have an ETL implemented as a chain of MR jobs. One MR job merges two sets 
> of Avro data: Dataset1 is in HDFS location A, Dataset2 is in HDFS location B, 
> and both contain Avro records bound to the same Avro schema. Each record 
> contains a unique ID field and a timestamp field. The MR job merges the 
> records by ID, replacing the record with the earlier timestamp with the one 
> with the later timestamp, and emits the final Avro record. Very 
> straightforward.
> Now one reducer keeps failing with the following stack trace on the 
> JobTracker:
> {code}
> java.lang.IndexOutOfBoundsException
>   at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:191)
>   at java.io.DataInputStream.read(DataInputStream.java:160)
>   at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:184)
>   at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>   at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>   at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>   at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:143)
>   at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:125)
>   at org.apache.avro.reflect.ReflectDatumReader.readString(ReflectDatumReader.java:121)
>   at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>   at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>   at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>   at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>   at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
>   at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
>   at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
>   at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
>   at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> {code}
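
Note the top frame: {{ByteArrayInputStream.read}} throws 
{{IndexOutOfBoundsException}} only when it is handed an invalid 
(offset, length) pair for the destination buffer, not at end of stream, so a 
corrupted length prefix in the Avro data is one way this exception shape can 
arise. A minimal, hypothetical illustration of the mechanism (not the 
reporter's code):

{code}
import java.io.ByteArrayInputStream;

public class CorruptLengthDemo {
  public static void main(String[] args) {
    ByteArrayInputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
    byte[] buf = new byte[8];
    // If a decoder trusts a corrupted length prefix, it can pass an invalid
    // length down to read(); ByteArrayInputStream then throws
    // IndexOutOfBoundsException instead of signaling end-of-stream with -1.
    in.read(buf, 0, -5);  // throws java.lang.IndexOutOfBoundsException
  }
}
{code}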



[jira] [Comment Edited] (AVRO-1786) Strange IndexOutofBoundException in GenericDatumReader.readString

2017-08-01 Thread BELUGA BEHR (JIRA)

[ https://issues.apache.org/jira/browse/AVRO-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110083#comment-16110083 ]

BELUGA BEHR edited comment on AVRO-1786 at 8/2/17 1:10 AM:
---

I created a fuzzer to test random combinations of values to try to reproduce 
this issue, but had no luck. During my investigation, I hit [HADOOP-11678] 
because I forgot to set the Serializer to 
{{org.apache.avro.hadoop.io.AvroSerialization}} and was instead getting the 
default Hadoop Avro serialization class, 
{{org.apache.hadoop.io.serializer.avro.AvroSerialization}}, which is badly 
broken because it erroneously buffers data. While the Serialization class was 
misconfigured, I was getting failures with stack traces very similar to the 
one reported here, which makes me think there may be a buffer that isn't 
being flushed during serialization. Once I corrected the mistake, my fuzzer 
ran for several hours and did not encounter any issues serializing or 
deserializing the random data. I was serializing the data directly to an 
[IFile|https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/IFile.java]
 Writer and then using the IFile Reader to deserialize the records back into 
objects.
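
A sketch of the kind of job setup that avoids that mixup (the schemas below 
are placeholders, not my harness's actual code):

{code}
// Sketch: register org.apache.avro.hadoop.io.AvroSerialization explicitly so
// the shuffle never falls back to the buffering
// org.apache.hadoop.io.serializer.avro.AvroSerialization.
import org.apache.avro.Schema;
import org.apache.avro.hadoop.io.AvroSerialization;
import org.apache.hadoop.conf.Configuration;

public class AvroShuffleSetup {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    AvroSerialization.addToConfiguration(conf); // adds it to "io.serializations"
    Schema keySchema = Schema.create(Schema.Type.STRING); // placeholder schema
    AvroSerialization.setKeyWriterSchema(conf, keySchema);
    AvroSerialization.setKeyReaderSchema(conf, keySchema);
    return conf;
  }
}
{code}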

We are observing the same behaviors:

# All Mappers finish without error
# Most of the reducers finish without error, but one reducer keeps failing with 
the above error.

Interestingly, our Map input and output are the same classes. That is to say, 
the Mappers filter and normalize the data; they don't create it. The 
{{AvroKey}} and {{AvroValue}} parameters passed to the Mapper are the ones it 
returns: no {{AvroKey}} or {{AvroValue}} objects are instantiated in the 
Mapper, and the inputs are passed straight through the map method. This 
demonstrates that the entire data set can be read by the Avro serialization 
framework, but something goes wrong during the intermediate-file stage. This 
may be an issue with how the Hadoop MapReduce framework interacts with Avro's 
library.
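
For context, a sketch of a pass-through Mapper of that shape (the type 
parameters are hypothetical; the real classes are not in this thread):

{code}
// Sketch: identity-style mapper that filters/normalizes but never allocates
// new AvroKey/AvroValue objects; the input wrappers are emitted as-is.
import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper
    extends Mapper<AvroKey<String>, AvroValue<GenericRecord>,
                   AvroKey<String>, AvroValue<GenericRecord>> {
  @Override
  protected void map(AvroKey<String> key, AvroValue<GenericRecord> value,
                     Context context) throws IOException, InterruptedException {
    // Hypothetical filter; the real normalization logic is not in the issue.
    if (value.datum() != null) {
      context.write(key, value);  // same objects, straight through
    }
  }
}
{code}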


