Cliff Resnick created AVRO-1475:
-----------------------------------

             Summary: Bad Avro write in mapper causes ArrayIndexOutOfBoundsException in reducer
                 Key: AVRO-1475
                 URL: https://issues.apache.org/jira/browse/AVRO-1475
             Project: Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.7.6
         Environment: hadoop 1.0.3
            Reporter: Cliff Resnick


This may or may not be a retread of AVRO-792 (which was inconclusively 
resolved), but it occurs consistently in 1.7.6. The role of the map phase 
in the problem was not addressed in AVRO-792, so I'm bringing it up here.

My hypothesis is that the AvroSerializer sinks bytes to output as it validates, 
so when a write of a schema-invalid datum is attempted in the mapper, the bytes 
written up to that point make their way to the reducer, even though an exception 
is thrown in the mapper. This results in a reduce task failing with a mysterious 
ArrayIndexOutOfBoundsException, as shown in the stack trace at the end of this 
report. To test this, I changed the AvroDeserializer to catch and log the 
exception rather than throw it, and the number of reduce errors matched the 
number of mapper errors exactly.

The obvious problem with the current situation is that there is no way for a 
mapreduce job to omit bad data without serializing it twice, first to verify 
and then to write. A proper solution would be for the Encoder to buffer the 
serialized bytes transactionally, so that when serialization fails, the failure 
is caught in the mapper and no partial record reaches the output. As it stands, 
the entire mapreduce job blows up, since the error in the reducer occurs before 
any "userland" mapreduce code is called.
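
To make the buffering idea concrete, here is a minimal sketch. It is not Avro 
code and uses only the JDK. BufferedRecordWriter, RecordSerializer and tryWrite 
are hypothetical names made up for illustration, and the sketch assumes that a 
schema-invalid datum surfaces as an unchecked exception partway through 
serialization. Each record is serialized into a scratch buffer first and copied 
to the real output only if serialization completes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper, not part of Avro or Hadoop.
final class BufferedRecordWriter {

    // Stand-in for any serializer that may fail after writing some bytes.
    interface RecordSerializer<T> {
        void serialize(T datum, OutputStream out) throws IOException;
    }

    private final OutputStream sink;
    private final ByteArrayOutputStream scratch = new ByteArrayOutputStream();

    BufferedRecordWriter(OutputStream sink) {
        this.sink = sink;
    }

    // Returns true if the record was written to the sink. Returns false if
    // the serializer threw an unchecked exception (assumed here to mean
    // "invalid datum"); in that case no bytes of the record reach the sink.
    <T> boolean tryWrite(T datum, RecordSerializer<T> serializer) throws IOException {
        scratch.reset();                       // discard leftovers from a failed record
        try {
            serializer.serialize(datum, scratch);
        } catch (RuntimeException e) {
            return false;                      // partial bytes stay in scratch only
        }
        scratch.writeTo(sink);                 // commit the complete record
        return true;
    }
}
```

With something like this in the write path, a mapper could skip or count a bad 
record when tryWrite returns false, at the cost of one extra in-memory copy per 
record rather than a second serialization pass.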




Caused by: java.lang.ArrayIndexOutOfBoundsException: 10
        at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:365)
        at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
        at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
        at org.apache.avro.reflect.ReflectDatumReader.readField(ReflectDatumReader.java:230)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
        at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:123)





--
This message was sent by Atlassian JIRA
(v6.2#6252)
