[ https://issues.apache.org/jira/browse/AVRO-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924222#comment-13924222 ]

Doug Cutting commented on AVRO-1475:
------------------------------------

I would expect errors while writing in the mapper to fail the mapper and 
subsequently fail the job.  If an application were to catch that exception and 
continue, it might well produce the error you describe, but this should not 
happen unless the application catches and ignores the exception.  If it does 
happen anyway, that's a bug we should fix, and a reproducible test case would 
help.
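
For what it's worth, a job along roughly these lines ought to exercise that 
scenario (just a sketch against the avro-mapred "new" mapreduce API; the 
schema, class name, and the every-100th-record rule are made up for 
illustration):

    // Sketch only: a mapper whose Avro map-output key schema has a required
    // "name" (string) field followed by a required "id" (long) field.  Every
    // 100th record leaves "id" unset, so serializing that map output should
    // fail after the "name" bytes have already been written.
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BadDatumMapper
        extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {

      static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"id\",\"type\":\"long\"}]}");

      private long count = 0;

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("name", value.toString());
        if (++count % 100 != 0) {
          rec.put("id", key.get());   // every 100th record is left schema-invalid
        }
        context.write(new AvroKey<GenericRecord>(rec), NullWritable.get());
      }
    }

    // Driver (sketch): AvroJob.setMapOutputKeySchema(job, BadDatumMapper.SCHEMA);

The question is then whether such a job fails cleanly in the map task or 
whether the reducers die with the ArrayIndexOutOfBoundsException quoted below.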

That said, we might add a feature to permit applications to catch exceptions 
while writing map outputs and continue.  This, as you suggest, would require 
another layer of buffering.  Note that DataFileWriter has such a buffer and 
already permits one to continue after such exceptions, but map outputs do not 
use DataFileWriter and do not yet permit this.
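
For comparison, continuing past a bad datum with DataFileWriter looks roughly 
like the sketch below: the append lands in the block buffer first, so a 
failure can be caught and the record skipped without corrupting the file. 
(The method name is illustrative.)

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // Sketch: write records to an Avro data file, skipping any datum that
    // fails to serialize against the schema.
    public static void writeSkippingBadRecords(Schema schema,
                                               Iterable<GenericRecord> records,
                                               File out) throws IOException {
      DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
          new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, out);
      try {
        for (GenericRecord rec : records) {
          try {
            writer.append(rec);          // goes into the block buffer first
          } catch (RuntimeException e) {
            // In 1.7.x this is, I believe, a DataFileWriter.AppendWriteException;
            // the block buffer is rolled back, so we can log the bad record
            // and keep writing.
          }
        }
      } finally {
        writer.close();
      }
    }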

> Bad Avro write in mapper causes ArrayIndexOutOfBoundsException in reducer
> -------------------------------------------------------------------------
>
>                 Key: AVRO-1475
>                 URL: https://issues.apache.org/jira/browse/AVRO-1475
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.6
>         Environment: hadoop 1.0.3
>            Reporter: Cliff Resnick
>
> This may or may not be a retread of AVRO-792 (which was inconclusively 
> resolved), but it's consistently occurring in 1.7.6. The role of the map 
> phase in the problem was not addressed in AVRO-792, so I'm bringing it up here.
> My hypothesis is that the AvroSerializer sinks bytes to the output as it 
> validates, so when a write of a schema-invalid datum is attempted in the 
> mapper, the data written up to that point still makes its way to the reducer, 
> even though an exception is thrown in the mapper. This results in the reduce 
> failing with a mysterious ArrayIndexOutOfBoundsException, shown below. To 
> test this I had the AvroDeserializer catch and log the exception rather than 
> throw it, and the counts of reduce-side errors matched the mapper errors exactly.
> The obvious problem with the current situation is that there is no way for a 
> mapreduce job to omit bad data without serializing it twice, first to verify 
> and then to write (a sketch of that double-pass check follows the stack trace 
> below). A proper solution would be for the Encoder to transactionally buffer 
> the serialized bytes so that when serialization breaks, it is caught in the 
> mapper without bringing down the entire mapreduce job (since the error below 
> occurs before any "userland" mapreduce code is called).
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 10
>       at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:365)
>       at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>       at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>       at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
>       at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>       at org.apache.avro.reflect.ReflectDatumReader.readField(ReflectDatumReader.java:230)
>       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
>       at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:123)
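
For reference, the double-pass check mentioned above might look roughly like 
the following sketch; the DatumValidator helper is illustrative and not an 
Avro API. Each datum is serialized to a scratch buffer first and only emitted 
to the map output if that succeeds.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    // Illustrative helper (not an Avro API): pre-serialize a datum to a
    // throwaway buffer so schema violations are caught before the real
    // map-output write.
    public class DatumValidator {
      private final GenericDatumWriter<GenericRecord> writer;
      private final ByteArrayOutputStream scratch = new ByteArrayOutputStream();
      private BinaryEncoder encoder;

      public DatumValidator(Schema schema) {
        this.writer = new GenericDatumWriter<GenericRecord>(schema);
      }

      /** Returns true if the datum serializes cleanly against the schema. */
      public boolean isValid(GenericRecord datum) {
        scratch.reset();
        encoder = EncoderFactory.get().binaryEncoder(scratch, encoder);
        try {
          writer.write(datum, encoder);
          encoder.flush();
          return true;
        } catch (Exception e) {
          return false;
        }
      }
    }

    // In the mapper (sketch):
    //   if (validator.isValid(rec)) {
    //     context.write(new AvroKey<GenericRecord>(rec), NullWritable.get());
    //   }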



--
This message was sent by Atlassian JIRA
(v6.2#6252)
