[ https://issues.apache.org/jira/browse/HADOOP-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Willson updated HADOOP-11678:
-------------------------------------
    Description: 
We've had issues with the deserializer running into EOFException when using 
Cascading's TupleSerialization (which delegates to other Hadoop serializers to 
serialize entries within its tuples) in combination with AvroSerialization.

We eventually tracked it down to the fact that AvroSerialization#AvroSerializer 
is buffering output (since it uses a buffering BinaryEncoder rather than a 
non-buffering directBinaryEncoder):

https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105

The contract for Serializer explicitly states "Serializers ... must not buffer 
the output since other producers may write to the output between calls to 
#serialize(Object)."

TupleSerialization does exactly that (write to the output between calls to 
#serialize), hence our problem.

Switching AvroSerialization to use the non-buffering 
`EncoderFactory#directBinaryEncoder` and `DecoderFactory#directBinaryDecoder` 
fixes the issue for us.

  was:
We've had issues with the deserializer running into EOFException when using 
Cascading's TupleSerialization (which delegates to other hadoop serializers to 
serialize entries within its tuples) in combination with AvroSerialization.

Eventually tracked it down to the fact that AvroSerialization#AvroSerializer is 
buffering output (since it uses an avro BinaryEncoder):

https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105

The contract for Serializer explicitly states "Serializers ... must not buffer 
the output since other producers may write to the output between calls to 
#serialize(Object)."

TupleSerialization does exactly that (write to the output between calls to 
#serialize), hence our problem.

Flushing the encoder only on close is not sufficient; it needs to be flushed 
after every write. Doing this fixes our issue.


> AvroSerializer buffers output in violation of contract for Serializer
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-11678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11678
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Matthew Willson
>
> We've had issues with the deserializer running into EOFException when using 
> Cascading's TupleSerialization (which delegates to other Hadoop serializers 
> to serialize entries within its tuples) in combination with AvroSerialization.
> We eventually tracked it down to the fact that 
> AvroSerialization#AvroSerializer is buffering output (since it uses a 
> buffering BinaryEncoder rather than a non-buffering directBinaryEncoder):
> https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105
> The contract for Serializer explicitly states "Serializers ... must not 
> buffer the output since other producers may write to the output between calls 
> to #serialize(Object)."
> TupleSerialization does exactly that (write to the output between calls to 
> #serialize), hence our problem.
> Switching AvroSerialization to use the non-buffering 
> `EncoderFactory#directBinaryEncoder` and `DecoderFactory#directBinaryDecoder` 
> fixes the issue for us.
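
The interleaving failure described above can be illustrated with plain java.io 
streams. This is only an analogy, not the Hadoop or Avro code: 
`BufferedOutputStream` stands in for a buffering encoder such as the one 
returned by Avro's `EncoderFactory.get().binaryEncoder(out, null)`, and writing 
straight to the shared stream stands in for `directBinaryEncoder(out, null)`. 
The class name `BufferingDemo`, the method `interleave`, and the record and 
delimiter bytes are all hypothetical names chosen for the sketch.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop, Avro, or Cascading.
class BufferingDemo {

    /**
     * Writes two records through a "serializer" stream while another producer
     * writes a '|' delimiter directly to the shared underlying stream between
     * the calls, which is the pattern the Serializer contract allows for.
     */
    public static String interleave(boolean buffered) {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        // Assumption: BufferedOutputStream models a buffering encoder;
        // using the shared stream directly models a non-buffering one.
        OutputStream serializerOut =
                buffered ? new BufferedOutputStream(shared) : shared;
        try {
            serializerOut.write("A".getBytes(StandardCharsets.US_ASCII));
            shared.write('|'); // other producer writes between serialize calls
            serializerOut.write("B".getBytes(StandardCharsets.US_ASCII));
            shared.write('|');
            serializerOut.flush(); // flushing only at the end, as on close
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(shared.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("direct:   " + interleave(false)); // A|B|
        System.out.println("buffered: " + interleave(true));  // ||AB
    }
}
```

In the buffered case the records reach the shared stream only at the final 
flush, after both delimiters, so a reader expecting a record before the first 
delimiter finds nothing there. That is consistent with the EOFException seen on 
deserialization, though the exact failure in the real stack depends on what 
TupleSerialization writes between calls.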



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
