Matthew Willson created HADOOP-11678:
----------------------------------------
Summary: AvroSerializer buffers output in violation of contract for Serializer
Key: HADOOP-11678
URL: https://issues.apache.org/jira/browse/HADOOP-11678
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Matthew Willson
We've been hitting EOFException in the deserializer when using Cascading's
TupleSerialization (which delegates to other Hadoop serializers to serialize
the entries within its tuples) in combination with AvroSerialization.
We eventually tracked it down to the fact that AvroSerialization#AvroSerializer
buffers its output (since it uses an Avro BinaryEncoder):
https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105
The contract for Serializer explicitly states "Serializers ... must not buffer
the output since other producers may write to the output between calls to
#serialize(Object)."
TupleSerialization does exactly that (write to the output between calls to
#serialize), hence our problem.
It's not sufficient just to flush the encoder on close; it needs to be flushed
after every write. Doing this fixes our issue.
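
To illustrate the ordering problem without pulling in Avro or Cascading, here is
a minimal sketch using only java.io. The class name SerializerBufferingSketch
and the method interleave are hypothetical, and BufferedOutputStream merely
stands in for a buffering encoder; this is not the actual AvroSerializer code.
An outer producer writes a marker byte straight to the shared stream, then an
inner "serializer" writes its payload through a buffer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in: BufferedOutputStream plays the role of a buffering
// encoder wrapped around a stream that another producer also writes to.
class SerializerBufferingSketch {

    // Returns the bytes that end up in the shared stream, as ASCII text.
    static String interleave(boolean flushAfterEachWrite) {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        OutputStream buffered = new BufferedOutputStream(shared);
        try {
            for (int i = 0; i < 2; i++) {
                // Outer producer writes directly to the shared stream
                // between calls to the inner serializer.
                shared.write('T');
                // Inner serializer writes its payload through the buffer.
                buffered.write('a' + i);
                if (flushAfterEachWrite) {
                    buffered.flush();
                }
            }
            // Equivalent of flushing only on close.
            buffered.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(shared.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("flush on close only:    " + interleave(false));
        System.out.println("flush after each write: " + interleave(true));
    }
}
```

With the flush on close only, the buffered payload lands after both markers
("TTab"), so a reader expecting marker/payload pairs misparses the stream.
Flushing after each write keeps the order intact ("TaTb").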