[ https://issues.apache.org/jira/browse/HADOOP-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
BELUGA BEHR reassigned HADOOP-11678:
------------------------------------
Assignee: BELUGA BEHR
> AvroSerializer buffers output in violation of contract for Serializer
> ---------------------------------------------------------------------
>
> Key: HADOOP-11678
> URL: https://issues.apache.org/jira/browse/HADOOP-11678
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Matthew Willson
> Assignee: BELUGA BEHR
> Attachments: HADOOP-11678.1.patch, HADOOP-11678.2.patch
>
>
> We've had issues with the deserializer hitting an EOFException when using
> Cascading's TupleSerialization (which delegates to other Hadoop serializers
> to serialize the entries within its tuples) in combination with AvroSerialization.
> We eventually tracked it down to AvroSerialization#AvroSerializer
> buffering its output (it uses the buffering EncoderFactory#binaryEncoder
> rather than the non-buffering EncoderFactory#directBinaryEncoder):
> https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105
> The contract for Serializer explicitly states "Serializers ... must not
> buffer the output since other producers may write to the output between calls
> to #serialize(Object)." TupleSerialization does exactly that (write to the
> output between calls to #serialize), hence our problem.
> There's a similar problem with the AvroDeserializer too -- it uses a
> buffering binaryDecoder, and this can consume the underlying InputStream
> beyond the end of the datum it's decoding, meaning that if a different
> Deserializer is used to read the next item, it'll start off in the wrong
> place and get confused.
> Switching AvroSerializer and AvroDeserializer to use the non-buffering
> `EncoderFactory#directBinaryEncoder` and `DecoderFactory#directBinaryDecoder`
> fixes the issue for us.
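
The failure mode described above can be reproduced without Avro or Hadoop. The following is a minimal sketch using only java.io, assuming that BufferedOutputStream and BufferedInputStream are fair stand-ins for the buffering binaryEncoder and binaryDecoder. The class and method names (BufferingDemo, interleaveWrites, nextByteSeenByOtherReader) are hypothetical and are not part of Hadoop, Avro, or Cascading.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a stand-in for the serializer/deserializer pair.
class BufferingDemo {

    // Write side: a "serializer" writes 'A' and 'B' while another producer
    // writes '|' directly to the shared stream between those calls.
    static String interleaveWrites(boolean buffered) {
        try {
            ByteArrayOutputStream shared = new ByteArrayOutputStream();
            OutputStream serializerOut =
                buffered ? new BufferedOutputStream(shared, 64) : shared;
            serializerOut.write('A'); // first serialize() call
            shared.write('|');        // other producer writes in between
            serializerOut.write('B'); // second serialize() call
            shared.write('|');
            serializerOut.flush();    // buffered bytes only reach the stream here
            return new String(shared.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read side: a "deserializer" reads one byte, then a different reader
    // continues on the shared stream. Returns the byte that reader sees,
    // or -1 if the stream has already been drained.
    static int nextByteSeenByOtherReader(boolean buffered) {
        try {
            InputStream shared =
                new ByteArrayInputStream(new byte[] {'A', '|', 'B', '|'});
            InputStream deserializerIn =
                buffered ? new BufferedInputStream(shared, 64) : shared;
            deserializerIn.read();    // decode one datum ('A')
            return shared.read();     // next consumer reads from the shared stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(interleaveWrites(true));
        System.out.println(interleaveWrites(false));
        System.out.println(nextByteSeenByOtherReader(true));
        System.out.println(nextByteSeenByOtherReader(false));
    }
}
```

With buffering, the serializer's bytes land after the other producer's bytes, and the buffered reader drains the shared stream so the next reader sees end-of-stream, which is consistent with the EOFException reported here. Without buffering, both the write order and the read position are preserved.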