[ https://issues.apache.org/jira/browse/HADOOP-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
BELUGA BEHR updated HADOOP-11678:
---------------------------------
    Attachment: HADOOP-11678.2.patch

> AvroSerializer buffers output in violation of contract for Serializer
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-11678
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11678
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Matthew Willson
>            Assignee: BELUGA BEHR
>         Attachments: HADOOP-11678.1.patch, HADOOP-11678.2.patch
>
>
> We've had issues with the deserializer running into EOFException when using
> Cascading's TupleSerialization (which delegates to other hadoop serializers
> to serialize entries within its tuples) in combination with AvroSerialization.
>
> Eventually tracked it down to the fact that AvroSerialization#AvroSerializer
> is buffering output (since it uses a buffering EncoderFactory#binaryEncoder
> rather than a non-buffering EncoderFactory#directBinaryEncoder):
> https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/serializer/avro/AvroSerialization.java#L105
>
> The contract for Serializer explicitly states "Serializers ... must not
> buffer the output since other producers may write to the output between calls
> to #serialize(Object)." TupleSerialization does exactly that (write to the
> output between calls to #serialize), hence our problem.
>
> There's a similar problem with the AvroDeserializer too -- it uses a
> buffering binaryDecoder, and this can consume the underlying InputStream
> beyond the end of the datum it's decoding, meaning that if a different
> Deserializer is used to read the next item, it'll start off in the wrong
> place and get confused.
>
> Switching AvroSerializer and AvroDeserializer to use the non-buffering
> `EncoderFactory#directBinaryEncoder` and `DecoderFactory#directBinaryDecoder`
> fixes the issue for us.
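
The ordering problem described in the report can be reproduced without Avro or Hadoop. The sketch below is only an analogy, not the code from the attached patches: it uses `java.io.BufferedOutputStream` as a stand-in for a buffering encoder and a raw stream as a stand-in for a direct one. The class name `BufferingDemo` and the helper `interleave` are hypothetical names chosen for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

public class BufferingDemo {

    // Hypothetical helper: a "serializer" writes 1 and then 3 through its own
    // view of the shared stream, while another producer writes 2 directly to
    // the shared stream between the two calls.
    static byte[] interleave(boolean buffered) throws IOException {
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        // Stand-in for the encoder choice: buffering wrapper vs. direct writes.
        OutputStream serializerView =
                buffered ? new BufferedOutputStream(shared) : shared;

        serializerView.write(1); // first serialize() call
        shared.write(2);         // another producer writes between calls
        serializerView.write(3); // second serialize() call
        serializerView.flush();  // buffered bytes reach the shared stream only here

        return shared.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("direct:   " + Arrays.toString(interleave(false)));
        System.out.println("buffered: " + Arrays.toString(interleave(true)));
    }
}
```

With direct writes the bytes arrive in call order, `[1, 2, 3]`. With the buffering wrapper the other producer's byte lands first, `[2, 1, 3]`, so a reader expecting the serializer's datum at the start of the stream would be misaligned. This is the same kind of misordering the report attributes to using `EncoderFactory#binaryEncoder` where `EncoderFactory#directBinaryEncoder` is needed.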