Avro String decoding changes in Beam 2.30.0

Claire McGinty Thu, 15 Jul 2021 13:53:15 -0700

Hi all,

When upgrading from Beam 2.29.0 to 2.30.0, we encountered some unexpected
runtime issues due to changes from BEAM-2303
<https://github.com/apache/beam/pull/14410>. This PR updated AvroCoder to
use SpecificDatum{Reader,Writer} instead of ReflectDatum{Reader,Writer} in
its implementation.


When using the Reflect* suite, Avro string fields have getters/setters
defined with a CharSequence signature, but are by default decoded as
java.lang.Strings [1]
<https://github.com/apache/avro/blob/release-1.8.2/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectDatumReader.java#L229>.
But the Specific* suite has a different default behavior for decoding Avro
string fields: unless the Avro schema property "java-class" is set to
"java.lang.String", the decoded CharSequences will by default be
implemented as org.apache.avro.util.Utf8 objects [2]
<https://github.com/apache/avro/blob/release-1.8.2/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java#L408>.

This is causing some migration pain for us, as we have to either add the
java-class property to all string field schemas, or call .toString on a
lot of fields that we could previously just cast. Additionally, Utf8 isn't
Serializable and there's no default Coder representation for it. Beam's
AvroSink/AvroSource also still use the Reflect* reader/writer, so the two
code paths now decode strings differently.

I created a quick Gist to demonstrate the issue: [3]
<https://gist.github.com/clairemcginty/97ee6b33c0b5633d5d42d29b1d057d85>.

I'm wondering if there's any possibility of making the use of Reflect* vs
Specific* configurable in AvroCoder, or maybe setting a default String type
in the coder constructor. If not, maybe this change should be documented
in the release notes?
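
For readers who want to see the cast problem without pulling in Avro, here
is a minimal sketch. It assumes nothing about Beam or Avro internals:
FakeUtf8 is a hypothetical stand-in for org.apache.avro.util.Utf8 (a
CharSequence that is not a java.lang.String), and the helper names are
illustrative only, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: no Avro or Beam classes are used here.
final class Utf8CastSketch {

    // Hypothetical stand-in for org.apache.avro.util.Utf8:
    // a CharSequence backed by UTF-8 bytes that is NOT a java.lang.String.
    static final class FakeUtf8 implements CharSequence {
        private final byte[] bytes;

        FakeUtf8(String value) {
            this.bytes = value.getBytes(StandardCharsets.UTF_8);
        }

        @Override public int length() { return toString().length(); }

        @Override public char charAt(int index) { return toString().charAt(index); }

        @Override public CharSequence subSequence(int start, int end) {
            return toString().subSequence(start, end);
        }

        @Override public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Access pattern that only works when the decoder produced a String.
    static String viaCast(CharSequence field) {
        return (String) field;
    }

    // Access pattern that works for any CharSequence implementation.
    static String viaToString(CharSequence field) {
        return field.toString();
    }

    // Reports whether a direct cast to String would succeed for this value.
    static boolean castSucceeds(CharSequence field) {
        try {
            viaCast(field);
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        CharSequence stringBacked = "hello";             // what a String-decoding reader hands back
        CharSequence utf8Backed = new FakeUtf8("hello"); // what a Utf8-decoding reader hands back

        System.out.println(castSucceeds(stringBacked)); // true
        System.out.println(castSucceeds(utf8Backed));   // false
        System.out.println(viaToString(utf8Backed));    // hello
    }
}
```

The getter signature is CharSequence in both cases, so the compiler cannot
flag the cast; the failure only shows up at runtime, which matches what we
hit after the upgrade.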

Thanks,
Claire
