Hi all,

When upgrading from Beam 2.29.0 to 2.30.0, we encountered some unexpected
runtime issues due to changes from BEAM-2303
<https://github.com/apache/beam/pull/14410>. This PR updated AvroCoder to
use SpecificDatum{Reader,Writer} instead of ReflectDatum{Reader,Writer} in
its implementation.

When using the Reflect* suite, Avro string fields have getters/setters
defined with a CharSequence signature, but are by default decoded as
java.lang.Strings [1]
<https://github.com/apache/avro/blob/release-1.8.2/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectDatumReader.java#L229>.
But the Specific* suite has a different default behavior for decoding Avro
string fields: unless the Avro schema property "java-class" is set to
"java.lang.String", the decoded CharSequences will by default be
implemented as org.apache.avro.util.Utf8 objects [2]
<https://github.com/apache/avro/blob/release-1.8.2/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java#L408>.

This is causing some migration pain for us, as we have to either add the
java-class property to all string field schemas or call .toString() on a
lot of fields we could previously just cast. Additionally, Utf8 isn't
Serializable and there's no default Coder representation for it. Beam's
AvroSink/AvroSource also still use the Reflect* reader/writer.

I created a quick Gist to demonstrate the issue: [3]
<https://gist.github.com/clairemcginty/97ee6b33c0b5633d5d42d29b1d057d85>.

I'm wondering if there's any possibility of making the use of Reflect* vs
Specific* configurable in AvroCoder, or maybe setting a default String type
in the coder constructor. If not, maybe this change should be documented
in the release notes?
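
For anyone who wants to see the cast-vs-toString failure mode without
pulling in Avro or Beam, here is a minimal stdlib-only sketch. It is an
illustration, not the code from the Gist: the class and method names are
made up, and StringBuilder is only a stand-in for
org.apache.avro.util.Utf8 (both are CharSequence implementations that are
not java.lang.String).

```java
// Hypothetical demo class; none of these names come from Beam or Avro.
final class CharSequenceDemo {

    // Old pattern: only safe when the decoder really returns java.lang.String.
    static String viaCast(CharSequence value) {
        return (String) value;
    }

    // Defensive pattern: works for any CharSequence implementation.
    static String viaToString(CharSequence value) {
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        // What a getter returned when the field was decoded as a String.
        CharSequence decodedAsString = "claire";
        // Stand-in for a field decoded as a non-String CharSequence (e.g. Utf8).
        CharSequence decodedAsOther = new StringBuilder("claire");

        System.out.println(viaCast(decodedAsString));
        System.out.println(viaToString(decodedAsOther));

        try {
            viaCast(decodedAsOther);
        } catch (ClassCastException e) {
            // The cast compiles fine but fails at runtime.
            System.out.println("cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

The point is just that the cast compiles against the CharSequence
signature either way, so the breakage only shows up at runtime once the
decoded type changes.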

Thanks,
Claire
