Hello!  Good catch, I'm taking a look, but it looks like you're
entirely correct and there isn't any obvious workaround.  I guess you
could regenerate every SpecificRecord class in order to add the
"java-class" or "avro.java.string" annotation, but that shouldn't be
necessary.

>From the Avro perspective, we should always have been using
SpecificDatumReader/Writer for all generated SpecificRecords...  We
would still have the same Utf8 and .toString problems, but at least
there would be no change in behaviour during migration :/

As a side note, the Apache Avro project should probably reconsider
whether the Utf8 class still adds any value with modern JVMs!  If I
understand correctly, it was originally in place because Hadoop had a
performance boost when it could reuse mutable data containers.

Moving forward, I think your suggestion is the most pragmatic: either
add a configuration option to AvroCoder to always drop to ReflectData,
or explicitly provide a ReflectAvroCoder that only uses reflection.

I took the liberty of creating the JIRA
https://issues.apache.org/jira/browse/BEAM-12628 JIRA, so I could
create an link an Avro issue!  Please feel free to update if I missed
anything.

Best regards, Ryan

On Thu, Jul 15, 2021 at 10:53 PM Claire McGinty
<[email protected]> wrote:
>
> Hi all,
>
> When upgrading from Beam 2.29.0 to 2.30.0, we encountered some unexpected 
> runtime issues due to changes from BEAM-2303. This PR updated  AvroCoder to 
> use SpecificDatum{Reader,Writer} instead ofReflectDatum{Reader,Writer} in its 
> implementation.
>
> When using the Reflect* suite, Avro string fields have getters/setters 
> defined with a CharSequence signature, but are by default decoded as 
> java.lang.Strings [1]. But the Specific* suitehas a different default 
> behavior for decoding Avro string fields: unless the Avro schema property 
> "java-class" is set to "java.lang.String", the decoded CharSequences will by 
> default be implemented as org.apache.avro.util.Utf8 objects [2].
>
> This is causing some migration pain for us as we're having to either add the 
> java-class property to all string field schemas, or call .toString on a lot 
> of fields we could just cast before. Additionally, Utf8 isn't Serializable 
> and there's no default Coder representation for it. Beam's 
> AvroSink/AvroSource still use the Reflect* reader/writer, as well.I created a 
> quick Gist to demonstrate the issue: [3].
>
> I'm wondering if there's any possibility of making the use of Reflect* vs 
> Specific* configurable in AvroCoder, or maybe setting a default String type 
> in the coder constructor.  If not, maybe this change should be documented in 
> the release notes?
>
> Thanks,
> Claire

Reply via email to