UTF-8 passthrough with beam Rows

Steve Niemitz Tue, 30 Nov 2021 07:49:36 -0800

A common pattern we keep running into with Beam Rows looks something like:
- Read data from source X
- Convert to Row
- Encode the Row (generally for cross-language, i.e. xlang, use)


In cases like this, I've noticed that we spend a significant share of the
time (30%+) just decoding and re-encoding strings.

Avro has a nice solution to this with its Utf8 class [1], which defers
decoding the string until it is actually needed.  Has there been any thought
about optimizing this in Beam as well?  It doesn't seem like it would be hard
to support in the current RowCoder implementation.
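
To make the idea concrete, here is a rough sketch of what a deferred-decoding
string wrapper could look like.  This is only an illustration of the general
technique: LazyUtf8 and its methods are hypothetical names made up for this
example, not an existing Beam or Avro API, and it is not how Avro's Utf8 is
actually implemented.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical lazily decoded string; not part of Beam or Avro. */
final class LazyUtf8 {
  // Raw UTF-8 payload exactly as it was read from the source.
  private final byte[] bytes;
  // Cached decoded form; stays null until someone asks for a String.
  private String decoded;

  LazyUtf8(byte[] utf8Bytes) {
    this.bytes = utf8Bytes;
  }

  /** Passthrough path: returns the original bytes without decoding them. */
  byte[] encodedBytes() {
    return bytes;
  }

  /** True once the String form has been materialized. */
  boolean isDecoded() {
    return decoded != null;
  }

  /** Decodes on first use only, then reuses the cached String. */
  @Override
  public String toString() {
    if (decoded == null) {
      decoded = new String(bytes, StandardCharsets.UTF_8);
    }
    return decoded;
  }
}
```

With something like this, a coder could write encodedBytes() straight to the
output in the read -> Row -> encode case, and the decode cost would only be
paid if a transform actually calls toString().  The sketch is not thread-safe
and assumes the caller does not mutate the byte array after handing it over.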

[1]
https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html
