Could this be a backwards-incompatible change that would break pipelines
when they are upgraded? If a pipeline has data in flight between operators
and we change the coder, would it break?
I know very little about coders, but since nobody has mentioned it, I
wanted to make sure we have it in mind.
-P.

On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <k...@apache.org> wrote:

> Agree that a coder URN defines the encoding. I see that string UTF-8 was
> added to the proto enum, but it needs a written spec of the encoding.
> Ideally some test data that different languages can use to drive compliance
> testing.
>
> Kenn
>
> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <rob...@frantil.com> wrote:
>
>> String UTF8 was recently added as a "standard coder" URN in the protos,
>> but I don't think that developed beyond Java, so adding it to Python would
>> be reasonable in my opinion.
>>
>> The Go SDK handles Strings as "custom coders" presently which for Go are
>> always length prefixed (and reported to the Runner as LP+CustomCoder). It
>> would be straightforward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>>
>>
>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <heej...@google.com> wrote:
>>
>>> Hi all,
>>>
>>> It looks like the UTF-8 string coders in the Java and Python SDKs use different
>>> encoding schemes. StringUtf8Coder in the Java SDK puts the varint length of the
>>> input string before the actual data bytes, whereas StrUtf8Coder in the Python SDK
>>> directly encodes the input string to a bytes value. For the last few weeks,
>>> I've been testing and fixing cross-language IO transforms and this
>>> discrepancy is a major blocker for me. IMO, we should unify the encoding
>>> schemes of UTF8 strings across the different SDKs and make it a standard
>>> coder. Any thoughts?
>>>
>>> Thanks,
>>>
>>
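
For readers following the thread, the two encodings described in the original message can be sketched in plain Python. This is only an illustration under stated assumptions: it does not call any Beam API, the function names (encode_varint, encode_length_prefixed, encode_raw, decode_length_prefixed) are hypothetical, and the varint is assumed to be the usual base-128 form with 7 bits per byte, low-order group first.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (assumed format)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative lengths")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low)         # last byte: continuation bit clear
            return bytes(out)


def encode_length_prefixed(s: str) -> bytes:
    """Varint byte length followed by the UTF-8 bytes (the Java-side behaviour as described in the thread)."""
    data = s.encode("utf-8")
    return encode_varint(len(data)) + data


def encode_raw(s: str) -> bytes:
    """UTF-8 bytes only, with no prefix (the Python-side behaviour as described in the thread)."""
    return s.encode("utf-8")


def decode_length_prefixed(buf: bytes) -> str:
    """Read the varint length, then decode exactly that many UTF-8 bytes."""
    length = 0
    shift = 0
    pos = 0
    while True:
        b = buf[pos]
        pos += 1
        length |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return buf[pos:pos + length].decode("utf-8")


if __name__ == "__main__":
    print(encode_length_prefixed("abc"))  # length byte 0x03, then the payload
    print(encode_raw("abc"))              # payload only
```

A reader expecting one form and receiving the other would misinterpret the first payload byte as a length (or the length as payload), which is why the two SDKs cannot exchange strings until they agree on a single encoding.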
