Agreed that a coder URN defines the encoding. I see that string UTF-8 was
added to the proto enum, but it needs a written spec of the encoding,
ideally with some test data that different languages can use to drive
compliance testing.

Kenn

On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <rob...@frantil.com> wrote:

> String UTF8 was recently added as a "standard coder" URN in the protos,
> but I don't think that developed beyond Java, so adding it to Python would
> be reasonable in my opinion.
>
> The Go SDK presently handles strings as "custom coders", which for Go are
> always length prefixed (and reported to the Runner as LP+CustomCoder). It
> would be straightforward to add the correct handling for strings, as Go
> natively treats strings as UTF8.
>
>
> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <heej...@google.com> wrote:
>
>> Hi all,
>>
>> It looks like the UTF-8 string coders in the Java and Python SDKs use
>> different encoding schemes. StringUtf8Coder in the Java SDK puts the varint
>> length of the input string before the actual data bytes, whereas
>> StrUtf8Coder in the Python SDK encodes the input string directly to bytes
>> with no length prefix. For the last few weeks, I've been testing and fixing
>> cross-language IO transforms, and this discrepancy is a major blocker for
>> me. IMO, we should unify the encoding schemes of UTF8 strings across the
>> different SDKs and make it a standard coder. Any thoughts?
>>
>> Thanks,
>>
>
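
To make the discrepancy described in the thread concrete, here is a minimal
Python sketch of the two byte layouts. It is an illustration only, not the
code of either SDK: the function names are hypothetical, the varint is
assumed to be the usual base-128 little-endian form, and the length prefix
is assumed to count UTF-8 bytes rather than characters.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("varint sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # final byte
            return bytes(out)


def encode_length_prefixed(s: str) -> bytes:
    """Layout the thread attributes to the Java coder: varint length, then data."""
    data = s.encode("utf-8")
    # Assumption: the prefix is the number of UTF-8 bytes, not characters.
    return encode_varint(len(data)) + data


def encode_raw(s: str) -> bytes:
    """Layout the thread attributes to the Python coder: UTF-8 bytes only."""
    return s.encode("utf-8")


def decode_length_prefixed(buf: bytes) -> str:
    """Inverse of encode_length_prefixed, reading the varint length first."""
    length, shift, pos = 0, 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return buf[pos:pos + length].decode("utf-8")


if __name__ == "__main__":
    for text in ["abc", "héllo", ""]:
        print(repr(text), encode_length_prefixed(text), encode_raw(text))
```

A reader expecting one layout cannot decode the other: the first byte of the
raw form would be misread as a length. Pairs of input strings and expected
bytes like the ones printed above are the kind of shared test data that
could drive cross-language compliance checks.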
