Agreed that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding, and ideally some test data that different languages can use to drive compliance testing.
Kenn

On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <rob...@frantil.com> wrote:

> String UTF8 was recently added as a "standard coder" URN in the protos,
> but I don't think that developed beyond Java, so adding it to Python would
> be reasonable in my opinion.
>
> The Go SDK handles Strings as "custom coders" presently which for Go are
> always length prefixed (and reported to the Runner as LP+CustomCoder). It
> would be straight forward to add the correct handling for strings, as Go
> natively treats strings as UTF8.
>
>
> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <heej...@google.com> wrote:
>
>> Hi all,
>>
>> It looks like UTF-8 String Coder in Java and Python SDKs uses different
>> encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the
>> input string before actual data bytes however StrUtf8Coder in Python SDK
>> directly encodes the input string to bytes value. For the last few weeks,
>> I've been testing and fixing cross-language IO transforms and this
>> discrepancy is a major blocker for me. IMO, we should unify the encoding
>> schemes of UTF8 strings across the different SDKs and make it a standard
>> coder. Any thoughts?
>>
>> Thanks,
>>
>
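
To make the discrepancy in the original message concrete, here is a minimal Python sketch of the two byte layouts it describes: a varint length followed by the UTF-8 bytes, versus the bare UTF-8 bytes. This is an illustration only, not the code of either SDK. The function names are hypothetical, and the varint is assumed to be the common base-128 format (7 payload bits per byte, low-order group first, high bit set when more bytes follow), which the thread does not confirm.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (assumed format)."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)  # high bit clear: last byte
            return bytes(out)


def encode_length_prefixed(s: str) -> bytes:
    """Layout described for the Java coder: varint byte length, then data."""
    data = s.encode("utf-8")
    return encode_varint(len(data)) + data


def encode_raw(s: str) -> bytes:
    """Layout described for the Python coder: the UTF-8 bytes only."""
    return s.encode("utf-8")


def decode_length_prefixed(buf: bytes) -> str:
    """Read a varint length, then decode exactly that many UTF-8 bytes."""
    length, shift, pos = 0, 0, 0
    while True:
        b = buf[pos]
        pos += 1
        length |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    payload = buf[pos:pos + length]
    if len(payload) != length:
        raise ValueError("buffer shorter than declared length")
    return payload.decode("utf-8")


if __name__ == "__main__":
    text = "h\u00e9llo"  # 5 characters, 6 bytes in UTF-8
    print(encode_length_prefixed(text).hex())
    print(encode_raw(text).hex())
```

The two outputs differ by the leading length byte, so a reader expecting one layout misinterprets the other. Input/output pairs like these are also the kind of shared test data that a written spec could ship for compliance testing across SDKs.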