A URN defines the encoding. There are (unfortunately) *two* encodings defined for each Coder (identified by a URN): the nested one and the unnested one. IIRC, in both Java and Python, the nested one prefixes the data with a varint length, and the unnested one does not.
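To make the difference concrete, here is a minimal sketch in plain Python (no Beam imports). The function names are hypothetical and are not the SDK's API, and the varint is assumed to be the usual unsigned base-128 encoding; the actual SDK code may differ in details.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # final byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def encode_unnested(s: str) -> bytes:
    # Unnested (outer) context: just the UTF-8 bytes, no length prefix.
    return s.encode("utf-8")


def encode_nested(s: str) -> bytes:
    # Nested context: varint byte length, then the UTF-8 bytes.
    payload = s.encode("utf-8")
    return encode_varint(len(payload)) + payload


def decode_nested(data: bytes, pos: int = 0):
    """Return (string, position just past this element)."""
    length, pos = decode_varint(data, pos)
    end = pos + length
    return data[pos:end].decode("utf-8"), end


if __name__ == "__main__":
    print(encode_unnested("hello"))  # raw UTF-8 bytes
    print(encode_nested("hello"))    # length byte followed by the same bytes
```

The length prefix is what lets a reader find element boundaries when several values are concatenated; the unnested form depends on whatever frames it from the outside to say where the value ends.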
We should define the spec clearly and have cross-language tests.

On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pabl...@google.com> wrote:
>
> Could this be a backwards-incompatible change that would break pipelines from
> upgrading? If they have data in-flight in between operators, and we change
> the coder, they would break?
> I know very little about coders, but since nobody has mentioned it, I wanted
> to make sure we have it in mind.
> -P.
>
> On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>> Agree that a coder URN defines the encoding. I see that string UTF-8 was
>> added to the proto enum, but it needs a written spec of the encoding.
>> Ideally some test data that different languages can use to drive compliance
>> testing.
>>
>> Kenn
>>
>> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <rob...@frantil.com> wrote:
>>>
>>> String UTF8 was recently added as a "standard coder " URN in the protos,
>>> but I don't think that developed beyond Java, so adding it to Python would
>>> be reasonable in my opinion.
>>>
>>> The Go SDK handles Strings as "custom coders" presently which for Go are
>>> always length prefixed (and reported to the Runner as LP+CustomCoder). It
>>> would be straight forward to add the correct handling for strings, as Go
>>> natively treats strings as UTF8.
>>>
>>>
>>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <heej...@google.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> It looks like UTF-8 String Coder in Java and Python SDKs uses different
>>>> encoding schemes. StringUtf8Coder in Java SDK puts the varint length of
>>>> the input string before actual data bytes however StrUtf8Coder in Python
>>>> SDK directly encodes the input string to bytes value. For the last few
>>>> weeks, I've been testing and fixing cross-language IO transforms and this
>>>> discrepancy is a major blocker for me. IMO, we should unify the encoding
>>>> schemes of UTF8 strings across the different SDKs and make it a standard
>>>> coder. Any thoughts?
>>>>
>>>> Thanks,