On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <chamik...@google.com> wrote:
> > > On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <rober...@google.com> > wrote: > >> A URN defines the encoding. >> >> There are (unfortunately) *two* encodings defined for a Coder (defined >> by a URN), the nested and the unnested one. IIRC, in both Java and >> Python, the nested one prefixes with a var-int length, and the >> unnested one does not. >> > > Could you clarify where we define the exact encoding ? I only see a URN > for UTF-8 [1] while if you look at the implementations Java includes length > in the encoding [1] while Python [1] does not. > > [1] > https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563 > [2] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50 > [3] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321 > > > >> >> We should define the spec clearly and have cross-language tests. >> > > +1 > > Regarding backwards compatibility, I agree that we should probably not > update existing coder classes. Probably we should just standardize the > correct encoding (may be as a comment near corresponding URN in the > beam_runner_api.proto ?) and create new coder classes as needed. > Then how do we pair the type and the coder? For Java we can explicitly assign a specific coder to PCollection but for python a coder is inferred from an element type of PCollection. If we create another standard coder for utf-8 string, would that new coder be a default for the string element type? > > >> >> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <pabl...@google.com> wrote: >> > >> > Could this be a backwards-incompatible change that would break >> pipelines from upgrading? If they have data in-flight in between operators, >> and we change the coder, they would break? >> > I know very little about coders, but since nobody has mentioned it, I >> wanted to make sure we have it in mind. >> > -P. >> > >> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <k...@apache.org> wrote: >> >> >> >> Agree that a coder URN defines the encoding. I see that string UTF-8 >> was added to the proto enum, but it needs a written spec of the encoding. >> Ideally some test data that different languages can use to drive compliance >> testing. >> >> >> >> Kenn >> >> >> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <rob...@frantil.com> >> wrote: >> >>> >> >>> String UTF8 was recently added as a "standard coder " URN in the >> protos, but I don't think that developed beyond Java, so adding it to >> Python would be reasonable in my opinion. >> >>> >> >>> The Go SDK handles Strings as "custom coders" presently which for Go >> are always length prefixed (and reported to the Runner as LP+CustomCoder). >> It would be straight forward to add the correct handling for strings, as Go >> natively treats strings as UTF8. >> >>> >> >>> >> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <heej...@google.com> wrote: >> >>>> >> >>>> Hi all, >> >>>> >> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses >> different encoding schemes. StringUtf8Coder in Java SDK puts the varint >> length of the input string before actual data bytes however StrUtf8Coder in >> Python SDK directly encodes the input string to bytes value. For the last >> few weeks, I've been testing and fixing cross-language IO transforms and >> this discrepancy is a major blocker for me. IMO, we should unify the >> encoding schemes of UTF8 strings across the different SDKs and make it a >> standard coder. Any thoughts? >> >>>> >> >>>> Thanks, >> >