Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Heejong Lee Thu, 04 Apr 2019 15:50:34 -0700

Robert, does nested/unnested context work properly for Java? I can see that
the Context is fixed to NESTED[1] and the encode method with the Context
parameter is marked as deprecated[2].


[1]:
https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
[2]:
https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132

On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <[email protected]> wrote:

> I don't know why there are two separate copies of
> standard_coders.yaml--originally there was just one (though it did
> live in the Python directory). I'm guessing a copy was made rather
> than just pointing both to the new location, but that completely
> defeats the point. I can't seem to access JIRA right now; could
> someone file an issue to resolve this?
>
> I also think the spec should be next to the definition of the URN,
> that's one of the reason the URNs were originally in a markdown file
> (to encourage good documentation, literate programming style). Many
> coders already have their specs there.
>
> Regarding backwards compatibility, we can't change existing coders,
> and making new coders won't help with inference ('cause changing that
> would also be backwards incompatible). Fortunately, I think we're
> already doing the consistent thing here: In both Python and Java the
> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
> and the length-prefixed UTF-8 encoded bytes are used when the coder is
> used in a *nested* context.
>
> I'd really like to see the whole nested/unnested context go away, but
> that'll probably require Beam 3.0; it causes way more confusion than
> the couple of bytes it saves in a couple of places.
>
> - Robert
>
> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <[email protected]> wrote:
> >
> > My 2cents is that the "Textual description" should be part of the
> documentation of the URNs on the Proto messages, since that's the common
> place. I've added a short description for the varints for example, and we
> already have lenghthier format & protocol descriptions there for iterables
> and similar.
> >
> > The proto [1] *can be* the spec if we want it to be.
> >
> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >
> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <[email protected]> wrote:
> >>
> >>
> >>
> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <[email protected]> wrote:
> >>>
> >>> We should probably move the "java" version of the yaml file [1] to a
> common location rather than deep in the java hierarchy, or copying it for
> Go and Python, but that can be a separate task. It's probably non-trivial
> since it looks like it's part of a java resources structure.
> >>
> >>
> >> Seems like /model is a good place for this if we don't want to invent a
> new language-independent hierarchy.
> >>
> >> Kenn
> >>
> >>
> >>>
> >>> Luke, the Go SDK doesn't currently do this validation, but it
> shouldn't be difficult, given pointers to the Java and Python variants of
> the tests to crib from [2]. Care would need to be taken so that Beam Go SDK
> users (such as they are) aren't forced to run them, and not have the yaml
> file to read. I'd suggest putting it with the integration tests [3].
> >>>
> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
> >>>
> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >>>
> >>>
> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> [email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <[email protected]>
> wrote:
> >>>>>>
> >>>>>> standard_coders.yaml[1] is where we are currently defining these
> formats.
> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >>>>>
> >>>>>
> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate
> copy for Python ? I didn't see a significant difference in definitions
> looking at few random coders there but I might have missed something. If
> there's no reason to maintain two, we should probably unify.
> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
> >>>>>
> >>>>
> >>>> Not certain as well. I did notice the timer coder definition didn't
> exist in the Python copy.
> >>>>
> >>>>>>
> >>>>>>
> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as
> tests to the Java and Python SDKs to ensure interoperability.
> >>>>>>
> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >>>>>>
> >>>>>> Chamikara, creating new coder classes is a pain since the type ->
> coder mapping per SDK language would select the non-well known type if we
> added a new one to a language. If we swapped the default type->coder
> mapping, this would still break update for pipelines forcing users to
> update their code to select the non-well known type. If we don't change the
> default type->coder mapping, the well known coder will gain little usage. I
> think we should fix the Python coder to use the same encoding as Java for
> UTF-8 strings before there are too many Python SDK users.
> >>>>>
> >>>>>
> >>>>> I was thinking that may be we should just change the default UTF-8
> coder for Fn API path which is experimental. Updating Python to do what's
> done for Java is fine if we agree that encoding used for Java should be the
> standard.
> >>>>>
> >>>>
> >>>> That is a good idea to use the Fn API experiment to control which
> gets selected.
> >>>>
> >>>>>>
> >>>>>>
> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >>>>>>
> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> [email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>> A URN defines the encoding.
> >>>>>>>>
> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder
> (defined
> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
> >>>>>>>> unnested one does not.
> >>>>>>>
> >>>>>>>
> >>>>>>> Could you clarify where we define the exact encoding ? I only see
> a URN for UTF-8 [1] while if you look at the implementations Java includes
> length in the encoding [1] while Python [1] does not.
> >>>>>>>
> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> We should define the spec clearly and have cross-language tests.
> >>>>>>>
> >>>>>>>
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Regarding backwards compatibility, I agree that we should probably
> not update existing coder classes. Probably we should just standardize the
> correct encoding (may be as a comment near corresponding URN in the
> beam_runner_api.proto ?) and create new coder classes as needed.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <[email protected]>
> wrote:
> >>>>>>>> >
> >>>>>>>> > Could this be a backwards-incompatible change that would break
> pipelines from upgrading? If they have data in-flight in between operators,
> and we change the coder, they would break?
> >>>>>>>> > I know very little about coders, but since nobody has mentioned
> it, I wanted to make sure we have it in mind.
> >>>>>>>> > -P.
> >>>>>>>> >
> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <[email protected]>
> wrote:
> >>>>>>>> >>
> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string
> UTF-8 was added to the proto enum, but it needs a written spec of the
> encoding. Ideally some test data that different languages can use to drive
> compliance testing.
> >>>>>>>> >>
> >>>>>>>> >> Kenn
> >>>>>>>> >>
> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> [email protected]> wrote:
> >>>>>>>> >>>
> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in
> the protos, but I don't think that developed beyond Java, so adding it to
> Python would be reasonable in my opinion.
> >>>>>>>> >>>
> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which
> for Go are always length prefixed (and reported to the Runner as
> LP+CustomCoder). It would be straight forward to add the correct handling
> for strings, as Go natively treats strings as UTF8.
> >>>>>>>> >>>
> >>>>>>>> >>>
> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <[email protected]>
> wrote:
> >>>>>>>> >>>>
> >>>>>>>> >>>> Hi all,
> >>>>>>>> >>>>
> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs
> uses different encoding schemes. StringUtf8Coder in Java SDK puts the
> varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >>>>>>>> >>>>
> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Reply via email to