On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles <k...@apache.org> wrote:
>
> On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <k...@apache.org> wrote:
>> >
>> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <rober...@google.com> wrote:
>> >>
>> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <k...@apache.org> wrote:
>> >> >
>> >> > Nested and unnested contexts are two different encodings. Can we just 
>> >> > give them different URNs? We can even just express the length-prefixed 
>> >> > UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
>> >>
>> >> It's not that simple, especially when it comes to composite encodings.
>> >> E.g. for some coders, nested(C) == unnested(C), for some coders
>> >> nested(C) == length_prefix(unnested(C)), and for other coders it's
>> >> something else altogether (e.g. when creating a kv coder, the first
>> >> component must use nested context, and the second inherits the nested
>> >> vs. unnested context). When creating TupleCoder(A, B) one doesn't want
>> >> to forcibly use LengthPrefixCoder(A) and LengthPrefixCoder(B), nor does
>> >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
>> >> may themselves be large and incrementally written (e.g.
>> >> IterableCoder). Using distinct URNs doesn't work well if the runner is
>> >> free to compose and decompose tuple, iterable, etc. coders that it
>> >> doesn't understand.
>> >>
>> >> Until we stop using Coders for IO (a worthy but probably lofty goal)
>> >> we will continue to need the unnested context (lest we expect and
>> >> produce length-prefixed coders in text files, as bigtable keys, etc.).
>> >> On the other hand, almost all internal use is nested (due to sending
>> >> elements around as part of element streams). The other place we cross
>> >> over is LengthPrefixCoder that encodes its values using the unnested
>> >> context prefixed by the unnested encoding length.
>> >>
>> >> Perhaps a step in the right direction would be to consistently use the
>> >> unnested context everywhere but IOs (meaning when we talked about
>> >> coders from the FnAPI perspective, they're *always* in the nested
>> >> context, and hence always have the one and only encoding defined by
>> >> that URN, including when wrapped by a length prefix coder (which would
>> >> sometimes result in double length prefixing, but I think that's a
>> >> price worth paying (or we could do something more clever like make
>> >> length-prefix an (explicit) modifier on a coder rather than a new
>> >> coder itself that would default to length prefixing (or some of the
>> >> other delimiting schemes we've come up with) but allow the component
>> >> coder to offer alternative length-prefix-compatible encodings))). IOs
>> >> could be updated to take Parsers and Formatters (or whatever we call
>> >> them) with the Coders in the constructors left as syntactic sugar
>> >> until we could remove them in 3.0. As Luke says, we have a chance to
>> >> fix our coders for portable pipelines now.
>> >>
>> >> In the (very) short term, we're stuck with a nested and unnested
>> >> version of StringUtf8, just as we have for bytes, lest we change the
>> >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
>> >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
>> >
>> >
>> > First, let's note that "nested" and "outer" are a misnomer. The 
>> > distinction is whether it is the last thing encoded in the stream. In a 
>> > KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the 
>> > "outer" context though the value is nested. No doubt a good amount of 
>> > confusion comes from the initial and continued use of this terminology.
>>
>> +1. I think of these as "self-delimiting" vs. "externally-delimited."
>>
>> > So, all that said, it is a simple fact that UTF-8 and length-prefixed 
>> > UTF-8 are two different encodings. Encodings are the fundamental concept 
>> > here and coders encapsulate two encodings, with some subtle and 
>> > inconsistently-applied rules about when to use which encoding. I think we 
>> > should still give them distinct URNs unless impossible. You've outlined 
>> > some steps to clarify the situation.
>>
>> Currently, we have a one-to-two relationship between Coders and
>> encodings, and a one-to-one relationship between URNs and Coders.
>>
>> To make the first one-to-one, we would either have to make
>> StringUtf8Coder unsuitable for TextIO (letting it always prefix its
>> contents with a length) or unsuitable for the key part of a KV, the
>> element of an iterable, etc. (where the length is required).
>
> Definitely in favor of dropping TextIO from the conversation. Just need to
> think about compatibility, but I would save that for after the idealized end
> goal is defined, and hack it somehow. As to the one-to-one vs one-to-two, can
> we have:
>
>  - SDK coder -> proto: be smart about knowing what the context is and use
> the appropriate URN (this is when building the graph)

So you'd have to say "give me the outer version of yourself" or "give
me the inner version of yourself" which would (sometimes) get passed
down to components. And this extra bit would also have to be set in
the object -> proto caches. It's the same nested/outer logic, just put
in a different place.

The context here is almost always "inner" because even if it's the
PCollection element's coder, it must be encoded in a self-delimiting
way as per the (current) data-plane protocol. On the other hand, if
you have a PCollection<KV> that's later used as a map side input, the
K appears alone (potentially outer) during lookups. Of course one can
always use inner and not care about the redundancy.
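To make the inner/outer distinction concrete, here is a minimal sketch
(illustrative only, not the actual Beam API) of the two encodings a
single UTF-8 coder currently has to support:

```python
def encode_varint(n: int) -> bytes:
    """Unsigned base-128 varint, as used for Beam's length prefixes."""
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)
        else:
            out.append(bits)
            return bytes(out)


def encode_utf8(value: str, nested: bool) -> bytes:
    data = value.encode("utf-8")
    if nested:
        # Inner (self-delimiting) context: a reader of a concatenated
        # element stream needs to know where this value ends.
        return encode_varint(len(data)) + data
    # Outer context: raw bytes, delimited externally (end of file, a
    # record boundary in TextIO, a Bigtable key slot, ...).
    return data
```

So `encode_utf8("hi", nested=True)` yields `b"\x02hi"` while the outer
form is just `b"hi"`, and neither can be mechanically recovered from
the other without knowing the context.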

>  - proto -> SDK: instantiate a variant of the coder that is fixed to one 
> context (this is when you get a process bundle instruction from the runner)
>
> I think this is more-or-less the idea you mean by...
>
>> Alternatively we could give Coders the ability to return the
>> nested/unnested version of themselves, but this also gets messy
>> because it depends on the ultimate outer context which we don't always
>> have at hand (and leads to surprises, e.g. asking for the key coder of
>> a KV coder may not return the same coder that was used to construct
>> it).

I was thinking of doing this at graph construction time, but doing it
at proto translation time could be better (though it complicates that
step).

> ... and I don't think it would be too messy, because fixing the context would 
> happen around submission time, yes? Basically, it keeps "Java Coder" as a 
> vestigial concept that compiles away, much like PValue/PInput/POutput.

Unfortunately nested vs. unnested is already a model-level concept
(e.g. Python does the same, and has been changed to match java
exactly). Not that we couldn't change this for FnAPI execution.

>> On the other hand, breaking the one-to-one relationship of Coders and
>> URNs is also undesirable, because it forces a choice when serializing
>> to protos (where one may not always know all the contexts it's used
>> in)
>
> When would you not know the context? I posit that toProto should always know 
> the context.
>
>> and makes the round-trip through protos non-idempotent (if the URN
>> gets decoded back into a Coder that does not have this dual-defined
>> encoding).
>
> This doesn't seem like such a problem. Of course, I would say that, as I am 
> proposing to do just this :-)

I think we'll have even more of this (happening even at pipeline
construction time, e.g. for cross-language pipelines).

>> Also, if a runner knows about special pairs of URNs that
>> represent the nested/unnested encodings of the same type, and rules
>> like knowing constraints on components of KVs, etc., this seems an
>> even worse situation than we're already in.
>
> Perhaps even worse than this would be to enshrine the idea of 
> self-delimiting/non-self-delimiting coders as a flag in the proto, so runners 
> know to flip it. That is really bad, making the model inherit accidental 
> complexity from the Java SDK. But is this flag always just a length prefix, 
> in practice? If so, then the smarts to add to toProto would cause the 
> self-delimiting variant of UTF-8 coder to be a composite LengthPrefix(UTF-8) 
> coder. Then the model doesn't change and the runner does not need to be aware 
> of special pairs of URNs.

No, most of the "basic" coders (e.g. ints, doubles, iterables, ...)
are already self-delimiting.
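For instance (a sketch, assuming the fixed-width big-endian encoding
the standard double coder uses), a fixed-width value delimits itself by
its width alone, so its nested and unnested encodings coincide and
there is no prefix for a runner to "flip":

```python
import struct


def encode_double(x: float) -> bytes:
    # Fixed width (always 8 bytes): self-delimiting in any context, so
    # the nested and unnested encodings are identical.
    return struct.pack(">d", x)


def decode_doubles(buf: bytes):
    # A reader can split a concatenated stream with no length prefix.
    return [struct.unpack(">d", buf[i:i + 8])[0]
            for i in range(0, len(buf), 8)]
```

A flag meaning "add a length prefix when self-delimiting is needed"
would be a no-op for coders like this, which is part of why it fits the
model poorly.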

>> > I think the most meaningful issue is runners manipulating coders. But the 
>> > runner should be able to instruct the SDK (where the coder actually 
>> > executes) about which encoding to use. I need to think through some 
>> > examples of where the SDK tells the runner the encoding of something and 
>> > those where the runner fabricates steps and instructs the SDK what 
>> > encoding to use for elements. GroupByKey, GetKeys, etc, are examples where 
>> > the context will change.
>>
>> Combiner lifting and side inputs are good candidates as well.
>
>
> GBK example:
>
>  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
>  - Output: KvCoder[Outer]<KeyCoder[Inner], 
> IterableCoder[Outer]<ValueCoder[Inner]>>
>
> In this case, it is the SDK providing both, so it can easily make the 
> ValueCoder[Inner] a LengthPrefix(ValueCoder[Outer]). If the ValueCoder is 
> already length-prefixed there is a cost. That could be detected and elided I 
> think; might need additional methods but not if the ValueCoder toProto 
> already encoded in a readable way.
>
> GBKViaGBKO example, where the runner chooses a particular implementation 
> strategy. Assuming the client language is not Java so inner/outer switching 
> cannot be done unless explicitly represented in the proto.
>
>  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
>  - Intermediate: KvCoder[Outer]<KeyCoder[Inner], 
> IterableCoder[Outer]<WindowedValueCoder[Outer]<ValueCoder[Inner]>>>
>
> Since the input ValueCoder is always Outer, it is easy to add a length-prefix 
> without knowing what it is. So from these examples, I hypothesize that the 
> hard case is when there is a user coder in Inner encoding and the runner 
> wants to turn it into Outer encoding, but the Inner encoding does not 
> explicitly indicate that it is a LengthPrefix(Outer) encoding. I believe that 
> this is not a problem - the runner can just leave it as the Inner encoding 
> since every Inner encoding is a safe Outer encoding (but not vice-versa). I 
> don't know of a real example, but if there were a runner-inserted GetKeys 
> then the KeyCoder[Inner] might *want* to change to KeyCoder[Outer] but that 
> wouldn't be a requirement.
>
> For side inputs, PCollection<ValueCoder[Outer]> the runner may want to put a 
> bunch of elements in some data structure that requires self-delimiting. Same 
> logic, it is always easy to move from Outer to Inner.

Moving an Iterable from Outer to Inner by adding a length prefix is
really expensive, and useless besides, as the Outer encoding is
already length-delimited. Moving a KV coder from Outer to Inner
involves changing the second component. I don't think we want to bake
special rules like this into the model.
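A sketch of the cost (helper names are illustrative, and the iterable
format is simplified from the real protocol): an outer length prefix
needs the total payload size before the first byte can be written, so
prefixing an already-delimited iterable forces the whole, possibly
incrementally produced, payload to be buffered, and the prefix itself
is redundant:

```python
def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)
        else:
            out.append(bits)
            return bytes(out)


def encode_iterable(elements, encode_elem):
    # A count followed by the elements (simplified): already
    # self-delimiting, and in the real protocol the elements can be
    # streamed out incrementally in batches.
    out = bytearray(encode_varint(len(elements)))
    for e in elements:
        out += encode_elem(e)
    return bytes(out)


def length_prefix(payload: bytes) -> bytes:
    # Wrapping forces full materialization of `payload` first, and the
    # prefix duplicates the delimiting the iterable already provides.
    return encode_varint(len(payload)) + payload
```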


This email is already very long, but in summary I think the right
answer is to just get rid of Outer altogether (except possibly for
IOs, which we'd only preserve for legacy reasons until 3.0). The
question remains whether we should allow a Coder C (as part of its
spec, known to all that know the URN of C) to declare an alternative
coder C' for LengthPrefix(C) that is different from e ->
varLen(size(encode(C, e))) + encode(C, e), possibly (often?) with
C' == C, to avoid double length-prefixing.
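A sketch of that last option (class and method names are hypothetical,
not an existing Beam API): each coder's spec could carry a
length-prefixed variant C', defaulting to the naive varint-of-size
wrapping, with already-self-delimiting coders declaring C' == C:

```python
def _varint(n: int) -> bytes:
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)
        else:
            out.append(bits)
            return bytes(out)


class Coder:
    def encode(self, value) -> bytes:
        raise NotImplementedError

    def length_prefixed(self) -> "Coder":
        # Default C': e -> varLen(size(encode(C, e))) + encode(C, e).
        return NaiveLengthPrefix(self)


class NaiveLengthPrefix(Coder):
    def __init__(self, component: Coder):
        self.component = component

    def encode(self, value) -> bytes:
        data = self.component.encode(value)
        return _varint(len(data)) + data


class StringUtf8(Coder):
    def encode(self, value: str) -> bytes:
        data = value.encode("utf-8")
        return _varint(len(data)) + data

    def length_prefixed(self) -> "Coder":
        # Already self-delimiting: declaring C' == C avoids the
        # redundant double prefix.
        return self
```

Anyone who knows the URN of C would also know its declared C', so
runners could still compose and decompose coders freely.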

- Robert


>> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <rober...@google.com> 
>> >> > wrote:
>> >> >>
>> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <heej...@google.com> wrote:
>> >> >> >
>> >> >> > Robert, does nested/unnested context work properly for Java?
>> >> >>
>> >> >> I believe so. It is similar to the bytes coder, which prefixes or
>> >> >> not based on the context.
>> >> >>
>> >> >> > I can see that the Context is fixed to NESTED[1] and the encode 
>> >> >> > method with the Context parameter is marked as deprecated[2].
>> >> >> >
>> >> >> > [1]: 
>> >> >> > https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
>> >> >> > [2]: 
>> >> >> > https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
>> >> >>
>> >> >> That doesn't mean it's unused, e.g.
>> >> >>
>> >> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
>> >> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
>> >> >>
>> >> >> (and I'm sure there's others).
>> >> >>
>> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <rober...@google.com> 
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> I don't know why there are two separate copies of
>> >> >> >> standard_coders.yaml--originally there was just one (though it did
>> >> >> >> live in the Python directory). I'm guessing a copy was made rather
>> >> >> >> than just pointing both to the new location, but that completely
>> >> >> >> defeats the point. I can't seem to access JIRA right now; could
>> >> >> >> someone file an issue to resolve this?
>> >> >> >>
>> >> >> >> I also think the spec should be next to the definition of the URN,
>> >> >> >> that's one of the reasons the URNs were originally in a markdown file
>> >> >> >> (to encourage good documentation, literate programming style). Many
>> >> >> >> coders already have their specs there.
>> >> >> >>
>> >> >> >> Regarding backwards compatibility, we can't change existing coders,
>> >> >> >> and making new coders won't help with inference ('cause changing 
>> >> >> >> that
>> >> >> >> would also be backwards incompatible). Fortunately, I think we're
>> >> >> >> already doing the consistent thing here: In both Python and Java the
>> >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* 
>> >> >> >> context
>> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the coder 
>> >> >> >> is
>> >> >> >> used in a *nested* context.
>> >> >> >>
>> >> >> >> I'd really like to see the whole nested/unnested context go away, 
>> >> >> >> but
>> >> >> >> that'll probably require Beam 3.0; it causes way more confusion than
>> >> >> >> the couple of bytes it saves in a couple of places.
>> >> >> >>
>> >> >> >> - Robert
>> >> >> >>
>> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <rob...@frantil.com> 
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> > My 2cents is that the "Textual description" should be part of the 
>> >> >> >> > documentation of the URNs on the Proto messages, since that's the 
>> >> >> >> > common place. I've added a short description for the varints for 
>> >> >> >> > example, and we already have lengthier format & protocol 
>> >> >> >> > descriptions there for iterables and similar.
>> >> >> >> >
>> >> >> >> > The proto [1] *can be* the spec if we want it to be.
>> >> >> >> >
>> >> >> >> > [1]: 
>> >> >> >> > https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >> >> >> >
>> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <k...@apache.org> 
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <rob...@frantil.com> 
>> >> >> >> >> wrote:
>> >> >> >> >>>
>> >> >> >> >>> We should probably move the "java" version of the yaml file [1] 
>> >> >> >> >>> to a common location rather than deep in the java hierarchy, or 
>> >> >> >> >>> copying it for Go and Python, but that can be a separate task. 
>> >> >> >> >>> It's probably non-trivial since it looks like it's part of a 
>> >> >> >> >>> java resources structure.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Seems like /model is a good place for this if we don't want to 
>> >> >> >> >> invent a new language-independent hierarchy.
>> >> >> >> >>
>> >> >> >> >> Kenn
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but it 
>> >> >> >> >>> shouldn't be difficult, given pointers to the Java and Python 
>> >> >> >> >>> variants of the tests to crib from [2]. Care would need to be 
>> >> >> >> >>> taken so that Beam Go SDK users (such as they are) aren't 
>> >> >> >> >>> forced to run them, and not have the yaml file to read. I'd 
>> >> >> >> >>> suggest putting it with the integration tests [3].
>> >> >> >> >>>
>> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side 
>> >> >> >> >>> work. [4]
>> >> >> >> >>>
>> >> >> >> >>> 1: 
>> >> >> >> >>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >> >>> 2: 
>> >> >> >> >>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> 
>> >> >> >> >>> wrote:
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>>
>> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath 
>> >> >> >> >>>> <chamik...@google.com> wrote:
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik 
>> >> >> >> >>>>> <lc...@google.com> wrote:
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently defining 
>> >> >> >> >>>>>> these formats.
>> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's a 
>> >> >> >> >>>>> separate copy for Python ? I didn't see a significant 
>> >> >> >> >>>>> difference in definitions looking at few random coders there 
>> >> >> >> >>>>> but I might have missed something. If there's no reason to 
>> >> >> >> >>>>> maintain two, we should probably unify.
>> >> >> >> >>>>> Also, seems like we haven't added the definition for UTF-8 
>> >> >> >> >>>>> coder yet.
>> >> >> >> >>>>>
>> >> >> >> >>>>
>> >> >> >> >>>> Not certain as well. I did notice the timer coder definition 
>> >> >> >> >>>> didn't exist in the Python copy.
>> >> >> >> >>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Here is an example PR[3] that adds the 
>> >> >> >> >>>>>> "beam:coder:double:v1" as tests to the Java and Python SDKs 
>> >> >> >> >>>>>> to ensure interoperability.
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses 
>> >> >> >> >>>>>> standard_coders.yaml and runs compatibility tests?
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since the 
>> >> >> >> >>>>>> type -> coder mapping per SDK language would select the 
>> >> >> >> >>>>>> non-well known type if we added a new one to a language. If 
>> >> >> >> >>>>>> we swapped the default type->coder mapping, this would still 
>> >> >> >> >>>>>> break update for pipelines forcing users to update their 
>> >> >> >> >>>>>> code to select the non-well known type. If we don't change 
>> >> >> >> >>>>>> the default type->coder mapping, the well known coder will 
>> >> >> >> >>>>>> gain little usage. I think we should fix the Python coder to 
>> >> >> >> >>>>>> use the same encoding as Java for UTF-8 strings before there 
>> >> >> >> >>>>>> are too many Python SDK users.
>> >> >> >> >>>>>
>> >> >> >> >>>>>
>> >> >> >> >>>>> I was thinking that may be we should just change the default 
>> >> >> >> >>>>> UTF-8 coder for Fn API path which is experimental. Updating 
>> >> >> >> >>>>> Python to do what's done for Java is fine if we agree that 
>> >> >> >> >>>>> encoding used for Java should be the standard.
>> >> >> >> >>>>>
>> >> >> >> >>>>
>> >> >> >> >>>> That is a good idea to use the Fn API experiment to control 
>> >> >> >> >>>> which gets selected.
>> >> >> >> >>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> 1: 
>> >> >> >> >>>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >> >> >> >>>>>> 2: 
>> >> >> >> >>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath 
>> >> >> >> >>>>>> <chamik...@google.com> wrote:
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw 
>> >> >> >> >>>>>>> <rober...@google.com> wrote:
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> A URN defines the encoding.
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a 
>> >> >> >> >>>>>>>> Coder (defined
>> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both 
>> >> >> >> >>>>>>>> Java and
>> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and 
>> >> >> >> >>>>>>>> the
>> >> >> >> >>>>>>>> unnested one does not.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Could you clarify where we define the exact encoding ? I 
>> >> >> >> >>>>>>> only see a URN for UTF-8 [1] while if you look at the 
>> >> >> >> >>>>>>> implementations Java includes length in the encoding [1] 
>> >> >> >> >>>>>>> while Python [1] does not.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> [1] 
>> >> >> >> >>>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >> >> >> >>>>>>> [2] 
>> >> >> >> >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >> >> >> >>>>>>> [3] 
>> >> >> >> >>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> We should define the spec clearly and have cross-language 
>> >> >> >> >>>>>>>> tests.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> +1
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we should 
>> >> >> >> >>>>>>> probably not update existing coder classes. Probably we 
>> >> >> >> >>>>>>> should just standardize the correct encoding (may be as a 
>> >> >> >> >>>>>>> comment near corresponding URN in the beam_runner_api.proto 
>> >> >> >> >>>>>>> ?) and create new coder classes as needed.
>> >> >> >> >>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>>
>> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada 
>> >> >> >> >>>>>>>> <pabl...@google.com> wrote:
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that would 
>> >> >> >> >>>>>>>> > break pipelines from upgrading? If they have data 
>> >> >> >> >>>>>>>> > in-flight in between operators, and we change the coder, 
>> >> >> >> >>>>>>>> > they would break?
>> >> >> >> >>>>>>>> > I know very little about coders, but since nobody has 
>> >> >> >> >>>>>>>> > mentioned it, I wanted to make sure we have it in mind.
>> >> >> >> >>>>>>>> > -P.
>> >> >> >> >>>>>>>> >
>> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles 
>> >> >> >> >>>>>>>> > <k...@apache.org> wrote:
>> >> >> >> >>>>>>>> >>
>> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that 
>> >> >> >> >>>>>>>> >> string UTF-8 was added to the proto enum, but it needs 
>> >> >> >> >>>>>>>> >> a written spec of the encoding. Ideally some test data 
>> >> >> >> >>>>>>>> >> that different languages can use to drive compliance 
>> >> >> >> >>>>>>>> >> testing.
>> >> >> >> >>>>>>>> >>
>> >> >> >> >>>>>>>> >> Kenn
>> >> >> >> >>>>>>>> >>
>> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke 
>> >> >> >> >>>>>>>> >> <rob...@frantil.com> wrote:
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder" 
>> >> >> >> >>>>>>>> >>> URN in the protos, but I don't think that developed 
>> >> >> >> >>>>>>>> >>> beyond Java, so adding it to Python would be 
>> >> >> >> >>>>>>>> >>> reasonable in my opinion.
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" 
>> >> >> >> >>>>>>>> >>> presently which for Go are always length prefixed (and 
>> >> >> >> >>>>>>>> >>> reported to the Runner as LP+CustomCoder). It would be 
>> >> >> >> >>>>>>>> >>> straightforward to add the correct handling for 
>> >> >> >> >>>>>>>> >>> strings, as Go natively treats strings as UTF8.
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>>
>> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee 
>> >> >> >> >>>>>>>> >>> <heej...@google.com> wrote:
>> >> >> >> >>>>>>>> >>>>
>> >> >> >> >>>>>>>> >>>> Hi all,
>> >> >> >> >>>>>>>> >>>>
>> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python 
>> >> >> >> >>>>>>>> >>>> SDKs uses different encoding schemes. StringUtf8Coder 
>> >> >> >> >>>>>>>> >>>> in Java SDK puts the varint length of the input 
>> >> >> >> >>>>>>>> >>>> string before actual data bytes however StrUtf8Coder 
>> >> >> >> >>>>>>>> >>>> in Python SDK directly encodes the input string to 
>> >> >> >> >>>>>>>> >>>> bytes value. For the last few weeks, I've been 
>> >> >> >> >>>>>>>> >>>> testing and fixing cross-language IO transforms and 
>> >> >> >> >>>>>>>> >>>> this discrepancy is a major blocker for me. IMO, we 
>> >> >> >> >>>>>>>> >>>> should unify the encoding schemes of UTF8 strings 
>> >> >> >> >>>>>>>> >>>> across the different SDKs and make it a standard 
>> >> >> >> >>>>>>>> >>>> coder. Any thoughts?
>> >> >> >> >>>>>>>> >>>>
>> >> >> >> >>>>>>>> >>>> Thanks,
