Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Heejong Lee Thu, 04 Apr 2019 12:09:00 -0700

On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <[email protected]>
wrote:


>
>
> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> A URN defines the encoding.
>>
>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> Python, the nested one prefixes with a var-int length, and the
>> unnested one does not.
>>
>
> Could you clarify where we define the exact encoding? I only see a URN
> for UTF-8 [1], while if you look at the implementations, Java includes the
> length in the encoding [2] while Python [3] does not.
>
> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>
>
>
>>
>> We should define the spec clearly and have cross-language tests.
>>
>
> +1
>
> Regarding backwards compatibility, I agree that we should probably not
> update existing coder classes. Probably we should just standardize the
> correct encoding (maybe as a comment near the corresponding URN in
> beam_runner_api.proto?) and create new coder classes as needed.
>

Then how do we pair the type and the coder? For Java we can explicitly
assign a specific coder to a PCollection, but for Python a coder is inferred
from the element type of the PCollection. If we create another standard coder
for UTF-8 strings, would that new coder be the default for the string element
type?
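
To make the question concrete, here is a minimal sketch (plain Python, no
Beam imports) of the two encodings described in this thread and of why the
choice of default matters. The class names, DEFAULT_CODERS, and infer_coder
are hypothetical stand-ins, not Beam APIs, and the varint helper assumes the
usual base-128 scheme (7 bits per byte, least-significant group first).

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (assumed scheme)."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_pos) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


class RawUtf8Coder:
    """Hypothetical: bare UTF-8 bytes, as the Python coder is described."""

    def encode(self, value: str) -> bytes:
        return value.encode("utf-8")

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8")


class LengthPrefixedUtf8Coder:
    """Hypothetical: varint byte length, then UTF-8 bytes, as Java is described."""

    def encode(self, value: str) -> bytes:
        payload = value.encode("utf-8")
        return encode_varint(len(payload)) + payload

    def decode(self, data: bytes) -> str:
        length, pos = decode_varint(data)
        return data[pos:pos + length].decode("utf-8")


# Hypothetical type -> default coder table, standing in for type inference.
DEFAULT_CODERS = {str: RawUtf8Coder()}


def infer_coder(element_type):
    """Look up the default coder for an element type (illustrative only)."""
    return DEFAULT_CODERS[element_type]
```

Swapping the entry for str in the table changes the bytes produced for every
string PCollection that relies on inference, which is the compatibility
concern raised in this thread.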


>
>
>>
>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <[email protected]> wrote:
>> >
>> > Could this be a backwards-incompatible change that would break
>> pipelines when upgrading? If they have data in flight between operators
>> and we change the coder, would they break?
>> > I know very little about coders, but since nobody has mentioned it, I
>> wanted to make sure we have it in mind.
>> > -P.
>> >
>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <[email protected]> wrote:
>> >>
>> >> Agree that a coder URN defines the encoding. I see that string UTF-8
>> was added to the proto enum, but it needs a written spec of the encoding.
>> Ideally some test data that different languages can use to drive compliance
>> testing.
>> >>
>> >> Kenn
>> >>
>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <[email protected]>
>> wrote:
>> >>>
>> >>> String UTF8 was recently added as a "standard coder" URN in the
>> protos, but I don't think that developed beyond Java, so adding it to
>> Python would be reasonable in my opinion.
>> >>>
>> >>> The Go SDK presently handles strings as "custom coders", which for Go
>> are always length prefixed (and reported to the Runner as LP+CustomCoder).
>> It would be straightforward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>> >>>
>> >>>
>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <[email protected]> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> It looks like the UTF-8 string coders in the Java and Python SDKs use
>> different encoding schemes. StringUtf8Coder in the Java SDK puts the varint
>> length of the input string before the actual data bytes, whereas StrUtf8Coder
>> in the Python SDK directly encodes the input string to a bytes value. For the
>> last few weeks, I've been testing and fixing cross-language IO transforms and
>> this discrepancy is a major blocker for me. IMO, we should unify the
>> encoding schemes of UTF8 strings across the different SDKs and make it a
>> standard coder. Any thoughts?
>> >>>>
>> >>>> Thanks,
>>
>
