Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Robert Bradshaw Thu, 04 Apr 2019 11:30:16 -0700

A URN defines the encoding.

There are (unfortunately) *two* encodings defined for a Coder (defined
by a URN), the nested and the unnested one. IIRC, in both Java and
Python, the nested one prefixes with a var-int length, and the
unnested one does not.
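
To make the difference concrete, here is a minimal sketch of the two forms
for a UTF-8 string. It is not the SDK code: the function names are
hypothetical, and it assumes the var-int is the usual unsigned base-128
encoding (7 bits per byte, least significant group first, high bit set
when more bytes follow).

```python
def encode_varint(n):
    """Encode a non-negative int as an unsigned base-128 varint."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_unnested(s):
    """Unnested form: just the UTF-8 bytes, no length information."""
    return s.encode("utf-8")


def encode_nested(s):
    """Nested form: var-int byte length, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_varint(len(data)) + data


def decode_nested(buf):
    """Read one nested string; return (string, remaining bytes)."""
    length = 0
    shift = 0
    pos = 0
    while True:
        b = buf[pos]
        pos += 1
        length |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    end = pos + length
    return buf[pos:end].decode("utf-8"), buf[end:]
```

Under these assumptions "abc" is the 3 bytes b"abc" unnested and the 4
bytes b"\x03abc" nested. The prefix counts encoded bytes, not characters,
so "é" becomes b"\x02\xc3\xa9" in the nested form. Only the nested form can
be read back when several values are concatenated in one stream.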


We should define the spec clearly and have cross-language tests.

On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <[email protected]> wrote:
>
> Could this be a backwards-incompatible change that would break pipelines on
> upgrade? If they have data in flight between operators and we change the
> coder, would they break?
> I know very little about coders, but since nobody has mentioned it, I wanted 
> to make sure we have it in mind.
> -P.
>
> On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <[email protected]> wrote:
>>
>> Agree that a coder URN defines the encoding. I see that string UTF-8 was 
>> added to the proto enum, but it needs a written spec of the encoding. 
>> Ideally there would also be some test data that different languages can use
>> to drive compliance testing.
>>
>> Kenn
>>
>> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <[email protected]> wrote:
>>>
>>> String UTF8 was recently added as a "standard coder" URN in the protos,
>>> but I don't think that was developed beyond Java, so adding it to Python
>>> would be reasonable in my opinion.
>>>
>>> The Go SDK presently handles strings as "custom coders", which for Go are
>>> always length prefixed (and reported to the Runner as LP+CustomCoder). It
>>> would be straightforward to add the correct handling for strings, as Go
>>> natively treats strings as UTF8.
>>>
>>>
>>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> It looks like the UTF-8 string coders in the Java and Python SDKs use
>>>> different encoding schemes. StringUtf8Coder in the Java SDK puts the
>>>> varint length of the input string before the actual data bytes, whereas
>>>> StrUtf8Coder in the Python SDK directly encodes the input string to a
>>>> bytes value. For the last few weeks, I've been testing and fixing
>>>> cross-language IO transforms, and this discrepancy is a major blocker for
>>>> me. IMO, we should unify the encoding schemes of UTF8 strings across the
>>>> different SDKs and make it a standard coder. Any thoughts?
>>>>
>>>> Thanks,
