[ https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Karp updated AVRO-1517:
----------------------------

    Description: 
By default in perl, a string is a sequence of bytes, values 0-255. However, if 
a Unicode character is included that cannot be represented with a single byte, 
the string gets 'upgraded' to a non-byte-based Unicode string allowing ordinals 
outside that range. When string operations are done with byte and non-byte 
Unicode strings, the result is always non-byte, with the byte string first 
'upgraded'. Upgrading consists of utf8 encoding and setting a utf8 flag on the 
string. ('utf8' is a variant of UTF-8 used by perl)
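
As a rough illustration of the upgrade described above, the following sketch joins a byte string with a string holding a wide character. The variable names are made up for the example; utf8::is_utf8 is the core function that reports perl's internal utf8 flag.

```perl
use strict;
use warnings;

my $byte_str = "caf\xE9";      # four characters, all ordinals <= 255
my $wide_str = "\x{263A}";     # one character, ordinal 9786, needs more than a byte

# utf8::is_utf8 reports whether perl's internal utf8 flag is set on a string.
my $before = utf8::is_utf8($byte_str) ? 1 : 0;   # 0: still a plain byte string

# Joining the two forces the byte string's contents to be upgraded.
my $joined = $byte_str . $wide_str;
my $after  = utf8::is_utf8($joined) ? 1 : 0;     # 1: the result is a non-byte string

printf "flag before: %d, flag after: %d, length: %d\n",
    $before, $after, length($joined);
# prints "flag before: 0, flag after: 1, length: 5"
```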

The perl Avro API currently accepts these Unicode strings as-is for the 'bytes' 
type. This is a problem because 1) values >255 are not valid as bytes, and any 
encoding is the caller's job, and 2) as Avro assembles the serialized data, perl 
'upgrades' all of it, which has the effect of utf8 encoding our serialized 
binary data.
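
A minimal sketch of problem 2, assuming a made-up two-byte buffer standing in for already-serialized data. It does not use the Avro API; it only approximates what ends up on a raw filehandle by taking the utf8 encoding of the combined buffer.

```perl
use strict;
use warnings;

my $serialized = "\x02\x80";    # hypothetical already-serialized binary data (2 bytes)
my $bad_bytes  = "\x{263A}";    # Unicode string wrongly passed as a 'bytes' value

# Appending the wide string upgrades the whole buffer.
my $buffer = $serialized . $bad_bytes;

# Approximate what reaches the output: the utf8 encoding of every character,
# including the ones that were originally raw binary bytes.
my $on_wire = $buffer;
utf8::encode($on_wire);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $on_wire;
print "$hex\n";
# prints "02 C2 80 E2 98 BA": the original byte 0x80 has become C2 80
```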

The correct behavior is for the perl Avro API to attempt to downgrade the 
string and, if this fails because it contains values >255, to raise an 
error. (The behavior of 'string' won't change; it will still take Unicode 
strings as expected.)
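
The downgrade-or-error behavior could look roughly like the sketch below. The helper name bytes_or_croak is hypothetical and this is not the code from the attached patch; it relies only on the core utf8::downgrade function, which returns false rather than dying when its second argument is true.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return a byte-string copy of $value, or raise an error.
sub bytes_or_croak {
    my ($value) = @_;
    my $copy = $value;    # work on a copy so the caller's string is untouched
    utf8::downgrade($copy, 1)
        or croak "bytes value contains a character with ordinal > 255";
    return $copy;
}

# A flagged string whose characters all fit in a byte can be downgraded.
my $flagged = "\xE9";
utf8::upgrade($flagged);
my $ok = bytes_or_croak($flagged);
printf "flag: %d, length: %d\n", (utf8::is_utf8($ok) ? 1 : 0), length($ok);
# prints "flag: 0, length: 1"

# A string with a wide character cannot, so an error is raised.
my $err = eval { bytes_or_croak("\x{263A}"); 1 } ? '' : $@;
print "error raised\n" if $err;
```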

  was:
By default in perl, a string is a sequence of bytes, values 0-255. However, if 
a Unicode character is included that cannot be represented with a single byte, 
the string gets 'upgraded' to a non-byte-based Unicode string allowing ordinals 
outside that range. When string operations are done with byte and non-byte 
Unicode strings, the result is always non-byte, with the byte string first 
'upgraded'. Upgrading consists of utf8 encoding and setting a utf8 flag on the 
string. ('utf8' is a variant of UTF-8 used by perl)

The perl Avro API is accepting these Unicode strings as-is for the 'bytes' 
type. This is a problem because 1) values >255 are not valid as bytes, and any 
encoding is their job. 2) As Avro assembles the serialized data, perl 
'upgrades' all the data, having the effect of utf8 encoding our serialized 
binary data.

The correct behavior is for the Avro perl API is to attempt to downgrade te 
strings, and if this fails because of values >255 then to raise an error. (The 
behavior of 'string' won't change, it will still take Unicode strings as 
expected.)


> Unicode strings are accepted as bytes type by perl API
> ------------------------------------------------------
>
>                 Key: AVRO-1517
>                 URL: https://issues.apache.org/jira/browse/AVRO-1517
>             Project: Avro
>          Issue Type: Bug
>          Components: perl
>            Reporter: John Karp
>            Assignee: John Karp
>         Attachments: AVRO-1517-0.patch
>
>
> By default in perl, a string is a sequence of bytes, values 0-255. However, 
> if a Unicode character is included that cannot be represented with a single 
> byte, the string gets 'upgraded' to a non-byte-based Unicode string allowing 
> ordinals outside that range. When string operations are done with byte and 
> non-byte Unicode strings, the result is always non-byte, with the byte string 
> first 'upgraded'. Upgrading consists of utf8 encoding and setting a utf8 flag 
> on the string. ('utf8' is a variant of UTF-8 used by perl)
> The perl Avro API currently accepts these Unicode strings as-is for the 
> 'bytes' type. This is a problem because 1) values >255 are not valid as 
> bytes, and any encoding is the caller's job, and 2) as Avro assembles the 
> serialized data, perl 'upgrades' all of it, which has the effect of utf8 
> encoding our serialized binary data.
> The correct behavior is for the perl Avro API to attempt to downgrade the 
> string and, if this fails because it contains values >255, to raise an 
> error. (The behavior of 'string' won't change; it will still take Unicode 
> strings as expected.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
