[jira] [Updated] (AVRO-1517) Unicode strings are accepted as bytes type by perl API

John Karp (JIRA) Fri, 23 May 2014 09:53:39 -0700

     [ 
https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Karp updated AVRO-1517:
----------------------------

    Attachment: AVRO-1517-0.patch

Attaching patch.

> Unicode strings are accepted as bytes type by perl API
> ------------------------------------------------------
>
>                 Key: AVRO-1517
>                 URL: https://issues.apache.org/jira/browse/AVRO-1517
>             Project: Avro
>          Issue Type: Bug
>          Components: perl
>            Reporter: John Karp
>            Assignee: John Karp
>         Attachments: AVRO-1517-0.patch
>
>
> By default in Perl, a string is a sequence of bytes with values 0-255.
> However, if it includes a Unicode character that cannot be represented in a
> single byte, the string is 'upgraded' to a non-byte Unicode string that
> allows ordinals outside that range. When a string operation combines byte
> and non-byte Unicode strings, the result is always non-byte, and the byte
> string is 'upgraded' first. Upgrading consists of utf8-encoding the
> string's internal representation and setting its utf8 flag. ('utf8' is
> Perl's variant of UTF-8.)
> The Perl Avro API accepts these Unicode strings as-is for the 'bytes'
> type. This is a problem because: 1) bytes and Unicode characters are not
> interchangeable, and a user who declares that they will provide bytes
> should provide bytes; any encoding is their job. 2) As Avro assembles
> the serialized data, Perl 'upgrades' all of it, which has the effect of
> utf8-encoding our serialized binary data.
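
The upgrade behavior described above can be reproduced with core Perl alone.
The following is a minimal sketch, not code from the Avro Perl API or the
attached patch; the variable names are made up for illustration. It shows
that concatenating a byte string with a Unicode string yields a flagged
string, and that writing the result out as UTF-8 re-encodes the original
bytes.

```perl
use strict;
use warnings;

# A plain byte string: two ordinals in the range 0-255, utf8 flag off.
my $bytes = "\xC3\xA9";
# A character string holding U+263A, which cannot fit in a single byte.
my $chars = "\x{263A}";

printf "bytes flagged: %d\n", utf8::is_utf8($bytes) ? 1 : 0;
printf "chars flagged: %d\n", utf8::is_utf8($chars) ? 1 : 0;

# Concatenation upgrades the byte string; the result is a character string
# of three characters (0xC3, 0xA9, 0x263A).
my $mixed = $bytes . $chars;
printf "mixed flagged: %d, length: %d\n",
    (utf8::is_utf8($mixed) ? 1 : 0), length($mixed);

# Encoding the mixed string as UTF-8 octets turns each of the two original
# bytes into a two-octet sequence, so the binary payload is no longer intact.
my $octets = $mixed;
utf8::encode($octets);
printf "octets written: %d\n", length($octets);
```

The two input bytes come out as four octets (plus three for U+263A), which
is the kind of corruption point 2) refers to.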
> The correct behavior is for the Avro Perl API to raise an error when it is
> encoding 'bytes' and a Unicode string has been provided. (The behavior of
> 'string' won't change; it will still take Unicode strings as expected.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
