[
https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martin Tzvetanov Grigorov updated AVRO-1517:
--------------------------------------------
Fix Version/s: 1.12.0
(was: 1.9.0)
Assignee: José Joaquín Atria (was: John Karp)
Resolution: Fixed
Status: Resolved (was: Patch Available)
> Unicode strings are accepted as bytes and fixed type by perl API
> ----------------------------------------------------------------
>
> Key: AVRO-1517
> URL: https://issues.apache.org/jira/browse/AVRO-1517
> Project: Apache Avro
> Issue Type: Bug
> Components: perl
> Reporter: John Karp
> Assignee: José Joaquín Atria
> Priority: Major
> Fix For: 1.12.0
>
> Attachments: AVRO-1517.patch
>
>
> By default in perl, a string is a sequence of bytes, values 0-255. However,
> if a Unicode character is included that cannot be represented with a single
> byte, the string gets 'upgraded' to a non-byte-based Unicode string allowing
> ordinals outside that range. When string operations are done with byte and
> non-byte Unicode strings, the result is always non-byte, with the byte string
> first 'upgraded'. Upgrading consists of utf8 encoding and setting a utf8 flag
> on the string. ('utf8' is a variant of UTF-8 used by perl)
> The perl Avro API is accepting these Unicode strings as-is for the 'bytes'
> type. This is a problem because 1) values >255 are not valid as bytes, and
> any encoding is the caller's job, and 2) as Avro assembles the serialized
> data, perl 'upgrades' all the data, having the effect of utf8 encoding our
> serialized binary data.
> The correct behavior is for the Avro perl API to attempt to downgrade the
> string and, if this fails because it contains values >255, to raise an
> error. (The behavior of 'string' won't change; it will still take Unicode
> strings as expected.)
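
The downgrade-or-error behavior described above might look roughly like the sketch below. It is not taken from the attached patch: the helper name to_bytes_or_die and its error message are hypothetical, and only the core utf8:: functions (available without 'use utf8') are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Avro perl API): return the byte-string
# form of $value, or die if it holds any character above 255.
sub to_bytes_or_die {
    my ($value) = @_;
    my $copy = $value;    # work on a copy so the caller's scalar is untouched
    # With a true second argument, utf8::downgrade returns false on failure
    # instead of dying, so the error can be raised in our own words.
    utf8::downgrade($copy, 1)
        or die "value contains characters >255 and cannot be used as bytes\n";
    return $copy;
}

# A byte string that perl has 'upgraded' internally still downgrades cleanly.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);
my $bytes = to_bytes_or_die($upgraded);
print "length: ", length($bytes), ", utf8 flag: ",
    (utf8::is_utf8($bytes) ? "on" : "off"), "\n";

# A string with a character outside 0-255 is rejected.
my $accepted = eval { to_bytes_or_die("\x{263A}"); 1 };
print $accepted ? "accepted\n" : "rejected: $@";
```

Working on a copy is a design choice in this sketch: utf8::downgrade modifies its argument in place, and a validation step should probably not alter the caller's data as a side effect.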