[ 
https://issues.apache.org/jira/browse/AVRO-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564119#comment-17564119
 ] 

Kalle Niemitalo commented on AVRO-3572:
---------------------------------------

It would be good to have a test with a record schema in which the default value 
of a "bytes" field contains all 256 code units from "\u0000" to "\u00ff", and 
the same for a "fixed" field. The test would then read a record that was 
written using a schema that lacks those fields, and assert that the resulting 
values of the fields are correct. That would show that the transformation from 
string to bytes handles all 256 byte values correctly.
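
As a rough sketch of the core assertion such a test could make, the snippet 
below builds a schema whose "bytes" and "fixed" defaults cover every code unit 
and checks the string-to-bytes mapping directly. It does not call the avro 
package: default_to_bytes is a hypothetical stand-in for whatever conversion 
the library performs, and the record and field names are made up for 
illustration. A real test would write and read records through the library 
instead.

```python
import json

# Hypothetical helper, not part of the avro package: maps each code point
# 0-255 of a JSON string default to the byte with the same value.
def default_to_bytes(default: str) -> bytes:
    return default.encode("iso-8859-1")

# One string holding every code unit from "\u0000" to "\u00ff".
all_code_units = "".join(chr(i) for i in range(256))

# Illustrative reader schema; the names AllBytes, b, f and F256 are assumptions.
schema_json = json.dumps({
    "type": "record",
    "name": "AllBytes",
    "fields": [
        {"name": "b", "type": "bytes", "default": all_code_units},
        {"name": "f",
         "type": {"type": "fixed", "name": "F256", "size": 256},
         "default": all_code_units},
    ],
})

# Round-trip through JSON text, as a schema file would, then convert.
fields = json.loads(schema_json)["fields"]
expected = bytes(range(256))
results = [default_to_bytes(field["default"]) for field in fields]
assert all(result == expected for result in results)
```

Going through json.dumps and json.loads also exercises the "\u00XX" escapes 
that a schema file would contain.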

> Python encodes default value of bytes field as UTF-8
> ----------------------------------------------------
>
>                 Key: AVRO-3572
>                 URL: https://issues.apache.org/jira/browse/AVRO-3572
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: python
>    Affects Versions: 1.11.0
>         Environment: Python 3.9.2
>            Reporter: Kalle Niemitalo
>            Priority: Minor
>
> The Avro spec says
> bq. Default values for bytes and fixed fields are JSON strings, where Unicode 
> code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
> but in the Avro library for Python, [_read_default_value calls 
> str.encode|https://github.com/apache/avro/blob/release-1.11.0/lang/py/avro/io.py#L958-L959]
>  to convert the JSON string to bytes, and [str.encode in Python 
> 3|https://docs.python.org/3/library/stdtypes.html#str.encode] uses UTF-8 by 
> default. So, this miscodes bytes 0x80 and higher. For example, the JSON 
> string "\u0080" becomes two bytes b'\xc2\x80' even though it should become 
> one byte b'\x80'.
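
The example from the description can be reproduced with the standard library 
alone. This is only an illustration of the two encodings; using "iso-8859-1" 
here is an assumption about one codec that matches the mapping in the spec, 
not a statement of how the library was or should be changed.

```python
import json

# The JSON string "\u0080" decodes to a single code point, U+0080.
default = json.loads('"\\u0080"')

# str.encode() with no argument uses UTF-8, which needs two bytes for U+0080.
utf8_bytes = default.encode()

# ISO-8859-1 maps code points 0-255 one-to-one onto byte values 0-255.
latin1_bytes = default.encode("iso-8859-1")

print(utf8_bytes)    # b'\xc2\x80'
print(latin1_bytes)  # b'\x80'
```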



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
