Kalle Niemitalo created AVRO-3572:
-------------------------------------

             Summary: Python encodes default value of bytes field as UTF-8
                 Key: AVRO-3572
                 URL: https://issues.apache.org/jira/browse/AVRO-3572
             Project: Apache Avro
          Issue Type: Bug
          Components: python
    Affects Versions: 1.11.0
         Environment: Python 3.9.2
            Reporter: Kalle Niemitalo


The Avro spec says

bq. Default values for bytes and fixed fields are JSON strings, where Unicode 
code points 0-255 are mapped to unsigned 8-bit byte values 0-255.

but in the Avro library for Python, [_read_default_value calls 
str.encode|https://github.com/apache/avro/blob/release-1.11.0/lang/py/avro/io.py#L958-L959]
 to convert the JSON string to bytes, and [str.encode in Python 
3|https://docs.python.org/3/library/stdtypes.html#str.encode] uses UTF-8 by 
default. As a result, code points U+0080 through U+00FF are encoded 
incorrectly. For example, the JSON string "\u0080" becomes the two bytes 
b'\xc2\x80' when it should become the single byte b'\x80'.
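
The difference can be reproduced without Avro. The following is only a sketch: default_to_bytes is a hypothetical helper, not a function in the avro package, and using the ISO-8859-1 codec is one possible way to get the mapping the spec describes, not necessarily the fix the project would choose.

```python
# Hypothetical helper (not part of the avro package): convert the JSON
# string default of a bytes/fixed field to bytes as the spec describes.
def default_to_bytes(json_default: str) -> bytes:
    # ISO-8859-1 maps code points 0-255 one-to-one to byte values 0-255
    # and raises UnicodeEncodeError for any code point above U+00FF.
    return json_default.encode("iso-8859-1")


# What str.encode does with no argument: UTF-8, two bytes for U+0080.
utf8_result = "\u0080".encode()

# What the spec asks for: a single byte 0x80.
latin1_result = default_to_bytes("\u0080")

# Round trip over every code point the spec allows.
all_bytes = default_to_bytes("".join(chr(i) for i in range(256)))

print(utf8_result, latin1_result, all_bytes == bytes(range(256)))
```

A side effect of this codec is that a default containing a code point above U+00FF raises UnicodeEncodeError rather than silently producing extra bytes.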



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
