New submission from Daniel Blanchard:
As I recently discovered when someone filed a PR on chardet (see
https://github.com/chardet/chardet/issues/70), BOMs are not handled
correctly by the endian-specific encodings UTF-16LE, UTF-16BE, UTF-32LE, and
UTF-32BE, but are by the UTF-16 and UTF-32 encodings.
For example:
>>> 'foo'.encode('utf-16le')
b'f\x00o\x00o\x00'
>>> 'foo'.encode('utf-16')
b'\xff\xfef\x00o\x00o\x00'
You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM
correctly prepended to the bytes.
If you were on a little-endian system and purposefully wanted to create a
UTF-16BE file with a BOM, the only way to do it is:
>>> import codecs
>>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
b'\xfe\xff\x00f\x00o\x00o'
This doesn't make a lot of sense to me. Why is the BOM not prepended
automatically when encoding with UTF-16BE?
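For illustration, the manual workaround can be wrapped in a small helper. This is only a sketch: encode_with_bom and the _BOMS table are hypothetical names, not part of the codecs module, and the lookup only covers the four endian-specific codecs discussed here.

```python
import codecs

# Hypothetical lookup table (not part of the codecs module): maps each
# endian-specific codec name to the matching BOM constant from codecs.
_BOMS = {
    'utf-16le': codecs.BOM_UTF16_LE,
    'utf-16be': codecs.BOM_UTF16_BE,
    'utf-32le': codecs.BOM_UTF32_LE,
    'utf-32be': codecs.BOM_UTF32_BE,
}

def encode_with_bom(text, encoding):
    """Encode text with an endian-specific codec and prepend its BOM."""
    # Normalize spellings such as 'UTF_16BE' to 'utf-16be'.
    key = encoding.lower().replace('_', '-')
    return _BOMS[key] + text.encode(key)
```

With this helper, encode_with_bom('foo', 'utf-16be') produces the same bytes as the manual concatenation above.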
Furthermore, if you were given a UTF-16BE file on a little endian system, you
might think that this would be the correct way to decode it:
>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
'\ufefffoo'
but as you can see, that leaves the BOM in the decoded string as a leading
U+FEFF. Strangely, decoding with UTF-16 works fine:
>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
'foo'
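Until the codecs behave differently, the leftover character can be dropped by hand after decoding. The following is a minimal sketch assuming Python 3; decode_strip_bom is a hypothetical name, and it treats any leading U+FEFF as a BOM, which is an assumption that may be wrong for text that legitimately starts with a zero-width no-break space.

```python
import codecs

def decode_strip_bom(data, encoding):
    """Decode bytes with an endian-specific codec, dropping a leading BOM."""
    text = data.decode(encoding)
    # A BOM decoded with the matching byte order shows up as U+FEFF.
    # Assumption: a leading U+FEFF is always a BOM, never real content.
    if text.startswith('\ufeff'):
        return text[1:]
    return text

data = codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
result = decode_strip_bom(data, 'utf-16be')
```

Here result is 'foo', matching what decoding the same bytes with plain UTF-16 returns.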
It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be
adding/removing the appropriate BOMs, and this is a long-standing bug.
----------
components: Unicode
messages: 252406
nosy: Daniel.Blanchard, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove
BOM on encode/decode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue25325>
_______________________________________