Nicholas Bastin wrote:
> On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
>
> I've actually been confused about this point for quite some time now,
> but never had a chance to bring it up. I do not understand why
> UnicodeError should be raised if there is no BOM. I know that PEP-100
> says:
>
> 'utf-16': 16-bit variable length encoding (little/big endian)
>
> and:
>
> Note: 'utf-16' should be implemented by using and requiring byte order
> marks (BOM) for file input/output.
>
> But this appears to be in error, at least in the current unicode
> standard. 'utf-16', as defined by the unicode standard, is big-endian
> in the absence of a BOM:
>
> ---
> 3.10.D42: UTF-16 encoding scheme:
> ...
> * The UTF-16 encoding scheme may or may not begin with a BOM. However,
> when there is no BOM, and in the absence of a higher-level protocol, the
> byte order of the UTF-16 encoding scheme is big-endian.
> ---

The problem is "in the absence of a higher level protocol": the
codec doesn't know anything about a protocol - it's the application
using the codec that knows which protocol gets used. It's a lot
safer to require the BOM for UTF-16 streams and raise an exception,
so that the application decides whether to use UTF-16-BE or the
by far more common UTF-16-LE.

Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration
parameter, not merely a signature.

In terms of history, I don't recall whether your quote was already
in the standard at the time I wrote the PEP. You are the first to
have reported a problem with the current implementation (which has
been around since 2000), so I believe that application writers
are more comfortable with the way the UTF-16 codec is currently
implemented. Explicit is better than implicit :-)

> The current implementation of the utf-16 codecs makes for some
> irritating gymnastics to write the BOM into the file before reading it
> if it contains no BOM, which seems quite like a bug in the codec.

The codec writes a BOM in the first call to .write() - it doesn't
write a BOM before reading from the file.

> I allow for the possibility that this was ambiguous in the standard when
> the PEP was written, but it is certainly not ambiguous now.

See above.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 07 2005)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
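
A rough sketch of the "application decides" idea discussed in this message, in case it helps: the helper below looks for a UTF-16 BOM itself and only falls back to a fixed byte order when the caller names one explicitly. The function name decode_utf16 and its default_byteorder parameter are made up for illustration and are not part of the codecs module; only codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE and the 'utf-16-le' / 'utf-16-be' codecs are real.

```python
import codecs


def decode_utf16(data, default_byteorder=None):
    """Decode UTF-16 bytes, using the BOM as the byte-order switch.

    Hypothetical helper: default_byteorder stands in for the
    "higher-level protocol" and may be "little", "big" or None.
    """
    # A BOM, when present, settles the byte order and is stripped.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")

    # No BOM: only the application knows which byte order applies.
    if default_byteorder == "little":
        return data.decode("utf-16-le")
    if default_byteorder == "big":
        return data.decode("utf-16-be")

    # Nothing to go on, so refuse to guess.
    raise UnicodeError("UTF-16 data has no BOM and no byte order was given")
```

An application that wants the big-endian default from the Unicode standard would call decode_utf16(data, default_byteorder="big"); one that knows its input is little-endian passes "little" instead. Either way the choice is spelled out at the call site.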