Re: [Python-Dev] Unicode byte order mark decoding

Walter Dörwald Thu, 07 Apr 2005 14:32:33 -0700

Nicholas Bastin said:

> On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
>
> [...]
>> If you do have UTF-16 without a BOM it's much better
>> to let a short function analyze the text by reading the first
>> few bytes of the file and then make an educated guess based
>> on the findings. You can then process the file using one
>> of the other codecs UTF-16-LE or -BE.
>
> This is about what we do now - we catch UnicodeError and
> then add a BOM to the file, and read it again. We know
> our files are UTF-16BE if they don't have a BOM, as the
> files are written by code which observes the spec.
> We can't use UTF-16BE all the time, because sometimes
> they're UTF-16LE, and in those cases the BOM is set.
>
> It would be nice if you could optionally specify that the
> codec would assume UTF-16BE if no BOM was present,
> and not raise UnicodeError in that case, which would
> preserve the current behaviour as well as allow users
> to ask for behaviour which conforms to the standard.


It should be feasible to implement your own codec for that
based on Lib/encodings/utf_16.py. Simply replace the line
in StreamReader.decode():
   raise UnicodeError,"UTF-16 stream does not start with BOM"
with:
   self.decode = codecs.utf_16_be_decode
and you should be done.
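
For data that is already in memory, the same fallback can be sketched
without touching the codec module at all. This is only a rough
illustration in present-day Python 3 syntax, not the code from
Lib/encodings/utf_16.py; the function name decode_utf16_default_be is
made up for this example. It honours a BOM if one is present and
otherwise assumes UTF-16BE:

```python
import codecs


def decode_utf16_default_be(data, errors="strict"):
    """Decode UTF-16 bytes; assume big-endian when no BOM is present.

    Hypothetical helper for illustration only, not part of the
    standard library.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM found: strip it and decode as UTF-16LE.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le", errors)
    if data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM found: strip it and decode as UTF-16BE.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors)
    # No BOM: fall back to big-endian instead of raising an error.
    return data.decode("utf-16-be", errors)
```

Calling it on the full contents of a file opened in binary mode gives
the behaviour Nicholas asked for, without the catch-and-retry step.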

> [...]

Bye,
   Walter Dörwald
