Nicholas Bastin wrote:
> On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
>
> [...]
>> If you do have UTF-16 without a BOM, it's much better
>> to let a short function analyze the text by reading the first
>> few bytes of the file and then make an educated guess based
>> on the findings. You can then process the file using one
>> of the other codecs, UTF-16-LE or -BE.
>
> This is about what we do now - we catch UnicodeError and
> then add a BOM to the file, and read it again. We know
> our files are UTF-16BE if they don't have a BOM, as the
> files are written by code which observes the spec.
> We can't use UTF-16BE all the time, because sometimes
> they're UTF-16LE, and in those cases the BOM is set.
>
> It would be nice if you could optionally specify that the
> codec would assume UTF-16BE if no BOM was present,
> and not raise UnicodeError in that case, which would
> preserve the current behaviour as well as allow users
> to ask for behaviour which conforms to the standard.

It should be feasible to implement your own codec for that,
based on Lib/encodings/utf_16.py. Simply replace this line
in StreamReader.decode():

    raise UnicodeError,"UTF-16 stream does not start with BOM"

with:

    self.decode = codecs.utf_16_be_decode

and you should be done.

> [...]

Bye,
   Walter Dörwald
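
For readers on current Python 3, here is a minimal sketch of the idea described above: a stream reader that honours a BOM when one is present and otherwise falls back to UTF-16BE instead of raising. The class name `Utf16DefaultBEStreamReader` is hypothetical and not part of the standard library. Note one deliberate difference from the one-line change: this sketch inspects the first two bytes itself and decodes the first chunk as big-endian too, on the assumption that a BOM-less first chunk should not be interpreted in the platform's native byte order.

```python
import codecs
import io
from encodings import utf_16


class Utf16DefaultBEStreamReader(utf_16.StreamReader):
    """Hypothetical reader: UTF-16 with BOM detection, big-endian if no BOM."""

    def decode(self, input, errors="strict"):
        if len(input) < 2:
            # Not enough bytes to look for a BOM yet; wait for more data.
            return "", 0
        bom = bytes(input[:2])
        if bom == codecs.BOM_UTF16_LE:
            # Later chunks go straight to the little-endian decoder.
            self.decode = codecs.utf_16_le_decode
            text, consumed = codecs.utf_16_le_decode(input[2:], errors, False)
            return text, consumed + 2  # account for the skipped BOM
        if bom == codecs.BOM_UTF16_BE:
            self.decode = codecs.utf_16_be_decode
            text, consumed = codecs.utf_16_be_decode(input[2:], errors, False)
            return text, consumed + 2
        # No BOM: assume big-endian instead of raising UnicodeError.
        self.decode = codecs.utf_16_be_decode
        return codecs.utf_16_be_decode(input, errors, False)


# Usage: wrap any binary stream; the sample bytes carry no BOM.
raw = "spam".encode("utf-16-be")
reader = Utf16DefaultBEStreamReader(io.BytesIO(raw))
text = reader.read()
```

The inherited `reset()` from `encodings.utf_16.StreamReader` removes the per-instance `decode` override, so BOM detection should run again after a `seek()` back to the start of the stream.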