David Hopwood <[EMAIL PROTECTED]> wrote:
> Josiah Carlson wrote:
[snip]
> > Using the xml guessing mechanism is fine, as long as you get it right.
> > A first pass with BOM detection and a second pass to "guess" based on
> > content in the case that a BOM isn't detected seems to make sense.
>
> ... if you think that guessing based on content is a good idea -- I don't.
> In any case, such guessing necessarily depends on the expected file format,
> so it should be done by the application itself, or by a library that knows
> more about the format.

I'm keeping my hat out of the ring on whether guessing is a good idea.
However, if one is going to have a guessing mechanism, checking for UTF
BOMs is a good place to start, which is what I was trying to say.

> If the encoding of a text stream were settable after it had been opened,
> then it would be easy for anyone to implement whatever guessing algorithm
> they needed, without having to write an encoding implementation or include
> any other support for guessing in the I/O library itself.

That is true. But the fact that you, presumably a programmer experienced
with Unicode, provided an algorithm with an obvious hole that I was able
to discover in a few moments suggests that guessing algorithms are not
easy to write.

> (This also requires the ability to seek back to the beginning of the stream
> after reading the data needed for the guess.)
>
> > Note that the above algorithm returns UTF32BE for a files beginning with
> > 4 null bytes.
>
> Yes. But such a thing probably isn't a text file at all -- in which case
> there will be subsequent decoding errors when most of the code units are
> not in the range 0 to 0x10FFFF.

A file starting with 4 nulls is very likely a non-text file of some kind,
but presuming that "most" code points would not be in the 0...0x10FFFF
range is a bit of an assumption about the content of a file. I thought you
didn't want to guess.

 - Josiah
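
For concreteness, here is a minimal sketch of the BOM-first pass discussed above, with the seek-back that David mentions. It is not the algorithm from earlier in the thread; the names detect_bom and open_guessed and the "utf-8" fallback are hypothetical choices made for illustration. It assumes a seekable binary stream positioned at offset 0, and it treats a file with no BOM (including one starting with 4 null bytes) as "unknown" rather than UTF-32BE.

```python
import codecs
import io

# Longer BOMs are tested first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def detect_bom(head):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


def open_guessed(raw, fallback="utf-8"):
    """Wrap a seekable binary stream (at offset 0) in a text decoder.

    The encoding comes from the BOM if there is one; otherwise the
    caller-supplied fallback is used (an assumption, not a guess
    based on content).
    """
    head = raw.read(4)
    encoding, skip = detect_bom(head)
    raw.seek(skip)  # seek back to the start, just past any BOM
    return io.TextIOWrapper(raw, encoding=encoding or fallback)
```

Even this first pass is not airtight: a UTF-16-LE file whose first character after the BOM is U+0000 also starts with FF FE 00 00 and would be reported as UTF-32-LE, which rather supports the point that these algorithms are not easy to get right.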
