David Hopwood <[EMAIL PROTECTED]> wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
>
>   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>   corresponding byte is past end-of-stream.
>
>   Let other be any encoding which is to be used as a default if no specific
>   UTF is detected.
>
>   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>   if B == None: return other
>   if A == 0 and B == 0 and D != None: return UTF32BE
>   if C == 0 and D == 0: return UTF32LE
>   if A == 0xFE and B == 0xFF: return UTF16BE
>   if A == 0xFF and B == 0xFE: return UTF16LE
>   if A != 0 and B != 0: return other
>   if A == 0: return UTF16BE
>   return UTF16LE
>
> This would normally be used with 'other' as the system encoding, as an
> alternative to just assuming that the file is in the system encoding.
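
For reference, the quoted pseudocode can be transcribed into Python more or
less line for line. This is only a sketch: the function name guess_utf and the
returned codec-name strings are my own choices, not part of the proposal, and
the caller has to supply 'other' explicitly.

```python
def guess_utf(prefix, other):
    """Guess a UTF from the first bytes of a stream.

    prefix: bytes object holding the start of the stream (only the first
            4 bytes are examined).
    other:  encoding name to return when no specific UTF is detected.

    The returned names are ordinary Python codec names chosen for this
    sketch; the quoted algorithm only speaks of UTF8, UTF16BE, and so on.
    """
    head = list(prefix[:4])
    # Bytes past end-of-stream are represented as None, as in the proposal.
    a, b, c, d = head + [None] * (4 - len(head))

    if a == 0xEF and b == 0xBB and c == 0xBF:
        return "utf-8"       # UTF-8 signature (BOM)
    if b is None:
        return other         # fewer than 2 bytes: nothing to go on
    if a == 0 and b == 0 and d is not None:
        return "utf-32-be"   # 00 00 xx xx
    if c == 0 and d == 0:
        return "utf-32-le"   # xx xx 00 00
    if a == 0xFE and b == 0xFF:
        return "utf-16-be"   # UTF-16 big-endian BOM
    if a == 0xFF and b == 0xFE:
        return "utf-16-le"   # UTF-16 little-endian BOM
    if a != 0 and b != 0:
        return other         # no null byte in the first two positions
    if a == 0:
        return "utf-16-be"   # 00 xx
    return "utf-16-le"       # xx 00
```

Note that the order of the tests matters: FF FE 00 00 has to hit the UTF32LE
test before the UTF16LE BOM test, which is why the two UTF32 checks come first.
The function only names an encoding; stripping any BOM before decoding is left
to the caller.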
Using the XML guessing mechanism is fine, as long as you get it right. A
first pass with BOM detection, followed by a second pass that "guesses" from
content when no BOM is detected, seems to make sense.

Note that the above algorithm returns UTF32BE for a file beginning with
4 null bytes.

 - Josiah
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
