Re: [Python-3000] Pre-PEP: Easy Text File Decoding

Josiah Carlson Sun, 10 Sep 2006 11:22:40 -0700

David Hopwood <[EMAIL PROTECTED]> wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
> 
>    Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>      corresponding byte is past end-of-stream.
> 
>    Let other be any encoding which is to be used as a default if no specific
>      UTF is detected.
> 
>    if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>    if B == None: return other
>    if A == 0 and B == 0 and D != None: return UTF32BE
>    if C == 0 and D == 0: return UTF32LE
>    if A == 0xFE and B == 0xFF: return UTF16BE
>    if A == 0xFF and B == 0xFE: return UTF16LE
>    if A != 0 and B != 0: return other
>    if A == 0: return UTF16BE
>    return UTF16LE
> 
> This would normally be used with 'other' as the system encoding, as an
> alternative to just assuming that the file is in the system encoding.


Using the XML guessing mechanism is fine, as long as you get it right.
A first pass with BOM detection and a second pass to "guess" based on
content in the case that a BOM isn't detected seems to make sense.
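
A minimal sketch of the BOM-detection first pass, assuming Python 3 and the
BOM constants from the standard codecs module. The names detect_bom and
_BOMS are hypothetical, chosen here for illustration; the second,
content-based pass is deliberately left out.

```python
import codecs

# Check the 4-byte BOMs before the 2-byte ones: the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(head):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found.

    `head` is the first few bytes of the stream (at least 4 if available).
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would fall through to content-based guessing only when detect_bom
returns (None, 0). One consequence of this ordering is that a UTF-16-LE file
whose first character after the BOM is U+0000 is reported as UTF-32-LE.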

Note that the above algorithm returns UTF32BE for a file beginning with
4 null bytes.
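
To make that concrete, here is a line-by-line transcription of the quoted
algorithm, assuming Python 3 (where indexing bytes yields integers). The
function name guess_encoding and the codec-name strings are choices made
for this sketch, not part of the original proposal.

```python
def guess_encoding(head, other):
    """Transcription of the quoted guessing rules.

    `head` holds the leading bytes of the stream; `other` is the fallback
    encoding name returned when no specific UTF is detected.
    """
    # A, B, C, D: the first 4 bytes, or None past end-of-stream.
    a, b, c, d = (list(head[:4]) + [None] * 4)[:4]

    if a == 0xEF and b == 0xBB and c == 0xBF:
        return "utf-8"
    if b is None:
        return other
    if a == 0 and b == 0 and d is not None:
        return "utf-32-be"
    if c == 0 and d == 0:
        return "utf-32-le"
    if a == 0xFE and b == 0xFF:
        return "utf-16-be"
    if a == 0xFF and b == 0xFE:
        return "utf-16-le"
    if a != 0 and b != 0:
        return other
    if a == 0:
        return "utf-16-be"
    return "utf-16-le"


# Four null bytes match the third rule (A == 0, B == 0, D present).
print(guess_encoding(b"\x00\x00\x00\x00", "latin-1"))
```

The final line prints utf-32-be, since the third rule fires before any later
rule gets a chance to examine C or D.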

 - Josiah

_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
