On 9/10/06, David Hopwood <[EMAIL PROTECTED]> wrote:
Here is a very simple, reasonably (although not completely) safe, and much
more predictable guessing algorithm, based on a generalization of
<http://www.w3.org/TR/REC-xml/#sec-guessing>:

Your algorithm is more predictable, but it will frequently confuse BOM-less UTF-8 with the system encoding. I haven't decided whether that trade-off is worth making. It will work well for:

 * Windows users, who will often find a BOM in their UTF-8

 * Western Unix/Linux users who will increasingly use UTF-8 as their system encoding

It will not work well for:

 * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as" UTF-8

 * Mac users using UTF-8 apps or saving as UTF-8.

I still haven't decided how I feel about that trade-off.

Maybe the guessing algorithm should read the WHOLE FILE. After all, we've said repeatedly that it isn't for production use, so making it a bit inefficient is not a big problem and might even emphasize that point.

Modern I/O is astonishingly fast anyhow. On my computer it takes five seconds to decode a quarter gigabyte of UTF-8 text through Python. That would be a totally unacceptable waste for a production program, but for a quick hack it wouldn't be bad. And it would guarantee that you would never get an exception half-way through your parsing because of a bad character.
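
To make the whole-file idea concrete, here is a minimal sketch in Python. It is not David's algorithm and not a proposed API: the function name guess_and_decode, the fallback parameter, and the order of the checks (BOM first, then strict UTF-8, then the system encoding) are all assumptions made for illustration. The point is only that decoding the complete byte string up front means any bad byte is reported before parsing starts.

```python
import codecs
import locale

# Hypothetical BOM table; UTF-32 is listed before UTF-16 because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_and_decode(data, fallback=None):
    """Decode the complete contents of a file; return (text, encoding)."""
    # 1. A BOM settles the question; these codecs consume the BOM.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data.decode(name), name
    # 2. No BOM: accept UTF-8 only if the WHOLE input is valid UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise assume the system encoding (an assumption of this
    #    sketch). This can still raise, but it raises here, up front,
    #    rather than half-way through parsing.
    name = fallback or locale.getpreferredencoding(False)
    return data.decode(name), name
```

Usage would be something like guess_and_decode(open(path, "rb").read()). Text in a legacy encoding that also happens to be valid UTF-8 would still be guessed as UTF-8, so this only shrinks the confusion between BOM-less UTF-8 and the system encoding; it does not remove it.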

 Paul Prescod

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000