> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
Your algorithm is more predictable, but it will frequently mistake BOM-less UTF-8 for the system encoding. I haven't decided in my own mind whether that trade-off is worth making. It will work well for:
* Windows users, who will often find a BOM in their UTF-8
* Western Unix/Linux users who will increasingly use UTF-8 as their system encoding
It will not work well for:
* Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as" UTF-8
* Mac users using UTF-8 apps or saving as UTF-8.
I still haven't decided how I feel about that trade-off.
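To make the failure mode concrete, here is a rough sketch. It is not the algorithm quoted above: guess_from_bom is a hypothetical stand-in that only looks for a UTF-8 BOM, and "latin-1" is an assumed system encoding chosen for illustration. The point is that the wrong guess does not raise an error; it silently produces garbage.

```python
import codecs

text = "naïve café"
utf8_no_bom = text.encode("utf-8")
utf8_with_bom = codecs.BOM_UTF8 + utf8_no_bom


def guess_from_bom(data, system_encoding):
    """Hypothetical BOM-only guesser: UTF-8 if a BOM is present, else the system encoding."""
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops the leading BOM.
        return "utf-8-sig"
    return system_encoding


# With a BOM (the common Windows case) the guess is right.
right = utf8_with_bom.decode(guess_from_bom(utf8_with_bom, "latin-1"))

# Without a BOM the guess falls back to latin-1, which accepts any byte
# sequence, so decoding succeeds but the non-ASCII characters are mangled.
wrong = utf8_no_bom.decode(guess_from_bom(utf8_no_bom, "latin-1"))

print(right == text)
print(wrong == text)
```

If the assumed system encoding were itself UTF-8 (the Western Unix/Linux case), both files would decode correctly, which is why the trade-off depends so much on who the user is.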
Maybe the guessing algorithm should read the WHOLE FILE. After all, we've said repeatedly that it isn't for production use, so making it a bit inefficient is not a big problem and might even emphasize that point.
Modern I/O is astonishingly fast anyhow. On my computer it takes five seconds to decode a quarter gigabyte of UTF-8 text through Python. That would be a totally unacceptable waste for a production program, but for a quick hack it wouldn't be bad. And it would guarantee that you would never get an exception half-way through your parsing because of a bad character.
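Something like the following is what I have in mind. Treat it as a minimal sketch under my own assumptions, not a proposal for the actual API: the function names are made up, and falling back to locale.getpreferredencoding() is just one possible definition of "the system encoding".

```python
import codecs
import locale


def guess_encoding_from_bytes(data, fallback):
    """Guess an encoding after looking at ALL of the bytes, not just a prefix."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Validate the entire content as UTF-8. If this succeeds, no later
        # read of the same bytes can fail half-way through with a bad character.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


def guess_encoding(path, fallback=None):
    """Read the whole file (deliberately inefficient) and guess its encoding."""
    if fallback is None:
        # Assumption: the locale's preferred encoding stands in for the system encoding.
        fallback = locale.getpreferredencoding(False)
    with open(path, "rb") as f:
        data = f.read()
    return guess_encoding_from_bytes(data, fallback)
```

Note that this only guarantees no decoding exception when the answer is UTF-8. If it falls back to the system encoding, the text may still be wrong or fail to decode, depending on what that encoding is.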
Paul Prescod
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
