On Sunday 10 September 2006 at 12:02 -0700, Paul Prescod wrote:
> Your algorithm is more predictable but will confuse BOM-less UTF-8
> with the system encoding frequently.
I don't think it is desirable to recognize only some kinds of UTF-8; it
will confuse the hell out of programmers and users. I'm not sure
full-blown statistical analysis is necessary anyway. There should be an
ordered list of detectable encodings, which realistically would be
[all Unicode variants, system default]. Then if you have a file that is
syntactically valid UTF-8, it most likely /is/ UTF-8 and not ISO-8859-1
(for example); a rough sketch of that check is in the postscript below.

> Modern I/O is astonishingly fast anyhow. On my computer it takes five
> seconds to decode a quarter gigabyte of UTF-8 text through Python.

Maybe we shouldn't be that presumptuous. Modern I/O is fast, but memory
is not infinite: that quarter gigabyte will have swapped out other
data/code in order to make room in the filesystem cache. Also, Python is
often used on more modest hardware.

Regards

Antoine.
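P.S.: for illustration only, a minimal, untested sketch of the ordered
detection described above; the function name, the candidate order and
the fallback behaviour are made up for this example, not an actual
proposal:

import codecs
import locale

def guess_encoding(data):
    """Guess the encoding of raw bytes by walking an ordered list of
    detectable encodings, with the system default tried last."""
    # A BOM settles the question immediately (check UTF-32 before
    # UTF-16, since the UTF-32-LE BOM starts with the UTF-16-LE one).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: data that decodes as strict UTF-8 most likely *is* UTF-8
    # rather than, say, ISO-8859-1; only then fall back to the system
    # encoding.
    for encoding in ("utf-8", locale.getpreferredencoding()):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None

Real statistical detection would obviously be more involved; the point
is just that plain strict UTF-8 validation already disambiguates the
common case.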
