Paul Prescod wrote:
> On 9/10/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>
>> ... if you think that guessing based on content is a good idea -- I
>> don't. In any case, such guessing necessarily depends on the expected file
>> format, so it should be done by the application itself, or by a library that
>> knows more about the format.
>
> I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> it probably is that encoding.

That is quite false for UTF-16, at least. It is also false for short UTF-8
files.

> I don't see how it matters whether the
> file represents Latex or an .htaccess file. XML is a special case
> because it is specially designed to make encoding detection (not
> guessing, but detection) easy.

Many other frequently used formats also necessarily start with an ASCII
character and do not contain NULs, which is at least sufficient to reliably
detect UTF-16 and UTF-32.

>> If the encoding of a text stream were settable after it had been opened,
>> then it would be easy for anyone to implement whatever guessing algorithm
>> they needed, without having to write an encoding implementation or
>> include any other support for guessing in the I/O library itself.
>
> But this defeats the whole purpose of the PEP which is to accelerate
> the writing of quick and dirty text processing scripts.

That doesn't justify making the behaviour of those scripts "dirtier" than
necessary.

I think that the focus should be on solving a set of well-defined problems,
for which BOM detection can definitely help:

Suppose we have a system in which some of the files are in a potentially
non-Unicode 'system' encoding, and some are Unicode. The user of the system
needs a reliable way of marking the Unicode files so that the encoding of
*those* files can be distinguished.

In addition, a provider of portable software or documentation needs a way to
encode files for distribution that is independent of the system encoding,
since (before run-time) they don't know what encoding that will be on any
given system.

Use and detection of Byte Order Marks solves both of these problems.

You appear to be arguing for the common use of much more ambitious heuristic
guessing, which *cannot* be made reliable. I am not opposed to providing
support for such guessing in the Python standard library, but only if its
limitations are thoroughly documented, and only if it is not the default.
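To make the distinction between detection and guessing concrete, here is a minimal sketch in Python of the two detection schemes mentioned above: BOM detection with a fallback to the system encoding, and the NUL-pattern check for formats that must start with an ASCII character. The function names (`detect_bom`, `detect_by_nul_pattern`, `decode_text`) and the `latin-1` default are hypothetical, chosen only for illustration; nothing here is a proposed API for the I/O library.

```python
import codecs

# Ordered so that UTF-32 is tested before UTF-16: the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if data has no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def detect_by_nul_pattern(data):
    """Detect BOM-less UTF-16/UTF-32 from the position of NUL bytes.

    Assumption: the file format guarantees that the first character is
    ASCII and that the text contains no NUL characters. Returns None when
    the pattern does not match (for example, for an 8-bit encoding).
    """
    head = data[:4]

    def is_ascii(byte):
        return 0 < byte < 0x80

    if len(head) == 4:
        if is_ascii(head[0]) and head[1:] == b"\x00\x00\x00":
            return "utf-32-le"
        if head[:3] == b"\x00\x00\x00" and is_ascii(head[3]):
            return "utf-32-be"
    if len(head) >= 2:
        if is_ascii(head[0]) and head[1] == 0:
            return "utf-16-le"
        if head[0] == 0 and is_ascii(head[1]):
            return "utf-16-be"
    return None


def decode_text(data, system_encoding="latin-1"):
    """Decode using the BOM if present, else the (assumed) system encoding."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        encoding = system_encoding
    return data[skip:].decode(encoding)
```

By contrast, "does it decode without error" is a weak test for UTF-16: `b"abcd".decode("utf-16-le")` succeeds and yields two CJK characters, even though the bytes are plain ASCII text.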
--
David Hopwood <[EMAIL PROTECTED]>
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
