On 9/11/06, David Hopwood <[EMAIL PROTECTED]> wrote:
> > I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> > it probably is that encoding.
>
> That is quite false for UTF-16, at least. It is also false for short UTF-8
> files.
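The two checks mentioned below (BOM presence and surrogate misuse) can be sketched in a few lines of Python. This is only an illustrative heuristic, not part of the PEP; the function name is mine, and it relies on the fact that Python's strict 'utf-16' codec rejects unpaired surrogates:

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic sketch: true UTF-16 (as opposed to UTF-16BE/UTF-16LE)
    files start with a BOM, and misused (unpaired) surrogates make the
    strict decoder fail."""
    # A file in the plain 'utf-16' encoding should begin with a BOM.
    if not data.startswith((b'\xff\xfe', b'\xfe\xff')):
        return False
    try:
        # Strict decoding rejects lone or misordered surrogates.
        data.decode('utf-16')
        return True
    except UnicodeDecodeError:
        return False
```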
True UTF-16 (as opposed to UTF-16BE/UTF-16LE) files have a BOM. Also, you
can recognize incorrect ones through misuse of surrogates.

> > I don't see how it matters whether the
> > file represents Latex or an .htaccess file. XML is a special case
> > because it is specially designed to make encoding detection (not
> > guessing, but detection) easy.
>
> Many other frequently used formats also necessarily start with an ASCII
> character and do not contain NULs, which is at least sufficient to reliably
> detect UTF-16 and UTF-32.

Yes, but those are the two easiest ones.

> > But this defeats the whole purpose of the PEP, which is to accelerate
> > the writing of quick and dirty text processing scripts.
>
> That doesn't justify making the behaviour of those scripts "dirtier" than
> necessary.
>
> I think that the focus should be on solving a set of well-defined problems,
> for which BOM detection can definitely help:
>
> Suppose we have a system in which some of the files are in a potentially
> non-Unicode 'system' encoding, and some are Unicode. The user of the system
> needs a reliable way of marking the Unicode files so that the encoding of
> *those* files can be distinguished.

If the user understands the problem and is willing to go to this level of
effort, then they are not the target user of the feature.

> ... In addition, a provider of portable
> software or documentation needs a way to encode files for distribution that
> is independent of the system encoding, since (before run-time) they don't
> know what encoding that will be on any given system. Use and detection of
> Byte Order Marks solves both of these problems.

Sure, that's great.

> You appear to be arguing for the common use of much more ambitious heuristic
> guessing, which *cannot* be made reliable.

First, the word "guess" necessarily implies unreliability.
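David's point about formats that necessarily start with an ASCII character and contain no NULs can be made concrete. This is a sketch of the NUL-pattern detection idea (similar in spirit to XML's encoding detection appendix), assuming BOM handling is done separately; the function name and return convention are mine:

```python
def guess_wide_encoding(head: bytes):
    """For a BOM-less file known to begin with an ASCII character and
    contain no NULs, the NUL pattern of the first four bytes reveals
    UTF-32 or UTF-16. Returns a codec name, or None when the data
    looks like an 8-bit (or UTF-8) encoding."""
    if len(head) < 4:
        return None
    b = head[:4]
    if b[0] and not b[1] and not b[2] and not b[3]:
        return 'utf-32-le'   # XX 00 00 00
    if not b[0] and not b[1] and not b[2] and b[3]:
        return 'utf-32-be'   # 00 00 00 XX
    if b[0] and not b[1]:
        return 'utf-16-le'   # XX 00 ...
    if not b[0] and b[1]:
        return 'utf-16-be'   # 00 XX ...
    return None              # no embedded NULs: 8-bit or UTF-8
```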
Guido started this whole chain of discussion when he said:

"(Auto-detection from sniffing the data is a perfectly valid answer BTW --
I see no reason why that couldn't be one option, as long as there's a way
to disable it.)"

> ... I am not opposed to providing
> support for such guessing in the Python standard library, but only if its
> limitations are thoroughly documented, and only if it is not the default.

Those are both characteristics of the proposal that started this thread, so
what are we arguing about?

Since writing the PEP, I've noticed that the strategy of trying to decode as
UTF-* and falling back to an 8-bit character set is actually pretty common in
text editors, which means Python's behaviour here can closely match that of
text editors. This was the key requirement Guido gave me in an off-list email
for the guessing mode.

VIM:

"fileencodings: This is a list of character encodings considered when
starting to edit a file. When a file is read, Vim tries to use the first
mentioned character encoding. If an error is detected, the next one in the
list is tried. When an encoding is found that works, 'fileencoding' is set
to it."

Reading the docs, one can infer that this feature is specifically designed to
support UTF-8 sniffing. I would guess that the default configuration has it
do UTF-8 sniffing.

BBEdit:

"If the file contains no other cues to indicate its text encoding, and its
contents appear to be valid UTF-8, BBEdit will open it as UTF-8 (No BOM)
without recourse to the preferences option."

Paul Prescod

_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000
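The editor strategy described above (try UTF-8 first, fall back to an 8-bit character set) can be sketched as a few lines of Python. The function name and the choice of 'latin-1' as the 8-bit fallback are assumptions for illustration, not part of the PEP:

```python
def sniff_text(data: bytes, fallback: str = 'latin-1'):
    """Editor-style sniffing sketch, like Vim's 'fileencodings':
    try encodings in order and keep the first that decodes cleanly.
    Misdecoding non-trivial data as UTF-8 is rare, so UTF-8 goes
    first; an 8-bit fallback such as latin-1 can never fail."""
    for enc in ('utf-8', fallback):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Unreachable with a single-byte fallback, but kept for safety.
    return data.decode(fallback, errors='replace'), fallback
```

Note that, exactly as David warns, the fallback step is a guess: any byte sequence decodes as latin-1, so the result is only as good as the assumption behind the fallback choice.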
