Antoine Pitrou wrote:
> Hi,
>
> On Wednesday, 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
>> In any case, I believe that the above behavior is correct for the
>> context. Why? Because utf-8 has no endianness, its 'generic' decoding
>> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
>> 'utf-16-le' decoding spellings; two of which don't strip.
>
> Your opinion is probably valid from a theoretical point of view. You are
> more knowledgeable than me.
>
> My point was different: most programmers are not at your level (or
> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
> is supposed to be an abstract textual type that makes it easy to write
> unicode-friendly applications (isn't it?).
> Therefore it should hide the messy issue of superfluous BOMs, unwanted
> BOMs, etc. Telling the programmer to use a specific UTF-8 variant
> specialized in BOM-stripping will make eyes roll... "why doesn't the
> standard UTF-8 do it for me?"
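[For readers following along, the codec behavior being debated can be seen directly in current Python; the strings below are just illustrative examples:]

```python
# 'utf-8' leaves a leading BOM in the decoded text, while the
# 'utf-8-sig' variant strips it. The 'utf-16' codec consumes the BOM
# to detect endianness, but the endian-specific spellings
# 'utf-16-le' / 'utf-16-be' do not strip it.

data = '\ufeffhello'.encode('utf-8')  # UTF-8 bytes with a BOM prepended

print(repr(data.decode('utf-8')))      # '\ufeffhello' -- BOM survives
print(repr(data.decode('utf-8-sig')))  # 'hello'       -- BOM stripped

le = '\ufeffhello'.encode('utf-16-le')
print(repr(le.decode('utf-16')))       # 'hello'       -- BOM consumed
print(repr(le.decode('utf-16-le')))    # '\ufeffhello' -- BOM kept
```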
I've been reading this thread (and the ones that spawned it), and there's something about it that has been nagging at me for a while, which I am going to attempt to articulate.

The basic controversy centers on the various ways in which Python should deal with character encodings on various platforms, but my question is "for what use cases?" To my mind, asking "how should we handle character encoding?" without indicating what we want to use the characters *for* is a meaningless question.

From the standpoint of a programmer writing code to process file contents, there's really no such thing as a "text file" - there are only various text-based file formats. There are XML files, .ini files, email messages, and Python source code, all of which need to be processed differently. So when one asks "how do I handle text files?", my response is "there ain't no such thing" - and when one asks "well, ok, how do I handle text-based file formats?", my response is "well, it depends on the format".

Yes, there are some operations that can work on textual data regardless of file format (e.g. grep), but these generic operations are so basic and uninteresting that one generally doesn't need to write Python code to do them. And even in the case of simple Unix utilities such as 'cat', *some* a priori knowledge of the file's encoded meaning is required - you can't just concatenate two XML files and get anything meaningful or valid. Running 'sort' on Python source code is unlikely to increase shareholder value or otherwise hold back the tide of entropy.

Any given Python program that I write is going to know *something* about the format of the files that it is supposed to read/write, and the most important consideration is knowledge of what kinds of other programs are going to produce or consume that file.
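[The 'cat' point is easy to demonstrate; a minimal sketch, with made-up two-line documents:]

```python
import xml.etree.ElementTree as ET

a = "<doc><item>1</item></doc>"
b = "<doc><item>2</item></doc>"

ET.fromstring(a)  # each document parses fine on its own
ET.fromstring(b)

try:
    # Naive byte-level concatenation yields two root elements,
    # which is not a well-formed XML document.
    ET.fromstring(a + b)
except ET.ParseError as e:
    print("not well-formed:", e)
```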
If the file that I am working with conforms to a standard (so that the number of producer/consumer programs can be large without my having to know the specific details of each one), then I need to understand that standard and the constraints of what is legal within it.

For files with any kind of structure in them, common practice is not to treat them as streams of characters; rather, we generally have some abstraction layer that sits on top of the character stream and allows us to work with the structure directly. Thus, when dealing with XML one generally uses something like ElementTree, and in fact manipulating XML files as straight text is actively discouraged.

So my whole approach to the problem of reading and writing is to come up with a collection of APIs that reflect the common use patterns for the various popular file types. The benefit of doing this is that you don't waste time thinking about all of the file operations that don't apply to a particular file format. For example, using the ElementTree interface, I don't care whether the underlying file stream supports seek() or not - generally one doesn't seek into the middle of an XML document, so there's no need to support that feature. On the other hand, if one is reading a bdb file, one needs to seek to the location of a record in order to read it - but in such a case, the result of the seek operation is well-defined. I don't have to spend time discussing what will happen if I seek into the middle of an encoded multi-byte character, because with a bdb file, that can't happen.

It seems to me that a lot of the conundrums that have been discussed in this thread have to do with hypothetical use cases - "Well, what if I use operation X on a file of format Y, for which the result is undefined?" My answer is "Don't do that."
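[To make the ElementTree point concrete - a small sketch with an invented document; note how the code queries the structure and never touches seek(), readline(), or any character-stream operation:]

```python
import xml.etree.ElementTree as ET

# A hypothetical document, purely for illustration.
xml_text = """<library>
  <book id="1"><title>Dune</title></book>
  <book id="2"><title>Hyperion</title></book>
</library>"""

root = ET.fromstring(xml_text)
# Work with the structure directly; the underlying stream's
# capabilities (seekable or not) never enter the picture.
titles = [book.findtext("title") for book in root.findall("book")]
print(titles)  # ['Dune', 'Hyperion']
```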
-- 
Talin

_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
