Robin Becker wrote:

> I'm in the process of attempting a straightforward port of a relatively
> simple package which does most of its work by writing out files with a
> more or less complicated set of possible encodings. So far I have used
> all the 2to3 tools and a lot of effort, but still don't have a working
> version. This must be the worst way to convert people to unicode. When
> tcl went through this they chose the eminently sensible route of not
> choosing a separate unicode type (they used utf8 byte strings instead).
> Not only has python chosen to burden itself with two string types, but
> with 3 they've swapped roles. This is certainly the first time I've had
> to decide on an encoding before writing simple text to a file.

Which is the EXACT RIGHT THING TO DO! See below.
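For illustration, a minimal sketch of what that decision looks like in
Py3K (the file name and contents are made up):

    # Py3K: state the encoding once, when opening the file, then write
    # plain (unicode) strings - no per-call encoding gymnastics needed.
    with open("report.txt", "w", encoding="utf-8") as f:
        f.write("Grüße from Py3K\n")

One decision at the boundary, and everything inside the program stays
unicode.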
> Of course we may end up with a better language, but it will be a
> worse(more complex) tool for many simple tasks. Using a complex writing
> with many glyphs costs effort no matter how you do it, but I just use
> ascii :( and it's still an effort.
>
> I find the differences in C/OS less hard to understand than why I need
> bytes(x,'encoding') everywhere I just used to use str(x).

If you google my name + unicode, you'll see that I'm often answering
questions regarding unicode. I wouldn't say I'm a recognized expert on
the subject, but I certainly do know enough to deal with it whenever I
encounter it. And from my experience with the problems in general, and
specifically in Python, as well as from trying to help others, I can say
that:

- 95% of the time, the problem is in front of the keyboard.

- Programmers stubbornly refuse to *learn* what unicode is, what an
  encoding is, and what role utf-8 plays. Instead, they resort to a
  voodoo approach of throwing in various encode/decode calls + a good
  deal of cat's feces, in the hope of wriggling themselves out of the
  problem.

- It is NOT sensible to use utf-8 as a unicode "type" - that is as bad
  as it can get, because you don't see the errors; you just mangle your
  data and end up with a byte-string mess. If that is your road to
  heaven, by all means choose it - and don't use unicode at all. And be
  prepared for damnation :)

If your programs worked until now, but don't anymore because Py3K
introduces mandatory unicode objects for string literals, it pretty much
follows that they only *seemed* to work, and very, very probably fail in
the face of actual i18nized data.

The *only* sensible thing to do is to follow these simple rules - they
already apply with Python 2.x, and will be enforced by 3K, which is a
good thing IMHO (two short sketches follow after the list):

- When you read data from somewhere, make sure you know which encoding
  it has, and *immediately* convert it to unicode.

- When you write data, make sure you know which encoding you want it to
  have (when in doubt, choose utf-8 to prevent loss of data), and apply
  it.

- XML-parsers take byte-strings & spit out unicode. Period.
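A sketch of the first two rules, with made-up file names and encodings -
decode at the boundary where data comes in, work on unicode inside,
encode at the boundary where data goes out:

    # Reading: the bytes on disk are, say, latin-1 - decode immediately.
    with open("legacy.txt", "rb") as f:
        text = f.read().decode("latin-1")   # unicode from here on

    # ... all processing happens on unicode strings ...

    # Writing: pick an encoding (utf-8 loses nothing) and apply it.
    with open("converted.txt", "wb") as f:
        f.write(text.encode("utf-8"))

This works unchanged on 2.x and Py3K, which is exactly the point: the
decode/encode calls sit at the edges, not sprinkled through the program.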
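And the third rule, sketched with the stdlib's xml.etree.ElementTree
(the document is made up; any conforming parser behaves the same way):

    import xml.etree.ElementTree as ET

    # The parser gets bytes and reads the encoding declaration itself...
    blob = b'<?xml version="1.0" encoding="iso-8859-1"?><g>Gr\xfc\xdfe</g>'
    root = ET.fromstring(blob)

    # ...and what comes out the other end is unicode.
    print(root.text)   # -> Grüße

I neither want to imply that you are an idiot, nor that unicode doesn't
have its complexities. And I'd love to say that Python wouldn't add to
these by having two string types. But the *real* problem is that it used
to have only bytestrings, and finally Py3K will solve that issue.

Diez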