Hello fellow pythonists, I'm a relatively new Python developer, and I am trying to adapt my understanding of "how things work" to Python, but I have hit a block that I cannot understand. I needed to output Unicode data returned by a web service, and could not get the Unicode/multibyte text back until I applied a hack that I don't understand (thank you, Google).
I have written a simple wxPython application that takes the user's input, sends it to a web service, and gets back translations in several languages. The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a pass through unicodedata.normalize(), as urllib.quote() cannot work on unicode:

>>srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')

After that, a urllib2 request is sent to the web service with this encoded string:

>>con=urllib2.Request(self.url,
>>    headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'},
>>    origin_req_host='http://translate.google.com')
>>req=urllib2.urlopen(con)

First problem: how do I determine the encoding of the response? If I inspect a request from Firefox, I see that the server's response header specifies UTF-8. But if I use this code:

>>ret=U''
>>for line in req:
>>    ret=ret+string.replace(line.strip(),'\n',chr(10))

I end up with a UnicodeDecodeError. I tried various decode() and normalize() calls on the line, but could not make this error disappear. Until now I have avoided the problem, as the service always seems to return a single line, but I am still wondering.

Second problem: if I put a

>>print line

inside the loop, I get the same error. I thought that unicode() would force Python to treat the given text as Unicode, not try to convert it to Unicode. Here again, trying several normalize/decode combinations did not help at all.

Then, looking for help through Google, I found this post:
http://mail.python.org/pipermail/python-list/2007-October/462977.html
and I gave it a try. What I did, though, was not to override sys.stdout, but to declare a new writer stream as a property of my main class:

>>self.out=OutStreamEncoder(sys.stdout, 'utf-8')

What is strange is that since I did that, even without using this self.out writer, the Unicode translations are working as I was expecting them to.
The exception is the for loop, where the concatenation still raises UnicodeDecodeError. I know the "explicit is better than implicit" Python motto, and I really like it. But here, I don't understand what is going on. Does defining that writer object somehow initialize the standard sys.stdout object? Is it related to some internal use of it, maybe in urllib? I tried to find more on the subject, but fell short. Can someone explain to me what is happening? The full script source can be found at http://www.webalis.com/translator/translator.pyw
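
For reference, the failing concatenation loop can be reduced to a small sketch that does not need the web service. It assumes the response body really is UTF-8, as the header seen in Firefox suggests. The name read_unicode and the sample bytes are made up for illustration and are not part of the original script. The only change from the loop in the post is that each line is decoded explicitly before it is added to the unicode accumulator.

```python
import io


def read_unicode(byte_lines, encoding='utf-8'):
    """Decode each byte line explicitly, then concatenate the results."""
    ret = u''
    for line in byte_lines:
        # Decode before concatenating: in Python 2, adding undecoded bytes
        # to a unicode string triggers an implicit ASCII decode, which
        # fails on any non-ASCII byte.
        ret = ret + line.decode(encoding).strip()
    return ret


# Hypothetical stand-in for the urllib2 response: an in-memory byte stream
# holding two UTF-8 encoded lines.
fake_response = io.BytesIO(u'caf\u00e9\n\u65e5\u672c\u8a9e\n'.encode('utf-8'))
text = read_unicode(fake_response)
```

With the real response object, the same function would presumably be called as read_unicode(req), passing whatever charset the server actually declares instead of the assumed 'utf-8' default.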