Thierry wrote:
Hello fellow Pythonists,

I'm a relatively new Python developer, and I am trying to adjust my understanding of "how things work" to Python, but I have hit a block that I cannot understand. I needed to output Unicode data returned from a web service, and could not get Unicode/multibyte text back before applying a hack that I don't understand (thank you, Google).

I have written a simple wxPython application that takes the user's input, sends it to a web service, and gets back translations in several languages. The service itself is fully UTF-8. The "source" string is first encoded to "latin1" after a pass through unicodedata.normalize(), as urllib.quote() cannot work on unicode:

    srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')
urllib.quote() operates on byte strings. If your web service is UTF-8, it would make sense to use UTF-8 as the input encoding, not latin1, wouldn't it?

    unicodeinput.encode("utf-8")
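A minimal sketch of that suggestion, assuming the service expects percent-encoded UTF-8 in the URL. The sample text and the names src_text, utf8_bytes and quoted are made up for illustration, and the import fallback is only there so the same lines also run on Python 3, where quote() lives in urllib.parse:

```python
try:
    from urllib import quote        # Python 2, as used in this thread
except ImportError:
    from urllib.parse import quote  # Python 3 home of the same function

src_text = u"caf\u00e9 na\u00efve"     # hypothetical user input (unicode)
utf8_bytes = src_text.encode("utf-8")  # unicode -> UTF-8 byte string
quoted = quote(utf8_bytes)             # percent-encode the bytes for the URL
```

No normalize() or 'ignore' step is needed here, since UTF-8 can represent every character the user may type.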
After that, a urllib2 request is sent with this encoded string to the web service:

    con=urllib2.Request(self.url, headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}, origin_req_host='http://translate.google.com')
    req=urllib2.urlopen(con)

First problem: how do I determine the encoding of the response?
It is sent as part of the headers, e.g.

    content-type: text/html; charset=utf-8
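One way to pull the charset out of such a header value, sketched with the standard email package. The helper name charset_from_content_type and the "utf-8" fallback are assumptions for this example (the fallback only because the service in question is said to be UTF-8); the header string itself would come from the response object, e.g. via req.info():

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` (an assumption, see above) when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

charset = charset_from_content_type("text/html; charset=utf-8")
```

The result can then be passed straight to decode() instead of hard-coding the encoding.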
If I inspect a request from Firefox, I see that the server's response header specifies UTF-8. But if I use this code:

    ret=U''
    for line in req:
        ret=ret+string.replace(line.strip(),'\n',chr(10))

I end up with a UnicodeDecodeError. I tried various line.decode(), line.normalize() and such, but could not make this error disappear. Until now I have avoided the problem, as the service always seems to return one line, but I am wondering.
The web server's answer is an encoded byte stream too (usually UTF-8, but you can check the headers), so line.decode("utf-8") should give you unicode to operate on (always do string operations on the canonical form).
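The error in the loop comes from adding a byte string to a unicode object: Python 2 then tries to decode the bytes as ASCII behind your back. A small sketch of the loop with an explicit decode, where raw_lines is a made-up stand-in for iterating over the response and UTF-8 is assumed as the charset:

```python
# Stand-in for "for line in req": each item is a raw byte string.
raw_lines = [b"Gr\xc3\xbc\xc3\x9fe \n", b"aus K\xc3\xb6ln\n"]

parts = []
for line in raw_lines:
    text = line.decode("utf-8")   # bytes -> unicode first
    parts.append(text.strip())    # then do string operations on unicode

ret = u"\n".join(parts)           # unicode joined with unicode, no implicit decode
```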
Second problem: if I put a

    print line

into the loop, I get the same error too. I thought that unicode() would force Python to consider the given text as unicode, not to try to convert it to unicode.
But converting is exactly what it does. Basically, unicode() is a constructor for unicode objects.
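To illustrate: in Python 2, unicode(some_bytes) without an encoding argument decodes with the default encoding, which is normally ASCII, so any non-ASCII byte raises UnicodeDecodeError. The sketch below passes the encoding explicitly so the behaviour is visible; the sample bytes are made up, and the small shim at the top only exists so the lines also run on Python 3, where str takes over the role of unicode:

```python
try:
    unicode                  # Python 2 built-in
except NameError:
    unicode = str            # Python 3: str plays the same role

raw = b"caf\xc3\xa9"         # UTF-8 encoded bytes (hypothetical sample)

try:
    unicode(raw, "ascii")    # byte 0xc3 is not valid ASCII
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

text = unicode(raw, "utf-8") # same effect as raw.decode("utf-8")
```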
Here again, trying several normalize/decode combinations did not help at all.
It's not too complicated; you just need to keep unicode and byte strings separate and draw a clean line between the two (the line is decode() and encode()).
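Roughly, that line could look like the sketch below: decode once where bytes come in, work on unicode only, and encode once where bytes go out. The function name relay, the BytesIO sink (standing in for a byte-oriented output such as a terminal or socket) and the UTF-8 default are all assumptions for the example:

```python
import io

def relay(raw_line, out_stream, charset="utf-8"):
    """Decode at the input edge, process unicode, encode at the output edge."""
    text = raw_line.decode(charset)                  # bytes -> unicode
    text = text.strip()                              # string work on unicode only
    out_stream.write(text.encode(charset) + b"\n")   # unicode -> bytes
    return text

sink = io.BytesIO()                                  # stand-in for a byte output
result = relay(b"  \xc3\xa9t\xc3\xa9 \n", sink)
written = sink.getvalue()
```

With the encode() done explicitly at the point of output, nothing is left for Python to convert implicitly.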
Then, looking for help through Google, I found this post: http://mail.python.org/pipermail/python-list/2007-October/462977.html and gave it a try. What I did, though, was not to override sys.stdout, but to declare a new writer stream as a property of my main class:

    self.out=OutStreamEncoder(sys.stdout, 'utf-8')
This is fancy but not needed if you take care as described above.

HTH
Tino