Thierry wrote:
Hello fellow pythonists,

I'm a relatively new Python developer, and I am trying to adjust my
understanding of "how things work" to Python, but I have hit a
block that I cannot understand.
I needed to output unicode data returned by a web service, and could
not get unicode/multibyte text back before applying a hack that I
don't understand (thank you, Google).

I have written a simple wxPython application that takes the user's
input, sends it to a web service, and gets back translations in
several languages.
The service itself is fully UTF-8.

The "source" string is first encoded to "latin1" after a pass through
unicodedata.normalize(), as urllib.quote() cannot work on unicode:
srcText=unicodedata.normalize('NFKD',srcText).encode('latin1','ignore')

urllib.quote() operates on byte strings. If your web service is UTF-8,
it would make sense to use UTF-8 as the input encoding, not latin1,
wouldn't it? unicodeinput.encode("utf-8")
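
A minimal sketch of that suggestion, not the original poster's code: the sample string is made up, and the try/except import is only there so the snippet also runs on Python 3, where quote() lives in urllib.parse. Note that encode('latin1','ignore') silently drops every character latin1 cannot represent, whereas UTF-8 can encode all of them.

```python
try:
    from urllib import quote          # Python 2, as used in this thread
except ImportError:
    from urllib.parse import quote    # Python 3 location of the same function

# Hypothetical user input, held as a unicode object.
src_text = u'caf\xe9'

# Encode to the service's encoding (UTF-8) first, then percent-quote the bytes.
encoded = src_text.encode('utf-8')
quoted = quote(encoded)
```

The quoted value is plain ASCII and can be placed in the request URL or form data.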

After that, an urllib request is sent with this encoded string to the
web service
con=urllib2.Request(self.url,
    headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'},
    origin_req_host='http://translate.google.com')

req=urllib2.urlopen(con)

First problem: how do I determine the encoding of the response?

It is sent as part of the headers. e.g. content-type: text/html; charset=utf-8
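
As a rough illustration, the charset can be pulled out of such a header value with the standard email package. The helper name charset_from_content_type and the UTF-8 fallback are assumptions made for this sketch; the header value itself would come from the response's headers.

```python
from email.message import Message

def charset_from_content_type(header_value, default='utf-8'):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns the charset lowercased, or the fallback.
    return msg.get_content_charset(default)

charset = charset_from_content_type('text/html; charset=utf-8')
```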

If I inspect a request from Firefox, I see that the server's response
header specifies UTF-8.
But if I use this code:
ret=U''
for line in req:
 ret=ret+string.replace(line.strip(),'\n',chr(10))
I end up with a UnicodeDecodeError. I tried various line.decode(),
line.normalize and such, but could not make this error disappear.
Until now I have avoided the problem, as the service always seems to
return one line, but I am wondering.

The web server's answer is an encoded byte stream too (usually UTF-8,
but you can check the headers), so

line.decode("utf-8") should give you unicode to operate on (always
do string operations on the canonical form).
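
A small sketch of the loop with the decode step added. The list raw_lines is a made-up stand-in for iterating over the response object, and UTF-8 is assumed as the response encoding.

```python
# Stand-in for the lines read from the response: raw UTF-8 bytes.
raw_lines = [
    b'Bonjour \xc3\xa0 tous\n',
    b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\n',
]

ret = u''
for line in raw_lines:
    # Decode the bytes first; only then do string operations on the result.
    ret += line.decode('utf-8').strip()
```

Because every piece is decoded before it is concatenated with the unicode string ret, no implicit ASCII conversion takes place.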

Second problem: if I put a
print line
inside the loop, I get the same error too. I thought that unicode() would
force python to treat the given text as unicode, not to try to
convert it to unicode.

But that is what it does. Basically unicode() is a constructor for
unicode objects.
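
To illustrate: in Python 2, unicode(some_bytes) without an encoding argument decodes with the default ASCII codec, which is the likely source of the error above. The sketch below spells that decode out explicitly so it behaves the same on Python 2 and 3; the sample bytes are made up.

```python
data = b'caf\xc3\xa9'   # UTF-8 bytes containing a non-ASCII character

# Decoding as ASCII fails on the first byte outside the 7-bit range.
try:
    data.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the real encoding succeeds.
text = data.decode('utf-8')
```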

Here again, trying several normalize/decode combinations did not help
at all.

It's not too complicated; you just need to keep unicode and byte strings
separate and draw a clean line between the two. (The line is decode() and encode().)
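
One way to picture that line, as a sketch only: decode at the point where bytes enter the program, work on unicode in between, and encode at the point where they leave. The function name shout and the upper-casing step are hypothetical placeholders for real processing.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical example: bytes in, bytes out, unicode in between."""
    text = raw_bytes.decode(encoding)   # input boundary: bytes -> unicode
    text = text.upper()                 # all string work happens on unicode
    return text.encode(encoding)        # output boundary: unicode -> bytes

result = shout(b'caf\xc3\xa9')
```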

Then, looking for help through google, I have found this post:
http://mail.python.org/pipermail/python-list/2007-October/462977.html
and I gave it a try. What I did, though, was not to override
sys.stdout, but to declare a new writer stream as a property of my
main class:
self.out=OutStreamEncoder(sys.stdout, 'utf-8')


This is fancy but not needed if you take care like above.

HTH
Tino
