#2116: Application layer protocol for transferring RPC messages + utf8 decoding
error
-------------------+--------------------------------------------------------
Reporter: bro | Owner:
Type: patch | Status: new
Priority: major | Milestone: Future
Component: other | Version: 1.3.5
Keywords: |
-------------------+--------------------------------------------------------
Comment(by andar):
Great analysis work.
So it looks like we have two problems: an issue with the RPC messaging and
an issue with rencode. It looks like you've solved the first one and your
reasoning makes sense, so I'll work on getting this applied to master.
I've taken a look at the second problem involving rencode and I at least
understand why it's happening, but I'm still not sure how to go about
fixing it. With rencode we expect all byte strings to be either utf8 or
ascii encoded (ascii is a subset of utf8, which is why it works). If a
unicode object is passed into rencode, it is first encoded into a utf8
bytestring. When decoding a string, rencode attempts to decode it as utf8
so that it returns a unicode object.
{{{
>>> data = u"\xe5"
>>> print data
å
>>> rencode.dumps(data)
'\x82\xc3\xa5'
>>> rencode.loads(rencode.dumps(data))
u'\xe5'
}}}
{{{
>>> data = "foo"
>>> rencode.dumps(data)
'\x83foo'
>>> rencode.loads(rencode.dumps(data))
u'foo'
}}}
When the string is passed in as a unicode object, rencode behaves as
expected, returning a unicode object when doing a loads().
The problem arises when you pass in a string that is neither ascii nor
utf8 encoded and then try to loads() it from rencode.
{{{
>>> data = "\xe5"
>>> rencode.dumps(data)
'\x81\xe5'
>>> rencode.loads(rencode.dumps(data))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "rencode.pyx", line 498, in rencode._rencode.loads (rencode/rencode.c:5439)
  File "rencode.pyx", line 466, in rencode._rencode.decode (rencode/rencode.c:5131)
  File "rencode.pyx", line 386, in rencode._rencode.decode_fixed_str (rencode/rencode.c:4159)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
}}}
The dumps() works just fine because rencode does nothing to bytestrings
passed in as it expects these to be proper ascii or utf8 encoded strings,
but when it comes to the loads() rencode will attempt to decode the
bytestring as utf8 to produce a unicode object.
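To make the asymmetry easier to see without rencode installed, here is a rough model of the behaviour described above. It is only a sketch, not rencode's actual code: the names `sketch_dumps` and `sketch_loads` are made up, and the "prefix byte = 0x80 + length" rule is an assumption read off the outputs shown above, so it only covers a single short string. It uses explicit `b""` and `u""` literals so that it should behave the same on Python 2.7 and Python 3.

```python
TEXT = type(u"")        # unicode on Python 2, str on Python 3
FIXED_STR_START = 0x80  # assumption: prefix byte is 0x80 + length of the string


def sketch_dumps(obj):
    """Model dumps() for one short string.

    Unicode objects are encoded to utf8; bytestrings pass through untouched,
    whatever their encoding happens to be.
    """
    raw = obj.encode("utf-8") if isinstance(obj, TEXT) else obj
    if len(raw) >= 64:
        raise ValueError("this sketch only models short strings")
    return bytes(bytearray([FIXED_STR_START + len(raw)])) + raw


def sketch_loads(data):
    """Model loads() for one short string: always decode the payload as utf8."""
    length = bytearray(data)[0] - FIXED_STR_START
    raw = data[1:1 + length]
    return raw.decode("utf-8")  # raises UnicodeDecodeError on non-utf8 bytes


print(repr(sketch_dumps(u"\xe5")))   # utf8-encoded before writing
print(repr(sketch_dumps(b"\xe5")))   # written as-is, no validation
try:
    sketch_loads(sketch_dumps(b"\xe5"))
except UnicodeDecodeError as exc:
    print("loads failed: %s" % exc)
```

The point of the sketch is that nothing on the dumps() side checks the bytes, so the error only surfaces on whichever side does the loads().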
A couple of options to fix this:
* Don't. Simply enforce that we always use proper utf8 strings and fix
  the source of these malformed strings.
* Allow the use of differently encoded bytestrings by not attempting a
  utf8 decode during the loads(). This means that if you pass a unicode
  object into rencode.dumps(), you will not get a unicode object out on
  the subsequent loads() but rather a bytestring in an unknown encoding.
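As a rough illustration of what each option would mean for callers, here is a small sketch. None of this is existing Deluge or rencode code: `ensure_utf8` and `roundtrip_without_decode` are hypothetical names, and the latin-1 fallback is purely an assumption for the example, since we don't actually know what encoding the malformed strings are in.

```python
TEXT = type(u"")  # unicode on Python 2, str on Python 3


def ensure_utf8(raw, source_encoding="latin-1"):
    """Option 1 (hypothetical helper): clean up a bytestring at its source.

    Valid utf8 (including plain ascii) is returned unchanged. Anything else
    is assumed to be in source_encoding and is re-encoded as utf8.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(source_encoding).encode("utf-8")


def roundtrip_without_decode(obj):
    """Option 2 (hypothetical): what a dumps()/loads() round trip would hand
    back if loads() stopped decoding. Unicode is still encoded to utf8 on the
    way in, but only raw bytes come out."""
    return obj.encode("utf-8") if isinstance(obj, TEXT) else obj


# Option 1: the malformed byte is repaired before it ever reaches dumps().
print(repr(ensure_utf8(b"\xe5")))

# Option 2: the bad bytestring survives, but unicode no longer round-trips.
print(repr(roundtrip_without_decode(b"\xe5")))
print(repr(roundtrip_without_decode(u"\xe5")))
```

One caveat with the first helper: a bytestring in some other encoding that happens to also be valid utf8 would pass through unchanged, so it is a safety net rather than a substitute for fixing the source.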
I'm really not sure if the latter option would have any effect on Deluge
or not. Quite frankly our handling of string encodings across the board
is pretty messed up so it's a bit scary changing something as fundamental
as this. That being said, if you have tried using pickle and it has
worked for you, this may turn out ok for us as well, as I assume this is
how pickle approaches encoded bytestrings. On the other hand, I kind of
like the idea of enforcing the use of utf8 within Deluge as we really
shouldn't be using any other encoding for anything.
Anybody have any thoughts on this? I suppose we could do some tests and
see if removing the string decoding has any real effects.
--
Ticket URL: <http://dev.deluge-torrent.org/ticket/2116#comment:4>
Deluge <http://deluge-torrent.org/>
Deluge project