Better encoding support for python 2

Mihai Ibanescu Tue, 10 Jul 2018 12:33:29 -0700

Hi,

We have run across an older deb file:


http://ubuntu-master.mirror.tudos.de/ubuntu/pool/universe/a/aspell-is/aspell-is_0.51-0-4_all.deb

One of its files, usr/lib/aspell/íslenska.alias, has a name that is not
UTF-8-encoded in the control file.

This exposed what I think is a bug in deb822.Deb822: in Python 2, I cannot
load a mapping (dictionary) in one encoding and dump it in a different
encoding. This works fine in Python 3. The difference is that keys are
stored internally as str in both PY2 and PY3, but str means different
things. In PY3, str is unicode text, so the original encoding is irrelevant.
In PY2, str is a byte string (bytes in PY3 parlance), so the original
encoding is relevant.
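
As a rough illustration of why the source encoding matters for byte strings,
here is a minimal sketch that transcodes the single byte for í from
iso-8859-1 to utf-8. The helper name transcode is made up for this example
and is not part of python-debian.

```python
# -*- coding: utf-8 -*-

def transcode(raw, source_encoding, target_encoding):
    """Hypothetical helper: decode bytes, then re-encode them."""
    text = raw.decode(source_encoding)    # bytes -> unicode text
    return text.encode(target_encoding)   # unicode text -> bytes


# The letter í is one byte in iso-8859-1 ...
latin1_key = b'\xed'

# ... and two bytes once it has been re-encoded as utf-8.
utf8_key = transcode(latin1_key, 'iso-8859-1', 'utf-8')
print(repr(utf8_key))
```

Without knowing that the byte came from iso-8859-1, there is no way to pick
the right decode step, which is why the PY2 case needs the original encoding.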

To simplify the problem, I will only use the first offending letter of the
file that has problems, í (\xed in iso-8859-1). Here is my test script:


from debian import deb822

obj = deb822.Deb822({'\xed': 'i'}, encoding='iso-8859-1')
print(obj.dump(encoding='utf-8'))


Running it in python3:

python3.6 test.py
í: i

Running it in python 2.7:

python2.7 test.py
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 0:
ordinal not in range(128)
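
One plausible reading of this traceback: Python 2 implicitly decodes a byte
string as ASCII when it is mixed with unicode in a '%' format. The sketch
below writes that implicit step out explicitly so it can be run on either
Python version. The function name format_like_py2 is invented for this
example, and the assumption that the value is already unicode at that point
is mine, not something shown in the traceback.

```python
key = b'\xed'   # byte-string key, as Python 2 would hold it
value = u'i'    # text value


def format_like_py2(key, value):
    """Hypothetical stand-in for the '%' formatting of a Deb822 entry."""
    # Python 2 performs this ASCII decode implicitly when a byte string is
    # combined with unicode; it is spelled out here for clarity.
    return u'%s: %s\n' % (key.decode('ascii'), value)


try:
    format_like_py2(key, value)
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```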

Another bug in PY2 is related to the implementation of __str__: it should
return a str (byte string) object, but it returns the result of self.dump(),
which is unicode.

The attached patch fixes both of those problems.
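
The key-handling part of the idea can be sketched on its own, outside of
deb822. This is only an approximation under my own naming: key_as_text and
format_entry are hypothetical helpers, not functions in python-debian, and
the sketch uses the built-in bytes type where the patch uses six.

```python
# -*- coding: utf-8 -*-

def key_as_text(key, encoding):
    """Hypothetical helper: return key as text, decoding bytes if needed."""
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


def format_entry(key, value, encoding='utf-8'):
    """Hypothetical helper: format one 'Field: value' line as text."""
    key = key_as_text(key, encoding)
    if not value or value[0] == '\n':
        # No space after the colon for empty or multi-line values.
        return u'%s:%s\n' % (key, value)
    return u'%s: %s\n' % (key, value)


print(format_entry(b'\xed', u'i', encoding='iso-8859-1'))
```

Because the key is decoded with the encoding it was loaded in, the formatted
entry is plain text and can then be encoded in whatever encoding dump() was
asked for.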

I will be happy to write a test but I wanted to get some feedback about the
correctness of the patch first.

There are also a lot of unreleased patches in git, and it would be nice if
they were tagged as a release.

If there is a process I need to follow in order to submit the patch (e.g.
fork a repo, sign a contributor agreement, etc.), please let me know and I
will do that too.

Thanks!
Mihai

diff --git a/lib/debian/deb822.py b/lib/debian/deb822.py
index 79e6842..9d1234f 100644
--- a/lib/debian/deb822.py
+++ b/lib/debian/deb822.py
@@ -682,12 +682,15 @@ class Deb822(Deb822Dict):
             self[curkey] = content
 
     def __str__(self):
+        if six.PY2:
+            # self.dump() returns unicode
+            return self.dump().encode(self.encoding)
         return self.dump()
 
     def __unicode__(self):
         return self.dump()
 
-    if sys.version >= '3':
+    if six.PY3:
         def __bytes__(self):
             return self.dump().encode(self.encoding)
 
@@ -742,14 +745,18 @@ class Deb822(Deb822Dict):
 
         for key in self:
             value = self.get_as_string(key)
+            keyenc = key
+            if isinstance(keyenc, six.binary_type):
+                # Convert the key into unicode
+                keyenc = key.decode(self.encoding)
             if not value or value[0] == '\n':
                 # Avoid trailing whitespace after "Field:" if it's on its own
                 # line or the value is empty.  We don't have to worry about the
                 # case where value == '\n', since we ensure that is not the
                 # case in __setitem__.
-                entry = '%s:%s\n' % (key, value)
+                entry = '%s:%s\n' % (keyenc, value)
             else:
-                entry = '%s: %s\n' % (key, value)
+                entry = '%s: %s\n' % (keyenc, value)
             if not return_string and not text_mode:
                 fd.write(entry.encode(encoding))
             else:

-- 
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/pkg-python-debian-maint
