Bug#833375: python3-debian: Cannot parse input of bytes under Python 3

Stuart Prescott Mon, 26 Dec 2016 00:07:08 -0800

Control: tags -1 + patch


Hi Chris!

I'm very suspicious of automagical decoding from bytes to strings -- it is 
asking for trouble. These are exactly the sorts of places were we make 
mojibake or choke by double-decoding. Then again, we want python-debian to be 
as useful as possible to people and I might have a workable plan.

Adding an extra acceptable input to `iter_paragraphs` actually looks easier 
than where you have ended up; the attached patch seems to work OK. I'd 
appreciate it if you could test it to confirm that it does indeed solve your 
use case. (python-debian comaintainers, I'd appreciate an ack from you too)

To the example code in question:

>     with open('Sources', 'rb') as f:
>         content = f.read()

this has explicitly opened the file as a *binary* file, there is no specified 
encoding and then the data is given to a *text* processor. Text parsers want 
text not bytes. Binary→text can't happen automatically without assuming an 
encoding. That it (accidentally) works while being imprecise about encodings 
with Python 2 is exactly the sort of thing that the str/unicode v str/bytes 
changes between Python 2 and 3 was supposed to catch -- as it did in the 
example.

I have a definite preference for being explicit and decoding/encoding at the 
program boundary with the caller dealing with the requested encoding (i.e. 
using `open(…, encoding=…)` in Python 3 or `codecs` in Python 2). 

>     for src in Sources.iter_paragraphs(content):
>          print((src['Package'], src['Version']))

Now for me, the redemption is that `iter_paragraphs` is already a bit of a 
DWIM interface that accepts a scary number of different things *and* it has an 
`encoding` keyword argument with a default value of 'utf-8'... having a 
specified encoding available enables automatic conversion.

As an aside, is there a real need in the example code to force the use of 
deb822's internal 822 parser rather than using the much faster one from apt's 
TagFile? The example snippet is clearly part of something bigger and perhaps 
you need the content separately -- you may find that passing the fd straight to 
iter_paragraphs is worthwhile.

cheers
Stuart


-- 
Stuart Prescott    http://www.nanonanonano.net/   [email protected]
Debian Developer   http://www.debian.org/         [email protected]
GPG fingerprint    90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7

>From b5b8bd12b503f07640e9cacc9eb245a1d6fea469 Mon Sep 17 00:00:00 2001
From: Stuart Prescott <[email protected]>
Date: Mon, 26 Dec 2016 16:55:03 +1100
Subject: [PATCH] Allow iter_paragraphs to accept bytes

* Add test case for #833375
* Split the provided sequence at newlines if it is str or bytes in Python 3
  (this is a no-op in Python 2); the encoding is explicitly passed by the
  caller to iter_paragraphs with a default of utf-8 so this automatic
  decoding isn't as unsafe as it might be.

Closes: #833375
---
 lib/debian/deb822.py |  2 +-
 tests/test_deb822.py | 10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/debian/deb822.py b/lib/debian/deb822.py
index 8de03dc..6a9d21c 100644
--- a/lib/debian/deb822.py
+++ b/lib/debian/deb822.py
@@ -381,7 +381,7 @@ class Deb822(Deb822Dict):
                     yield paragraph
 
         else:
-            if isinstance(sequence, six.string_types):
+            if isinstance(sequence, six.string_types + (six.binary_type,)):
                 sequence = sequence.splitlines()
             iterable = iter(sequence)
             while True:
diff --git a/tests/test_deb822.py b/tests/test_deb822.py
index 2bf17af..2d76326 100755
--- a/tests/test_deb822.py
+++ b/tests/test_deb822.py
@@ -465,6 +465,16 @@ class TestDeb822(unittest.TestCase):
                             for d in deb822.Deb822.iter_paragraphs(text)])
             self.assertEqual(2, count)
 
+    def test_iter_paragraphs_bytes(self):
+        text = (UNPARSED_PACKAGE + '\n\n\n' + UNPARSED_PACKAGE)
+        if six.PY2:
+            binary = text
+        else:
+            binary = text.encode('utf-8')
+
+        for d in deb822.Deb822.iter_paragraphs(binary):
+            self.assertWellParsed(d, PARSED_PACKAGE)
+
     def test_iter_paragraphs_with_extra_whitespace(self):
         """ Paragraphs not elided when stray whitespace is between
 
-- 
2.1.4

-- 
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-python-debian-maint

Bug#833375: python3-debian: Cannot parse input of bytes under Python 3

Reply via email to