Dear python-debian maintainers,

There have been two on-going sore points in deb822 handling:

* maintainers accidentally have whitespace in the blank line separating 
paragraphs in their d/control files. Policy discourages this but permits it and 
dpkg handles it ok so they don't notice until wrap-and-sort deletes entire 
packages from d/control on them.

* maintainers are permitted to use comments (# at the start of a line) within 
d/control but if they do so in the middle of a multi-line field, the field is 
then truncated and information would be lost by wrap-and-sort.

(I'm highlighting wrap-and-sort here, but other parsers of the deb822 format 
like autopkgtest , piuparts and various maintainer tools like 
debian/autodeps.py will also be affected)

We have two different deb822 parsers exposed within python-debian -- apt's 
TagFile has been used whenever possible (on real files only, not on other 
iterables), and an "internal" parser that just uses some regular expressions 
in python. These two sets of bugs come from apt's TagFile being a fast but 
strict parser that doesn't support these extensions because it was neither 
required nor desirable for it to do so. For most uses of deb822's 
iter_paragraphs such as wrap-and-sort, the speed improvement of using TagFile 
isn't needed and the strictness that is imposed by TagFile is unhelpful.

My proposal is then:

* ensure the internal parser supports whitespace and comments as required

* switch the default parser in iter_paragraphs to the internal parser 
(changing the kwarg to be use_apt_pkg=False) and clearly document the 
situations where an application might safely choose to use TagFile 
(use_apt_pkg=True).

* provide convenience classmethods for in the Packages and Sources classes 
that do default to using TagFile since these files should be syntactically 
strict and performance is important here.

The potential downside of this is that code that uses Deb822.iter_paragraphs 
on big files and has not already explicitly chosen use_apt_pkg=True will suffer 
a performance hit (although it will still work). The only example of this in 
the archive that I can see is in apt-xapian-index:

        deb822.Deb822.iter_paragraphs(open("/var/lib/debtags/vocabulary", "r")):

(I'll cheerfully send a patch for this)

Code that is already using Packages.iter_paragraphs or Sources.iter_paragraphs 
will be unaffected (this works because iter_paragraphs is an inherited 
classmethod but I can't see any examples of it in the archive). 

On the upside, the two bugs (plus their many dupes) python-debian has 
inherited from wrap-and-sort will be fixed.

The two patches are attached or, if you'd rather see the context, 

  git clone ssh://git.nanonanonano.net/srv/git/python-debian.git -b paragraphs

(if I update the patches based on comments, I can't promise not to rewrite 
history on that branch!)


I'd very much like some feedback on these proposed changes rather than just 
ending in a consensual silence ;)

cheers
Stuart


-- 
Stuart Prescott    http://www.nanonanonano.net/   [email protected]
Debian Developer   http://www.debian.org/         [email protected]
GPG fingerprint    90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
From a74abceb1c0b681ba90062254e490be3686512cd Mon Sep 17 00:00:00 2001
From: Stuart Prescott <[email protected]>
Date: Wed, 4 Jun 2014 00:49:01 +1000
Subject: [PATCH 1/4] Allow whitespace-only lines to separate paragraphs

iter_paragraphs should split paragraphs on whitespace-only lines.

Paragraphs should be separated by a blank line but policy permits parsers to
permit whitespace only lines too.

Closes #715558.
---
 lib/debian/deb822.py |  4 +++-
 tests/test_deb822.py | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/lib/debian/deb822.py b/lib/debian/deb822.py
index 3368aaf..4eb4e40 100644
--- a/lib/debian/deb822.py
+++ b/lib/debian/deb822.py
@@ -627,7 +627,9 @@ class Deb822(Deb822Dict):
         gpg_post_lines = []
         state = b'SAFE'
         gpgre = re.compile(br'^-----(?P<action>BEGIN|END) PGP (?P<what>[^-]+)-----$')
-        blank_line = re.compile(b'^$')
+        # Include whitespace-only lines in blank lines to split paragraphs.
+        # (see #715558)
+        blank_line = re.compile(b'^\s*$')
         first_line = True
 
         for line in sequence:
diff --git a/tests/test_deb822.py b/tests/test_deb822.py
index 26efc1c..f3c32a1 100755
--- a/tests/test_deb822.py
+++ b/tests/test_deb822.py
@@ -445,6 +445,30 @@ class TestDeb822(unittest.TestCase):
                 self.assertWellParsed(d, PARSED_PACKAGE)
             self.assertEqual(count, 2)
 
+    def test_iter_paragraphs_with_extra_whitespace(self):
+        """ Paragraphs not elided when stray whitespace is between
+
+        From policy ยง5.1:
+
+            The paragraphs are separated by empty lines. Parsers may accept
+            lines consisting solely of spaces and tabs as paragraph separators,
+            but control files should use empty lines.
+
+        On the principle of "be strict in what you send; be generous in
+        what you receive", deb822 should permit such extra whitespace between
+        deb822 stanzas.
+
+        See #715558 for further details.
+        """
+        for extra_space in (" ", "  ", "\t"):
+            text = (UNPARSED_PACKAGE + '%s\n' % extra_space
+                        + UNPARSED_PACKAGE).splitlines()
+            count = 0
+            for d in deb822.Deb822.iter_paragraphs(text):
+                count += 1
+            self.assertEqual(count, 2,
+                        "Wrong number paragraphs were found: %d != 2" % count)
+
     def _test_iter_paragraphs(self, filename, cls, **kwargs):
         """Ensure iter_paragraphs consistency"""
         
-- 
1.9.1

From 4aa7e0084bc6ed7866c58405750960c2d107b55f Mon Sep 17 00:00:00 2001
From: Stuart Prescott <[email protected]>
Date: Wed, 30 Jul 2014 21:49:49 +1000
Subject: [PATCH 2/4] Switch default to internal parser for deb822

Prefer the internal parser rather than apt's TagFile for processing deb822
files unless explicitly called to process Packages or Sources files:
 - prevents paragraph parsing truncating on comments (Closes: #743174).
 - fix parsing of paragraphs when separated by more whitespace than just a
   newline (Closes: #655988). (Finally fixing interactions with devscripts'
   wrap-and-sort.)
---
 debian/control       |  2 +-
 debian/tests/control |  2 ++
 lib/debian/deb822.py | 37 ++++++++++++++++++++++++++---
 tests/test_deb822.py | 66 ++++++++++++++++++++++++++++++++++++++++------------
 4 files changed, 88 insertions(+), 19 deletions(-)

diff --git a/debian/control b/debian/control
index 40519d2..0d69a6b 100644
--- a/debian/control
+++ b/debian/control
@@ -9,7 +9,7 @@ Uploaders: Adeodato Sim?? <[email protected]>,
  Stefano Zacchiroli <[email protected]>,
  John Wright <[email protected]>,
  Stuart Prescott <[email protected]>
-Build-Depends: debhelper (>= 9), dh-python, python-all (>= 2.6.6-3~), python3-all (>= 3.1.2-8~), python-setuptools, python3-setuptools, python-chardet, python3-chardet, python-nose, python3-nose, python-six, python3-six
+Build-Depends: debhelper (>= 9), dh-python, python-all (>= 2.6.6-3~), python3-all (>= 3.1.2-8~), python-setuptools, python3-setuptools, python-apt, python3-apt, python-chardet, python3-chardet, python-nose, python3-nose, python-six, python3-six
 Standards-Version: 3.9.5
 Vcs-Browser: http://anonscm.debian.org/gitweb/?p=pkg-python-debian/python-debian.git
 Vcs-Git: git://anonscm.debian.org/pkg-python-debian/python-debian.git
diff --git a/debian/tests/control b/debian/tests/control
index eecd0e5..c1ea462 100644
--- a/debian/tests/control
+++ b/debian/tests/control
@@ -1,11 +1,13 @@
 Tests: python3-debian
 Depends: python3-nose,
+ python3-apt,
  python3-debian,
  debian-keyring,
  debian-archive-keyring
 
 Tests: python-debian
 Depends: python2.7,
+ python-apt,
  python-nose,
  python-debian,
  debian-keyring,
diff --git a/lib/debian/deb822.py b/lib/debian/deb822.py
index 4eb4e40..ee87256 100644
--- a/lib/debian/deb822.py
+++ b/lib/debian/deb822.py
@@ -340,7 +340,7 @@ class Deb822(Deb822Dict):
 
         self.gpg_info = None
 
-    def iter_paragraphs(cls, sequence, fields=None, use_apt_pkg=True,
+    def iter_paragraphs(cls, sequence, fields=None, use_apt_pkg=False,
                         shared_storage=False, encoding="utf-8"):
         """Generator that yields a Deb822 object for each paragraph in sequence.
 
@@ -348,9 +348,14 @@ class Deb822(Deb822Dict):
 
         :param fields: likewise.
 
-        :param use_apt_pkg: if sequence is a file, apt_pkg will be used
+        :param use_apt_pkg: if sequence is a file, apt_pkg can be used
             if available to parse the file, since it's much much faster.  Set
-            this parameter to False to disable using apt_pkg.
+            this parameter to True to enable use of apt_pkg. Note that the
+            TagFile parser from apt_pkg is a much stricter parser of the
+            Deb822 format, particularly with regards whitespace between
+            paragraphs and comments within paragraphs. If these features are
+            required (for example in debian/control files), ensure that this
+            parameter is set to False.
         :param shared_storage: not used, here for historical reasons.  Deb822
             objects never use shared storage anymore.
         :param encoding: Interpret the paragraphs in this encoding.
@@ -1288,6 +1293,19 @@ class Sources(Dsc, _PkgRelationMixin):
         Dsc.__init__(self, *args, **kwargs)
         _PkgRelationMixin.__init__(self, *args, **kwargs)
 
+    @classmethod
+    def iter_paragraphs(cls, sequence, fields=None, use_apt_pkg=True,
+                        shared_storage=False, encoding="utf-8"):
+        """Generator that yields a Deb822 object for each paragraph in Sources.
+
+        Note that this overloaded form of the generator uses apt_pkg (a strict
+        but fast parser) by default.
+
+        See the Deb822.iter_paragraphs function for details.
+        """
+        return super(Sources, cls).iter_paragraphs(sequence, fields,
+                                    use_apt_pkg, shared_storage, encoding)
+
 
 class Packages(Deb822, _PkgRelationMixin):
     """Represent an APT binary package list"""
@@ -1300,6 +1318,19 @@ class Packages(Deb822, _PkgRelationMixin):
         Deb822.__init__(self, *args, **kwargs)
         _PkgRelationMixin.__init__(self, *args, **kwargs)
 
+    @classmethod
+    def iter_paragraphs(cls, sequence, fields=None, use_apt_pkg=True,
+                        shared_storage=False, encoding="utf-8"):
+        """Generator that yields a Deb822 object for each paragraph in Packages.
+
+        Note that this overloaded form of the generator uses apt_pkg (a strict
+        but fast parser) by default.
+
+        See the Deb822.iter_paragraphs function for details.
+        """
+        return super(Packages, cls).iter_paragraphs(sequence, fields,
+                                    use_apt_pkg, shared_storage, encoding)
+
 
 class _CaseInsensitiveString(str):
     """Case insensitive string.
diff --git a/tests/test_deb822.py b/tests/test_deb822.py
index f3c32a1..6fe06e3 100755
--- a/tests/test_deb822.py
+++ b/tests/test_deb822.py
@@ -244,6 +244,9 @@ Section: bar
 # An inline comment in the middle of a paragraph should be ignored.
 Priority: optional
 Homepage: http://www.debian.org/
+Build-Depends: debhelper,
+# quux, (temporarily disabled by a comment character)
+ python
 
 # Comments in the middle shouldn't result in extra blank paragraphs either.
 
@@ -266,6 +269,7 @@ PARSED_PARAGRAPHS_WITH_COMMENTS = [
         ('Section', 'bar'),
         ('Priority', 'optional'),
         ('Homepage', 'http://www.debian.org/'),
+        ('Build-Depends', 'debhelper,\n python'),
     ]),
     deb822.Deb822Dict([
         ('Package', 'foo'),
@@ -431,7 +435,10 @@ class TestDeb822(unittest.TestCase):
     def test_iter_paragraphs_file(self):
         text = StringIO(UNPARSED_PACKAGE + '\n\n\n' + UNPARSED_PACKAGE)
 
-        for d in deb822.Deb822.iter_paragraphs(text):
+        for d in deb822.Deb822.iter_paragraphs(text, use_apt_pkg=False):
+            self.assertWellParsed(d, PARSED_PACKAGE)
+
+        for d in deb822.Deb822.iter_paragraphs(text, use_apt_pkg=True):
             self.assertWellParsed(d, PARSED_PACKAGE)
 
     def test_iter_paragraphs_with_gpg(self):
@@ -461,13 +468,30 @@ class TestDeb822(unittest.TestCase):
         See #715558 for further details.
         """
         for extra_space in (" ", "  ", "\t"):
-            text = (UNPARSED_PACKAGE + '%s\n' % extra_space
-                        + UNPARSED_PACKAGE).splitlines()
-            count = 0
-            for d in deb822.Deb822.iter_paragraphs(text):
-                count += 1
+            text = six.u(UNPARSED_PACKAGE) + '%s\n' % extra_space + \
+                        six.u(UNPARSED_PACKAGE)
+            count = len([p for p in deb822.Deb822.iter_paragraphs(text)])
             self.assertEqual(count, 2,
-                        "Wrong number paragraphs were found: %d != 2" % count)
+                        "Wrong number paragraphs were found in list: %d != 2" % count)
+
+            fd, filename = tempfile.mkstemp()
+            fp = os.fdopen(fd, 'wb')
+            fp.write(text.encode('utf-8'))
+            fp.close()
+
+            try:
+                with open_utf8(filename) as fh:
+                    count = len([p for p in deb822.Deb822.iter_paragraphs(fh)])
+                    self.assertEqual(count, 2,
+                            "Wrong number paragraphs were found in file: %d != 2" % count)
+                with open_utf8(filename) as fh:
+                    count = len([p for p in deb822.Packages.iter_paragraphs(fh)])
+                    # this time the apt_pkg parser should be used and this
+                    # *should* elide the paragraphs and make a mess
+                    self.assertEqual(count, 1,
+                            "Wrong number paragraphs were found in file: %d != 1" % count)
+            finally:
+                os.remove(filename)
 
     def _test_iter_paragraphs(self, filename, cls, **kwargs):
         """Ensure iter_paragraphs consistency"""
@@ -856,20 +880,32 @@ Description: python modules to work with Debian-related data formats
                                   PARSED_PARAGRAPHS_WITH_COMMENTS[i])
 
     def test_iter_paragraphs_comments_use_apt_pkg(self):
-        paragraphs = list(deb822.Deb822.iter_paragraphs(
-            UNPARSED_PARAGRAPHS_WITH_COMMENTS.splitlines(), use_apt_pkg=True))
-        self._test_iter_paragraphs_comments(paragraphs)
+        """ apt_pkg does not support comments within multiline fields
+
+        This test actually checks that that the file is *incorrectly* parsed
+        to ensure that this behaviour doesn't unknowingly and accidentally
+        change in the future.
+
+        See also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750247#35
+        """
+        fd, filename = tempfile.mkstemp()
+        fp = os.fdopen(fd, 'wb')
+        fp.write(UNPARSED_PARAGRAPHS_WITH_COMMENTS.encode('utf-8'))
+        fp.close()
+
+        try:
+            with open_utf8(filename) as fh:
+                paragraphs = list(deb822.Deb822.iter_paragraphs(
+                    fh, use_apt_pkg=True))
+                self.assertEqual(paragraphs[0]['Build-Depends'], 'debhelper,')
+        finally:
+            os.remove(filename)
 
     def test_iter_paragraphs_comments_native(self):
         paragraphs = list(deb822.Deb822.iter_paragraphs(
             UNPARSED_PARAGRAPHS_WITH_COMMENTS.splitlines(), use_apt_pkg=False))
         self._test_iter_paragraphs_comments(paragraphs)
 
-    def test_iter_paragraphs_string_comments_use_apt_pkg(self):
-        paragraphs = list(deb822.Deb822.iter_paragraphs(
-            UNPARSED_PARAGRAPHS_WITH_COMMENTS, use_apt_pkg=True))
-        self._test_iter_paragraphs_comments(paragraphs)
-
     def test_iter_paragraphs_string_comments_native(self):
         paragraphs = list(deb822.Deb822.iter_paragraphs(
             UNPARSED_PARAGRAPHS_WITH_COMMENTS, use_apt_pkg=False))
-- 
1.9.1

-- 
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-python-debian-maint

Reply via email to