Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-26 Thread sean finney
hi john,

On Sat, Jul 25, 2009 at 07:38:57PM +0200, John Wright wrote:
  A temporary workaround is to pass use_apt_pkg=False to the
  iter_paragraphs method.  On large files, you'll probably notice a bit of
  a performance hit.  I'll see what we can do to preserve this information
  with apt_pkg.

yes, that seems to make the problem go away.

 I'm actually inclined to turn off using apt_pkg by default.  It's
 definitely faster, typically by a factor between 2 and 2.5, but we keep
 running into weird corner cases with the way apt_pkg parses things.

for me anyway, speed is not a huge concern as the code path occurs during
a behind-the-scenes cronjob.  i guess in an ideal world apt_pkg would
correctly parse the file and it'd be a non-issue, but barring that i'd
throw my vote in for the slower but correct implementation as well... if
only because i don't want to have to support two different versions of
deb822.Sources in my code (production is running on lenny, development
on squeeze/sid).


sean

-- 


signature.asc
Description: Digital signature


Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-26 Thread Stefano Zacchiroli
On Sat, Jul 25, 2009 at 07:38:57PM +0200, John Wright wrote:
 I'm actually inclined to turn off using apt_pkg by default.  It's
 definitely faster, typically by a factor between 2 and 2.5, but we
 keep running into weird corner cases with the way apt_pkg parses
 things.
snip
 Clearly, using shared storage (which basically just means using
 apt_pkg's parser.Section directly) is blazingly fast compared to
 without.  But this is aready not the default, since it has the
 confusing side-effect of making the object returned by each
 iteration (each of which has a different id) actually share the same
 data.
 
 Is anybody strongly opposed to making iter_paragraphs not use
 apt_pkg by default?  I'm still trying to figure out a way to salvage
 the output from apt_pkg in this case, but I'm not having much luck.

I'm not against changing the default. However, I'm against making it
difficult / undocumented how to use apt_pkg + shared storage. I do
care about runtimes but I want to retain the nice data structures of
deb822 nevertheless (i.e. please not make me use python-apt all
alone).

What it might make sense then is to have different top level
functions, with one maybe marked fast and well documenting that it
has potential drawback. The other can safely be our, slower, default.

Just my 0.02€

-- 
Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7
z...@{upsilon.cc,pps.jussieu.fr,debian.org} -- http://upsilon.cc/zack/
Dietro un grande uomo c'è ..|  .  |. Et ne m'en veux pas si je te tutoie
sempre uno zaino ...| ..: | Je dis tu à tous ceux que j'aime


signature.asc
Description: Digital signature


Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-26 Thread John Wright
On Sun, Jul 26, 2009 at 12:34:47PM +0200, Stefano Zacchiroli wrote:
 On Sat, Jul 25, 2009 at 07:38:57PM +0200, John Wright wrote:
  I'm actually inclined to turn off using apt_pkg by default.  It's
  definitely faster, typically by a factor between 2 and 2.5, but we
  keep running into weird corner cases with the way apt_pkg parses
  things.
 snip
  Clearly, using shared storage (which basically just means using
  apt_pkg's parser.Section directly) is blazingly fast compared to
  without.  But this is aready not the default, since it has the
  confusing side-effect of making the object returned by each
  iteration (each of which has a different id) actually share the same
  data.
  
  Is anybody strongly opposed to making iter_paragraphs not use
  apt_pkg by default?  I'm still trying to figure out a way to salvage
  the output from apt_pkg in this case, but I'm not having much luck.
 
 I'm not against changing the default. However, I'm against making it
 difficult / undocumented how to use apt_pkg + shared storage. I do
 care about runtimes but I want to retain the nice data structures of
 deb822 nevertheless (i.e. please not make me use python-apt all
 alone).
 
 What it might make sense then is to have different top level
 functions, with one maybe marked fast and well documenting that it
 has potential drawback. The other can safely be our, slower, default.

I don't quite understand what you mean.  Are you thinking of a method
like deb822.Deb822.iter_paragraphs_fast(...), which is basically a
wrapper around iter_paragraphs with use_apt_pkg=True and
shared_storage=True?

I wouldn't be opposed to something like that.  Somehow, we just need to
make it clear that using the native parser, which is slower than
apt_pkg, is the only way to guarantee that modifying a Deb822 object and
then dumping it back out will produce output consistent with the input.

I really wish it were possible to ensure the apt_pkg parser gave the
same results as the native parser, though. :(

-- 
John Wright j...@debian.org



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-26 Thread Stefano Zacchiroli
On Sun, Jul 26, 2009 at 03:07:18PM +0200, John Wright wrote:
 I don't quite understand what you mean.  Are you thinking of a method
 like deb822.Deb822.iter_paragraphs_fast(...), which is basically a
 wrapper around iter_paragraphs with use_apt_pkg=True and
 shared_storage=True?

Yes, sorry for not having been clear in my wording.

-- 
Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7
z...@{upsilon.cc,pps.jussieu.fr,debian.org} -- http://upsilon.cc/zack/
Dietro un grande uomo c'è ..|  .  |. Et ne m'en veux pas si je te tutoie
sempre uno zaino ...| ..: | Je dis tu à tous ceux que j'aime



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-26 Thread John Wright
On Sun, Jul 26, 2009 at 03:06:58PM +0200, John Wright wrote:
 I really wish it were possible to ensure the apt_pkg parser gave the
 same results as the native parser, though. :(

Looks like we can actually do this, with a small patch to python-apt [1]
and the attached patch.

 [1]: http://bugs.debian.org/538723

-- 
John Wright j...@debian.org
commit a26b007a91c9723d33690e7e0b897c70c2006ee5
Author: John Wright j...@johnwright.org
Date:   Sun Jul 26 17:20:20 2009 +0200

Don't ignore leading newlines on field data with apt_pkg

This depends on the patch to python-apt in #538723 being accepted.

diff --git a/debian_bundle/deb822.py b/debian_bundle/deb822.py
index fb1ed7c..a5662af 100644
--- a/debian_bundle/deb822.py
+++ b/debian_bundle/deb822.py
@@ -37,6 +37,31 @@ import sys
 import StringIO
 import UserDict
 
+class TagSectionWrapper(object, UserDict.DictMixin):
+Wrap a TagSection object, using the FindRaw method instead of Find
+
+This allows us to pick which whitespace to strip off the beginning and end
+of the data, so we don't lose leading newlines.
+
+
+def __init__(self, parser):
+self.parser = parser
+
+def keys(self):
+return self.parser.keys()
+
+def __getitem__(self, key):
+s = self.parser.FindRaw(key)
+
+if s is None:
+raise KeyError(key)
+
+data = s.partition(':')[2]
+
+# Get rid of spaces and tabs after the :, but not newlines, and strip
+# off any newline at the end of the data.
+return data.lstrip(' \t').rstrip('\n')
+
 class OrderedSet(object):
 A set-like object that preserves order when iterating over it
 
@@ -242,16 +267,12 @@ class Deb822(Deb822Dict):
 
 if _have_apt_pkg and use_apt_pkg and isinstance(sequence, file):
 parser = apt_pkg.ParseTagFile(sequence)
+section = TagSectionWrapper(parser.Section)
 while parser.Step() == 1:
 if shared_storage:
-parsed = parser.Section
+parsed = section
 else:
-# Since parser.Section doesn't have an items method, we
-# need to imitate that method here and make a Deb822Dict
-# from the result in order to preserve order.
-items = [(key, parser.Section[key])
- for key in parser.Section.keys()]
-parsed = Deb822Dict(items)
+parsed = Deb822Dict(section)
 yield cls(fields=fields, _parsed=parsed)
 
 else:


Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-25 Thread sean finney
Package: python-debian
Version: 0.1.14
Severity: important

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

hi!

i discovered this one when trying to run the patch-tracker.debian.net code
on an updated sid system.

if a record in a Sources file has only a single checksum entry, it's returned
as a single dict object.  if it has multiple entries, they are returned as
a list of dicts.  for example, with an abbriviated Sources file like this:

cat  EOF  sourcesfile
Package: debian-archive-keyring
Files:
 2675031b2286ca8dfc085e2a9c9d38ed 838 debian-archive-keyring_2009.01.31.dsc
 5365c07ddcf639544933552e31a571ee 13627 debian-archive-keyring_2009.01.31.tar.gz
Checksums-Sha1:
 dd987a5ee85d8c4fc3e115f60a6a021e78c626dd 13627 
debian-archive-keyring_2009.01.31.tar.gz

Package: povray
Files:
 8ba29fc593db0da2f69b24cb732a787a 1404 povray_3.7.0~beta29-1.dsc
 5f18fcdf332164151fc8f8e200f9bffe 6761402 povray_3.7.0~beta29.orig.tar.gz
 0b7ea3d5724f5f7a50c0277a194fea3f 24696 povray_3.7.0~beta29-1.diff.gz
Checksums-Sha1:
 c2b88b819f5b8fe731990bec2e02654d67c5df34 6761402 
povray_3.7.0~beta29.orig.tar.gz
 bf7ed630c44fa37b0eb24b666a5a16c2701ec202 24696 povray_3.7.0~beta29-1.diff.gz
EOF

and the following code:

slist = deb822.Sources.iter_paragraphs(file('sourcesfile'))
for ent in slist:
  print ent['Checksums-Sha1']

you get:

{'sha1': 'dd987a5ee85d8c4fc3e115f60a6a021e78c626dd', 'size': '13627', 'name': 
'debian-archive-keyring_2009.01.31.tar.gz'}
[{'sha1': 'c2b88b819f5b8fe731990bec2e02654d67c5df34', 'size': '6761402', 
'name': 'povray_3.7.0~beta29.orig.tar.gz'}, {'sha1': 
'bf7ed630c44fa37b0eb24b666a5a16c2701ec202', 'size': '24696', 'name': 
'povray_3.7.0~beta29-1.diff.gz'}

and i would argue you should get:

[{'sha1': 'dd987a5ee85d8c4fc3e115f60a6a021e78c626dd', 'size': '13627', 'name': 
'debian-archive-keyring_2009.01.31.tar.gz'}]
[{'sha1': 'c2b88b819f5b8fe731990bec2e02654d67c5df34', 'size': '6761402', 
'name': 'povray_3.7.0~beta29.orig.tar.gz'}, {'sha1': 
'bf7ed630c44fa37b0eb24b666a5a16c2701ec202', 'size': '24696', 'name': 
'povray_3.7.0~beta29-1.diff.gz'}

which i believe was the previous behavior.

i should be able to code around this problem but it does seem to be
a regression.

sean

- -- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.30-1-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.utf8, LC_CTYPE=en_US.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages python-debian depends on:
ii  python2.5.4-2An interactive high-level object-o
ii  python-support1.0.3  automated rebuilding support for P

Versions of packages python-debian recommends:
ii  python-apt0.7.11.0   Python interface to libapt-pkg

Versions of packages python-debian suggests:
ii  gpgv  1.4.9-4GNU privacy guard - signature veri

- -- no debconf information

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iD8DBQFKat+/ynjLPm522B0RAorKAJkBUww/FFblYe89Jnd0TVm4Z51k2ACfRC0k
m7hToedec/16kKqNgpu6kd0=
=Jx2c
-END PGP SIGNATURE-



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#538376: python-debian: debian_bundle parses some multilines fields inconsistantly

2009-07-25 Thread John Wright
On Sat, Jul 25, 2009 at 05:02:16PM +0200, John Wright wrote:
 On Sat, Jul 25, 2009 at 02:05:26PM +0200, sean finney wrote:
  severity 538376 normal
  thanks
  
  okay i take back what i said about this being a regression, it seems that
  in previous versions ( 0.1.10) it was treated as a plain text field (i.e. 
  no dictionary at all), which meant that the fields were probably ignored
  entirely in the patch-tracker but are now partially showing up.
  
  still a bug though afaict :)
 
 Gah!  This is apt_pkg's fault: it strips off the leading '\n'.  This
 means our goal of having the output match the input will in general be
 broken when you use apt_pkg.  (Right now, the only place that's used by
 default is when you use iter_paragraphs.)
 
 A temporary workaround is to pass use_apt_pkg=False to the
 iter_paragraphs method.  On large files, you'll probably notice a bit of
 a performance hit.  I'll see what we can do to preserve this information
 with apt_pkg.

I'm actually inclined to turn off using apt_pkg by default.  It's
definitely faster, typically by a factor between 2 and 2.5, but we keep
running into weird corner cases with the way apt_pkg parses things.

Using sid's Sources and amd64 Packages files, calling the respective
class's iter_paragraph method with the specified kwargs and throwing
away the results, like

for d in cls.iter_paragraphs(f, **kwargs): pass

I get the following run times:


Packages, class 'debian_bundle.deb822.Packages', {'use_apt_pkg': True}
0: 0:00:08.664978
1: 0:00:07.747378
2: 0:00:07.743156
3: 0:00:07.961919
4: 0:00:07.758220
Average: 0:00:07.975130

Packages, class 'debian_bundle.deb822.Packages', {'use_apt_pkg': False}
0: 0:00:18.505047
1: 0:00:18.179216
2: 0:00:18.179558
3: 0:00:18.415705
4: 0:00:18.182857
Average: 0:00:18.292476

Sources, class 'debian_bundle.deb822.Sources', {'use_apt_pkg': True}
0: 0:00:07.865666
1: 0:00:07.864537
2: 0:00:07.861713
3: 0:00:07.873949
4: 0:00:07.858093
Average: 0:00:07.864791

Sources, class 'debian_bundle.deb822.Sources', {'use_apt_pkg': False}
0: 0:00:13.710405
1: 0:00:13.262080
2: 0:00:13.260217
3: 0:00:13.245185
4: 0:00:13.251963
Average: 0:00:13.345970

Packages, class 'debian_bundle.deb822.Deb822', {'use_apt_pkg': True}
0: 0:00:06.283796
1: 0:00:06.414739
2: 0:00:06.323466
3: 0:00:06.320447
4: 0:00:06.264290
Average: 0:00:06.321347

Packages, class 'debian_bundle.deb822.Deb822', {'use_apt_pkg': False}
0: 0:00:16.653596
1: 0:00:16.637927
2: 0:00:16.805496
3: 0:00:16.631162
4: 0:00:16.614459
Average: 0:00:16.668528

Packages, class 'debian_bundle.deb822.Deb822', {'use_apt_pkg': True, 
'shared_storage': True}
0: 0:00:02.513985
1: 0:00:02.516300
2: 0:00:02.521496
3: 0:00:02.514919
4: 0:00:02.516548
Average: 0:00:02.516649


Clearly, using shared storage (which basically just means using
apt_pkg's parser.Section directly) is blazingly fast compared to
without.  But this is aready not the default, since it has the confusing
side-effect of making the object returned by each iteration (each of
which has a different id) actually share the same data.

Is anybody strongly opposed to making iter_paragraphs not use apt_pkg by
default?  I'm still trying to figure out a way to salvage the output
from apt_pkg in this case, but I'm not having much luck.

-- 
John Wright j...@debian.org



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org