On Sat, Jul 25, 2009 at 05:02:16PM +0200, John Wright wrote:
> On Sat, Jul 25, 2009 at 02:05:26PM +0200, sean finney wrote:
> > severity 538376 normal
> > thanks
> >
> > okay i take back what i said about this being a regression, it seems that
> > in previous versions (< 0.1.10) it was treated as a plain text field (i.e.
> > no dictionary at all), which meant that the fields were probably ignored
> > entirely in the patch-tracker but are now partially showing up.
> >
> > still a bug though afaict :)
>
> Gah!  This is apt_pkg's fault: it strips off the leading '\n'.  This
> means our goal of having the output match the input will in general be
> broken when you use apt_pkg.  (Right now, the only place that's used by
> default is when you use iter_paragraphs.)
>
> A temporary workaround is to pass use_apt_pkg=False to the
> iter_paragraphs method.  On large files, you'll probably notice a bit of
> a performance hit.  I'll see what we can do to preserve this information
> with apt_pkg.

I'm actually inclined to turn off using apt_pkg by default.  It's
definitely faster, by a factor of roughly 1.7 to 2.6 in the runs below,
but we keep running into weird corner cases with the way apt_pkg parses
things.

Using sid's Sources and amd64 Packages files, I called each class's
iter_paragraphs method with the specified kwargs and threw away the
results, like

    for d in cls.iter_paragraphs(f, **kwargs):
        pass

I get the following run times:

Packages, <class 'debian_bundle.deb822.Packages'>, {'use_apt_pkg': True}
    0: 0:00:08.664978
    1: 0:00:07.747378
    2: 0:00:07.743156
    3: 0:00:07.961919
    4: 0:00:07.758220
    Average: 0:00:07.975130

Packages, <class 'debian_bundle.deb822.Packages'>, {'use_apt_pkg': False}
    0: 0:00:18.505047
    1: 0:00:18.179216
    2: 0:00:18.179558
    3: 0:00:18.415705
    4: 0:00:18.182857
    Average: 0:00:18.292476

Sources, <class 'debian_bundle.deb822.Sources'>, {'use_apt_pkg': True}
    0: 0:00:07.865666
    1: 0:00:07.864537
    2: 0:00:07.861713
    3: 0:00:07.873949
    4: 0:00:07.858093
    Average: 0:00:07.864791

Sources, <class 'debian_bundle.deb822.Sources'>, {'use_apt_pkg': False}
    0: 0:00:13.710405
    1: 0:00:13.262080
    2: 0:00:13.260217
    3: 0:00:13.245185
    4: 0:00:13.251963
    Average: 0:00:13.345970

Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': True}
    0: 0:00:06.283796
    1: 0:00:06.414739
    2: 0:00:06.323466
    3: 0:00:06.320447
    4: 0:00:06.264290
    Average: 0:00:06.321347

Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': False}
    0: 0:00:16.653596
    1: 0:00:16.637927
    2: 0:00:16.805496
    3: 0:00:16.631162
    4: 0:00:16.614459
    Average: 0:00:16.668528

Packages, <class 'debian_bundle.deb822.Deb822'>, {'use_apt_pkg': True, 'shared_storage': True}
    0: 0:00:02.513985
    1: 0:00:02.516300
    2: 0:00:02.521496
    3: 0:00:02.514919
    4: 0:00:02.516548
    Average: 0:00:02.516649

Clearly, using shared storage (which basically just means using
apt_pkg's parser.Section directly) is blazingly fast compared to going
without it.

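For anyone who wants to reproduce numbers like these, here is a minimal
sketch of a timing loop that prints output in the same shape.  It is not
the script I actually ran: the helper names (time_runs, report, consume)
are made up for illustration, and the workload at the bottom is a plain
range() stand-in so the snippet runs without any Debian files around.

```python
import time
from datetime import timedelta


def consume(iterable):
    """Iterate over everything and throw the results away."""
    for _ in iterable:
        pass


def time_runs(func, runs=5):
    """Call func() `runs` times; return (list of timedeltas, average)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        times.append(timedelta(seconds=time.perf_counter() - start))
    average = sum(times, timedelta()) / runs
    return times, average


def report(label, times, average):
    """Print one benchmark block: label, numbered runs, then the average."""
    print(label)
    for i, elapsed in enumerate(times):
        print("%d: %s" % (i, elapsed))
    print("Average: %s" % average)


if __name__ == "__main__":
    # Stand-in workload (an assumption, not the real benchmark input).
    times, average = time_runs(lambda: consume(range(100000)))
    report("range(100000)", times, average)
```

To time the real thing, the callable passed to time_runs would open the
Packages or Sources file afresh on every run and hand it to
cls.iter_paragraphs(f, **kwargs), since a file object that has already
been read to the end yields nothing the second time.
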
But shared storage is already not the default, since it has the
confusing side effect of making the objects returned by each iteration
(each of which has a different id) actually share the same data.

Is anybody strongly opposed to making iter_paragraphs not use apt_pkg by
default?  I'm still trying to figure out a way to salvage the output
from apt_pkg in this case, but I'm not having much luck.

-- 
John Wright <j...@debian.org>
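
For anyone who has not run into the shared-storage side effect mentioned
above, here is a toy illustration.  It does not use apt_pkg or deb822 at
all: SharedView, iter_shared and iter_copied are hypothetical names, and
the single reused dict only stands in for the parser's internal buffer.
It is meant to show the kind of surprise involved, not how apt_pkg is
implemented.

```python
class SharedView(object):
    """A distinct object that reads from storage it does not own."""

    def __init__(self, storage):
        self._storage = storage

    def __getitem__(self, key):
        return self._storage[key]


def iter_shared(paragraphs):
    """Yield a new view per paragraph, all backed by ONE reused dict."""
    storage = {}
    for fields in paragraphs:
        storage.clear()
        storage.update(fields)
        yield SharedView(storage)


def iter_copied(paragraphs):
    """Yield a new view per paragraph, each backed by its own copy."""
    for fields in paragraphs:
        yield SharedView(dict(fields))


data = [{"Package": "foo"}, {"Package": "bar"}]

shared = list(iter_shared(data))
copied = list(iter_copied(data))

print(shared[0] is shared[1])            # False: different objects
print([p["Package"] for p in shared])    # ['bar', 'bar']: same data
print([p["Package"] for p in copied])    # ['foo', 'bar']
```

Consuming each paragraph inside the loop works fine with the shared
variant; the trouble only shows up when the yielded objects are kept
around, as with list() here.
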