On 2012-07-02 03:25, Colin Watson wrote:
> On Sun, Jun 17, 2012 at 02:15:31PM +0200, Niels Thykier wrote:
>> I noticed [1] and decided to check what made Lintian a "lengthy
>> invariant". Processing the changes (and related files) took about a
>> minute (according to the shell built-in time). Running:
>>
>>   $ time lintian -d -C manpages allegro4-doc_4.4.2-2_all.deb
>>
>> takes about 40 seconds.
>>
>> The bottleneck appears to be our calls to "man" in checks/manpages.
>> Manually running man on all the manpages takes roughly 30 seconds. As
>> far as I can tell, man is "just slow" (at least with currently
>> selected options).
>
> A good deal of this is just death-by-a-thousand-cuts rather than any
> single thing being desperately slow; it's not unreasonably slow for
> interactive use, but it's being run 823 times here, and it has to spawn
> a lot of subprocesses because the full warnings check necessarily
> involves invoking nroff, which isn't lightweight.
>
Right, a better start would have been "our use of man is unreasonably
slow". But thanks for having a look at this.

> I've never attempted to optimise the manpages check before, though, and
> so there's some scope for easy improvements: each subprocess is
> expensive when you multiply them up, so let's look at which ones are
> obviously unnecessary. (I can't get any accurate timings just now
> because my backups are running.)
>

My runs (with -dd) show that the check currently takes 42-43 seconds on
allegro4-doc/2_4.4.2-2.1. If I replace the man exec with _exit(0), the
runtime drops to 12-13 seconds.

> Setting MANROFFSEQ to empty in the environment would get rid of a call
> to tbl for most pages; this would mean that lintian is stricter about
> pages declaring their preprocessors with '\" lines (i.e. pages that
> need tbl would have to say '\" t at the top), but as long as we
> document this in the info text for the relevant check I would say that a
> bit of extra strictness is perfectly acceptable in the context of
> lintian, certainly if it comes with a performance advantage.
>

Setting it made no visible difference in my runs; maybe the manpages
already invoke those preprocessors? But if it makes us stricter, it will
probably still be worth it.

> Adding the '-Tutf8 -Z' options to man would cause it to only run pages
> through the parsing half of the groff pipeline, and not bother with
> formatting them for display using grotty or processing the output
> through col.
>

This took about 7 seconds off (putting the run at about 35 seconds for
me).

> On the lintian side, it would be worth taking some steps to avoid
> running commands using the shell (e.g. the list forms of open and exec
> with some manual redirections).

Not a visible improvement in runtime, but it seems like a good idea
anyway.

> Each one doesn't take very long but
> they add up. Also, we might as well use 'gzip -cd' directly rather than
> running through the zcat wrapper script every time.
>

Using open_gz from Lintian::Util takes another 4 seconds off (with
libperlio-gzip-perl installed). With this, we are at approximately 31
seconds. Nice... :)
  I did not try uninstalling libperlio-gzip-perl, so it is possible that
the improvement is less visible without it.

> How far does all this get you? Given the current timings, I'd have
> thought that even fractional improvements would be worthwhile.
>
>> Running man in a collection is unlikely to yield any noticeable
>> improvement[2]. Even with xargs we are looking at at least 25 seconds
>> plus man is unhelpful in this case[3].
> [...]
>> [3] It emits errors when running with xargs that do not occur when
>>     running them in serial.
>
> Can you give me an example yielding such a difference?
>

These two do give the difference:

  find usr/share/man/ -type f | xargs man -E UTF-8 -l >/dev/null
  find usr/share/man/ -type f -exec man -E UTF-8 -l {} \; >/dev/null

The errors were:

  <standard input>:31: warning [p 1, 5.7i]: cannot adjust line

However, when passing -Tutf8 -Z, they seem to behave identically, so I
guess it is not relevant.

>> The error messages all use "<standard input>" rather than a filename,
>> so it will be... difficult to relate them to the original manpage.
>
> Indeed. This is really groff being unhelpful, not man; convincing groff
> to output a more useful file name would appear to require man to write
> out a temporary file, which wouldn't be terribly clever for I/O. I
> suppose we could have man postprocess groff's error messages, or write
> out a status line at the start of processing each file so that lintian
> could know what "<standard input>" following that line means, or
> something like that.
>

Yeah, except at this point the difference between find -exec and
find | xargs is down to about 2 seconds. So the extra trouble gains us
less with your improvements. :)
  That said, it might still make sense to move this to a collection to
get the benefit of parallelization.
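For reference, the man options discussed in this thread can be combined into one invocation. The following is only a sketch under stated assumptions, not the contents of man.diff: check_manpage and check_tree are hypothetical helper names, and prefixing each warning with the file name is just one possible way to cope with groff reporting "<standard input>" when pages are processed one at a time.

```shell
# Sketch only: hypothetical helpers, not lintian code.

# Run one manpage through man's parsing stage and print its warnings.
check_manpage() {
    # MANROFFSEQ='' : do not run default preprocessors (e.g. tbl); pages
    #                 that need one must declare it in their '\" line.
    # -Tutf8 -Z     : stop after the parsing half of the groff pipeline.
    # -l            : treat the argument as a local file.
    # Warnings arrive on stderr; the intermediate output is discarded.
    MANROFFSEQ='' man -E UTF-8 -Tutf8 -Z -l "$1" </dev/null 2>&1 >/dev/null
}

# Check every file below a directory, one page per man invocation, and
# prefix each warning with the page that produced it.
check_tree() {
    find "$1" -type f | sort | while IFS= read -r page; do
        check_manpage "$page" | while IFS= read -r warning; do
            printf '%s: %s\n' "$page" "$warning"
        done
    done
}
```

Something like `check_tree usr/share/man/` would then list the warnings per file. This still starts one man process per page, so it is comparable to the find -exec variant above rather than to the xargs one.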
Anyhow, the sum of my changes is attached in man.diff.

~Niels