On 2012-07-02 03:25, Colin Watson wrote:
> On Sun, Jun 17, 2012 at 02:15:31PM +0200, Niels Thykier wrote:
>> I noticed [1] and decided to check what made Lintian a "lengthy
>> invariant".  Processing the changes (and related files) took about a
>> minute (according to the shell built-in time).  Running:
>>
>>  $ time lintian -d -C manpages allegro4-doc_4.4.2-2_all.deb
>>
>> takes about 40 seconds.
>>
>> The bottleneck appears to be our calls to "man" in checks/manpages.
>> Manually running man on all the manpages takes roughly 30 seconds.  As
>> far as I can tell, man is "just slow" (at least with currently
>> selected options).
> 
> A good deal of this is just death-by-a-thousand-cuts rather than any
> single thing being desperately slow; it's not unreasonably slow for
> interactive use, but it's being run 823 times here, and it has to spawn
> a lot of subprocesses because the full warnings check necessarily
> involves invoking nroff, which isn't lightweight.
> 

Right, a better start would have been "our use of man is unreasonably
slow".  But thanks for having a look at this.

> I've never attempted to optimise the manpages check before, though, and
> so there's some scope for easy improvements: each subprocess is
> expensive when you multiply them up, so let's look at which ones are
> obviously unnecessary.  (I can't get any accurate timings just now
> because my backups are running.)
> 

My runs (with -dd) show that the check currently takes 42-43 seconds on
allegro4-doc/2_4.4.2-2.1.  If I replace the man exec with _exit(0), the
runtime drops to 12-13 seconds.

> Setting MANROFFSEQ to empty in the environment would get rid of a call
> to tbl for most pages; this would mean that lintian is stricter about
> pages declaring their preprocessors with '\" lines (i.e.  pages that
> need tbl would have to say  '\" t  at the top), but as long as we
> document this in the info text for the relevant check I would say that a
> bit of extra strictness is perfectly acceptable in the context of
> lintian, certainly if it comes with a performance advantage.
> 

This made no visible difference for me; maybe the manpages already
invoke those preprocessors?  But if it makes us stricter, it is
probably still worth it.
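
For illustration, a rough sketch of what the stricter behaviour implies
(Python here rather than Lintian's Perl; declared_preprocessors and
strict_man_env are made-up names, not anything in Lintian or man-db).
With MANROFFSEQ empty, a page only gets tbl if its first line declares
it, so a check could look at that line like this:

```python
import os

# The declaration prefix is the four characters: ' \ " space
PREFIX = "'\\\" "


def declared_preprocessors(first_line):
    """Return the preprocessor letters declared on a page's first line.

    For example, a first line of  '\\" t  declares tbl and yields {"t"}.
    Pages without such a line yield an empty set.
    """
    if not first_line.startswith(PREFIX):
        return set()
    # Drop whitespace so a trailing newline or stray spaces are ignored.
    return set("".join(first_line[len(PREFIX):].split()))


def strict_man_env(base_env=None):
    """Copy an environment and set MANROFFSEQ to empty, as suggested above."""
    env = dict(os.environ if base_env is None else base_env)
    env["MANROFFSEQ"] = ""
    return env
```

The env dict would then be handed to whatever spawns man; nothing here
runs man itself.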

> Adding the '-Tutf8 -Z' options to man would cause it to only run pages
> through the parsing half of the groff pipeline, and not bother with
> formatting them for display using grotty or processing the output
> through col.
> 

This took about 7 seconds off (putting the run at about 35 seconds for
me).
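
To make the invocation concrete, here is a small sketch of the argument
list (Python standing in for Lintian's Perl; man_check_argv is a
hypothetical helper).  The -E UTF-8 and -l options are the ones from the
commands further down in this mail, -Tutf8 -Z are the suggested
additions; the exact option set Lintian ends up using may differ:

```python
def man_check_argv(page_path):
    """Build the man command line for checking one local manpage file.

    -l treats the argument as a file name rather than a page name;
    -Tutf8 -Z stops after the parsing half of the groff pipeline.
    """
    return ["man", "-E", "UTF-8", "-Tutf8", "-Z", "-l", page_path]


print(man_check_argv("usr/share/man/man3/example.3"))
```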

> On the lintian side, it would be worth taking some steps to avoid
> running commands using the shell (e.g. the list forms of open and exec
> with some manual redirections).

Not a visible improvement in runtime, but it seems like a good idea
anyway.
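
As a sketch of the idea (in Python, where passing a list to subprocess
plays the role of Perl's list forms of open and exec; run_without_shell
is a made-up name), the child is started directly with its arguments
and no /bin/sh in between.  Apart from saving a process per call, file
names with odd characters pass through untouched:

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run argv directly (no shell involved) and return its stdout."""
    done = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # manual redirection instead of 2>/dev/null
        text=True,
        check=True,
    )
    return done.stdout


# A file name that a shell would split up or partly execute.
tricky = "page with spaces; $(echo oops).1"

# Stand-in child process: the Python interpreter echoing its first argument.
echoed = run_without_shell(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", tricky]
)
print(echoed.strip() == tricky)
```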

>  Each one doesn't take very long but
> they add up.  Also, we might as well use 'gzip -cd' directly rather than
> running through the zcat wrapper script every time.
> 

Using open_gz from Lintian::Util takes another 4 seconds off (with
libperlio-gzip-perl installed).  With this, we are at approximately 31
seconds.  Nice... :)

I did not try to uninstall libperlio-gzip-perl, so it is possible the
improvement is less visible without it.
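
The gain presumably comes from decompressing in-process instead of
spawning zcat (or gzip) once per page.  A minimal sketch of that
pattern, in Python rather than Perl, with open_manpage as a
hypothetical helper and not Lintian's actual open_gz:

```python
import gzip


def open_manpage(path):
    """Open a manpage as text, decompressing .gz files in-process.

    No zcat or gzip subprocess is started; undecodable bytes are
    replaced rather than raising, since manpage encodings vary.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")
```

The caller reads from the returned handle the same way in both cases.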

> How far does all this get you?  Given the current timings, I'd have
> thought that even fractional improvements would be worthwhile.
> 
>> Running man in a collection is unlikely to yield any noticeable
>> improvement[2].  Even with xargs we are looking at at least 25 seconds
>> plus man is unhelpful in this case[3].
> [...]
>> [3] It emits errors when running with xargs that do not occur when
>> running them in serial.
> 
> Can you give me an example yielding such a difference?
> 

These two commands show the difference:
find usr/share/man/ -type f | xargs  man -E UTF-8 -l >/dev/null
find usr/share/man/ -type f -exec  man -E UTF-8 -l  {} \; >/dev/null

The errors were:
 <standard input>:31: warning [p 1, 5.7i]: cannot adjust line

However when passing -Tutf8 -Z, they seem to behave identically, so I
guess it is not relevant.

>> The error messages all use "<standard input>" rather than a filename,
>> so it will be... difficult to relate them to the original manpage.
> 
> Indeed.  This is really groff being unhelpful, not man; convincing groff
> to output a more useful file name would appear to require man to write
> out a temporary file, which wouldn't be terribly clever for I/O.  I
> suppose we could have man postprocess groff's error messages, or write
> out a status line at the start of processing each file so that lintian
> could know what "<standard input>" following that line means, or
> something like that.
> 

Yeah, except at this point the difference between find -exec and find |
xargs is down to about 2 seconds.  So with your improvements, the extra
trouble buys us less. :)
  That said, it might still make sense to move this to a collection to
get the benefit of parallelization.
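
Just to illustrate the parallelization point (this is not how Lintian
collections are implemented; check_pages and its check_one parameter
are hypothetical, and Python stands in for Perl), running the per-page
check from a small worker pool could look like this.  Threads are
enough here on the assumption that the real work happens in the
spawned man processes:

```python
from concurrent.futures import ThreadPoolExecutor


def check_pages(pages, check_one, max_workers=4):
    """Run check_one(page) for every page in parallel.

    Returns a dict mapping each page to whatever check_one returned
    for it (for example, a list of warning lines).
    """
    pages = list(pages)  # iterated twice below, so materialise it first
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them up.
        return dict(zip(pages, pool.map(check_one, pages)))
```

Keeping one man invocation per page also sidesteps the
"<standard input>" attribution problem, since each result stays tied
to its page.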

Anyhow, the sum of my changes is attached as man.diff.

~Niels

Attachment: man.diff
Description: application/wine-extension-patch
