Colin Watson:
> On Tue, Aug 06, 2019 at 03:53:00PM +0000, Niels Thykier wrote:
>> On Wed, 31 Jul 2019 19:41:51 +0200 Robert Luberda <[email protected]> wrote:
>>> While working on my manpages-pl source package, I've noticed that the
>>> dh_installman step takes more time to execute than all other build steps 
>>> together.
>>>
>>> This poor performance is caused by recoding all (i.e. about 1500 in case
>>> of manpages-pl) man pages into UTF-8, which is pretty much useless in case
>>> of my package, because the pages are in UTF-8 already.
>>>
>>> It would be nice if dh_installman could have some option to disable
>>> recoding, or if it could at least filter the manpages to recode with
>>> `isutf8 -l' or a similar command. (I've just checked that
>>> 'isutf8 -l debian/tmp/usr/share/man/pl/man*/*' inside the package is really
>>> quick to determine that all files are in UTF-8.)
>>
>> Is there some way to trivially detect if the manpages need re-encoding
>> (without pulling moreutils as dependency or re-implementing the relevant
>> code in Perl)?  Like some troff-ish rune in the early part of the file
>> that says "this file is definitely UTF-8" or something like that?
> 
> man itself has code to do that, but it's not trivial and I'd hate to see
> it reimplemented in more places.
> 

Fair enough.

> The actual recoding bit of "man --recode" is already practically a no-op
> if both source and target are UTF-8: it takes less than a millisecond to
> decide that the page is likely to be UTF-8, and then it just passes the
> source through to the target.  However, based on a quick estimate from
> strace output for a small page, about 98% of the wallclock time is spent
> on process setup (initial memory allocation, parsing the configuration
> file, checking the manpath, and such).  So I think a far more obvious
> optimisation target would be to add a mode to man where it could recode
> a batch of pages rather than just one at a time, in order that we'd only
> have to incur that setup cost once (or at least once per xargs batch).

Ok.  That would work for me as well.

> Making that parallel would take some more work, but honestly, since this
> approach would probably give something like a 40x speedup, I'm not sure
> that'd be necessary.
> 

debhelper can already do macro parallelization by running multiple
instances of the tool in parallel (we already do that in dh_installman
btw., so that part would not be a lot of additional effort).
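
To illustrate the combination of batching and macro parallelization, here is
a rough sketch (not what dh_installman actually does internally).  The
function name and the RECODE_TOOL variable are made up for this example;
since man-bulk-tool does not exist yet, the sketch falls back to a harmless
echo.  Note that -0, -r and -P are GNU/BSD xargs extensions rather than POSIX.

```shell
#!/bin/sh
# Sketch only: run several instances of a batch recoding tool in parallel.
# RECODE_TOOL is a hypothetical stand-in; it defaults to "echo would-recode"
# so that the sketch can be run without the (not yet existing) bulk tool.
recode_in_parallel() {
    dir=$1            # directory tree containing the manpages
    jobs=${2:-4}      # number of tool instances to run at once
    batch=${3:-100}   # maximum number of pages handed to each instance

    # Each xargs batch starts the tool once, so the setup cost is paid
    # once per batch instead of once per page.
    find "$dir" -type f -print0 |
        xargs -0 -r -P "$jobs" -n "$batch" ${RECODE_TOOL:-echo would-recode}
}
```

With 1500 pages, 4 jobs and batches of 100, that would mean 15 tool
start-ups instead of 1500.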

> Does this make sense to you?  If so, do you have any opinions on the
> interface?  (I'm open to it being a new program rather than having to
> stuff even more complexity into man's command-line interface, which
> would also make it easy to detect whether the new interface is
> available.)
> 

So basically, I envision something like any of the 4 following usage
patterns (as a starting point):


  find ... -print0 | \
     man-bulk-tool -l --recode UTF-8 --suffix .dh-new \
       --null --files-from -
  # Manual post-processing in the form of: mv foo.dh-new foo


  find ... -print0 | \
    xargs -0r man-bulk-tool -l --recode UTF-8 --suffix .dh-new


  find ... -print0 | \
     man-bulk-tool -l --recode UTF-8 --in-place --null \
       --files-from -
  # No manual post-processing needed (files are rewritten in place)


  find ... -print0 | \
    xargs -0r man-bulk-tool -l --recode UTF-8 --in-place

Any of them would work equally well for me.

In all cases, I assume that compression is retained (e.g. if the file
was gzip compressed, then the output file should be as well).  This is an
assumption in the current dh_installman as well.
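
As a rough illustration of what "compression is retained" means for a single
page, consider the sketch below.  The function name and the RECODE_FILTER
variable are invented for the example (the filter defaults to cat, i.e. a
pass-through), and detecting gzip by the .gz extension is a simplification.

```shell
#!/bin/sh
# Sketch only: write a recoded copy of a page next to the original while
# keeping its compression.  RECODE_FILTER is a hypothetical stdin-to-stdout
# recoder; it defaults to "cat", which passes the page through unchanged.
recode_keep_compression() {
    page=$1
    suffix=${2:-.dh-new}

    case $page in
        *.gz)
            # Decompress, recode, recompress.  -n leaves the name and
            # timestamp out of the gzip header.  The exit status of the
            # earlier pipeline stages is not checked in this sketch.
            gzip -dc -- "$page" | ${RECODE_FILTER:-cat} | gzip -9n > "$page$suffix"
            ;;
        *)
            ${RECODE_FILTER:-cat} < "$page" > "$page$suffix"
            ;;
    esac
}
```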

For the --suffix variants: This matches what debhelper is currently
doing and was a "minimal effort opportunistic" safeguard to catch
race-condition bugs if someone were to run two distinct dh_installman
processes on the same package at the same time (presumably
unintentionally, as it is at best a waste).  It was introduced when I
added support for running recoding in parallel in dh_installman.
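
For the --suffix variants, the manual post-processing step could look
roughly like the sketch below.  The function name is made up for the
example, and it assumes that the tool has already finished writing all of
its "<page>.dh-new" files.

```shell
#!/bin/sh
# Sketch only: move every "<page><suffix>" file over "<page>" once the
# recoding run has finished.
finalize_recoded() {
    dir=$1
    suffix=${2:-.dh-new}

    # find passes the matching files to one (or a few) inner shells, which
    # strip the suffix from each name and rename the file into place.
    find "$dir" -type f -name "*$suffix" -exec sh -c '
        suffix=$1
        shift
        for new in "$@"; do
            mv -f -- "$new" "${new%"$suffix"}"
        done
    ' sh "$suffix" {} +
}
```

Calling finalize_recoded debian/tmp/usr/share/man would then replace each
original page with its recoded counterpart.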

The tool could be man itself with a special flag.  Though I can
appreciate your comment about man's command-line interface being
complex, so it might indeed be better to do it as a separate tool for
that reason alone.

Thanks,
~Niels
