Colin Watson:
> On Tue, Aug 06, 2019 at 03:53:00PM +0000, Niels Thykier wrote:
>> On Wed, 31 Jul 2019 19:41:51 +0200 Robert Luberda <[email protected]> wrote:
>>> While working on my manpages-pl source package, I've noticed that the
>>> dh_installman step takes more time to execute than all other build steps
>>> together.
>>>
>>> This poor performance is caused by recoding all (i.e. about 1500 in case
>>> of manpages-pl) man pages into UTF-8, which is pretty much useless in case
>>> of my package, because the pages are in UTF-8 already.
>>>
>>> It would be nice if dh_installman could have some option to disable
>>> recoding
>>> or if it could at least filter the manpages to recode with `isutf8 -l'
>>> or similar command. (I've just checked that
>>> 'isutf8 -l debian/tmp/usr/share/man/pl/man*/*' inside the package is really
>>> quick to determine that all files are in UTF-8).
>>
>> Is there some way to trivially detect if the manpages need re-encoding
>> (without pulling moreutils as dependency or re-implementing the relevant
>> code in Perl)? Like some troff-ish rune in the early part of the file
>> that says "this file is definitely UTF-8" or something like that?
>
> man itself has code to do that, but it's not trivial and I'd hate to see
> it reimplemented in more places.
>
Fair enough.
> The actual recoding bit of "man --recode" is already practically a no-op
> if both source and target are UTF-8: it takes less than a millisecond to
> decide that the page is likely to be UTF-8, and then it just passes the
> source through to the target. However, based on a quick estimate from
> strace output for a small page, about 98% of the wallclock time is spent
> on process setup (initial memory allocation, parsing the configuration
> file, checking the manpath, and such). So I think a far more obvious
> optimisation target would be to add a mode to man where it could recode
> a batch of pages rather than just one at a time, in order that we'd only
> have to incur that setup cost once (or at least once per xargs batch).
Ok. That would work for me as well.
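
As a rough illustration of the setup-cost point (a sketch only, not man's
or debhelper's actual code): the hypothetical helper count_invocations
below uses a trivial stand-in command instead of man, and merely counts
how many processes get started when pages are handled one at a time
versus in an xargs batch.

```shell
#!/bin/sh
# Hypothetical helper: count how many times a stand-in command is started
# for the files under directory $1. The stand-in prints one line per
# process start; it is NOT man and does no recoding.
count_invocations() {
    dir=$1
    mode=$2
    case $mode in
        per-file)
            # -n 1 forces a fresh process for every single file, so the
            # per-process setup cost is paid once per page.
            find "$dir" -type f -print0 |
                xargs -0r -n 1 sh -c 'echo started' sh | wc -l
            ;;
        batched)
            # Default xargs behaviour: pass as many files per process as
            # fit on one command line, so setup is paid once per batch.
            find "$dir" -type f -print0 |
                xargs -0r sh -c 'echo started' sh | wc -l
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 2
            ;;
    esac
}
```

With a handful of pages, the "batched" mode starts a single process where
"per-file" starts one per page, which is the saving a bulk mode would buy.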
> Making that parallel would take some more work, but honestly, since this
> approach would probably give something like a 40x speedup, I'm not sure
> that'd be necessary.
>
debhelper can already do macro-level parallelization by running multiple
instances of the tool in parallel (we already do that in dh_installman,
by the way, so that part would not take much additional effort).
> Does this make sense to you? If so, do you have any opinions on the
> interface? (I'm open to it being a new program rather than having to
> stuff even more complexity into man's command-line interface, which
> would also make it easy to detect whether the new interface is
> available.)
>
So basically, I envision something like any of the following four usage
patterns (as a starting point):

  find ... -print0 | \
    man-bulk-tool -l --recode UTF-8 --suffix .dh-new \
      --null --files-from -
  # Manual post-processing in the form of: mv foo.dh-new foo

  find ... -print0 | \
    xargs -0r man-bulk-tool -l --recode UTF-8 --suffix .dh-new
  # Manual post-processing in the form of: mv foo.dh-new foo

  find ... -print0 | \
    man-bulk-tool -l --recode UTF-8 --in-place --null \
      --files-from -

  find ... -print0 | \
    xargs -0r man-bulk-tool -l --recode UTF-8 --in-place

Any of them would work equally well for me.

In all cases, I assume that compression is retained (e.g. if the file
was gzip-compressed, then the output file should be as well). This is an
assumption in the current dh_installman as well.
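
To spell out what I mean by retaining compression, here is a minimal
sketch under stated assumptions: recode_stream is a hypothetical stand-in
for the real recoding step (here it just passes data through), detecting
gzip by the .gz file name suffix is my simplification, and
recode_keep_compression is not an existing tool or debhelper function.

```shell
#!/bin/sh
# Hypothetical stand-in for the real recoder: passes stdin to stdout
# unchanged. A real tool would convert the page to UTF-8 here.
recode_stream() {
    cat
}

# Recode one page and keep its compression: a .gz input yields a .gz
# output, an uncompressed input stays uncompressed.
recode_keep_compression() {
    page=$1
    tmp="$page.dh-new"
    case $page in
        *.gz)
            # Decompress, recode, recompress. gzip -n omits the name and
            # timestamp from the header. Note: a plain POSIX pipeline only
            # reports the status of its last command, so a real tool would
            # need stricter error handling than this sketch has.
            gzip -dc "$page" | recode_stream | gzip -9n > "$tmp" ||
                { rm -f "$tmp"; return 1; }
            ;;
        *)
            recode_stream < "$page" > "$tmp" ||
                { rm -f "$tmp"; return 1; }
            ;;
    esac
    # Replace the original only after the new file was written.
    mv "$tmp" "$page"
}
```

Example: recode_keep_compression foo.1.gz leaves foo.1.gz as a gzip file
containing the recoded page.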

For the --suffix variants: this matches what debhelper is currently
doing and was a "minimal effort opportunistic" safeguard to catch
race-condition bugs if someone were to run two distinct dh_installman
processes on the same package at the same time (presumably
unintentionally, as it is at best a waste). It was introduced when I
added support for running recoding in parallel in dh_installman.
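
The write-then-rename pattern behind the --suffix variants can be
sketched roughly as follows. This is an illustration, not the code in
dh_installman: recode_one is a hypothetical stand-in that simply copies
the page, and recode_with_suffix is a name I made up for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for the real recoding step: writes the page to
# stdout unchanged. A real implementation would recode to UTF-8 here.
recode_one() {
    cat "$1"
}

# Usage: recode_with_suffix SUFFIX FILE...
# Writes each result to FILE+SUFFIX first and renames it over FILE only
# when the recoding step succeeded.
recode_with_suffix() {
    suffix=$1
    shift
    for page in "$@"; do
        tmp="$page$suffix"
        if recode_one "$page" > "$tmp"; then
            # mv fails if another process has already renamed or removed
            # the temporary file, which is how an accidental concurrent
            # run on the same package would tend to surface.
            mv "$tmp" "$page" || return 1
        else
            # Leave the original untouched and clean up the partial file.
            rm -f "$tmp"
            return 1
        fi
    done
}
```

For example, recode_with_suffix .dh-new foo.1 writes foo.1.dh-new and
then moves it over foo.1.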

The tool could be man itself with a special flag, though I can
appreciate your comment about man's command-line interface being
complex; it might indeed be better to do it as a separate tool for
that reason alone.

Thanks,
~Niels