Hi,

I have written a proof-of-concept patch for running file(1) in parallel (in collection/file-info). The parallelism should probably only be enabled under some conditions (e.g. for huge packages). The patch also drops the file-info entries for directories, since they are unused and always match "(setguid )?directory".

The rationale for the patch is that I did some trivial benchmarks to find our bottleneck(s). For the test I used the eclipse source package[1] on tmpfs. On my machine lintian finishes its check after 1 minute and 10-15 seconds, and most of that time (~1 minute) is spent running collections. The two slowest appeared to be unpacked (~11-12 seconds) and file-info (~52-54 seconds). The rest of the source collections complete within the same second they are started.

With this patch I can reduce file-info to about 24 seconds[2]. The eclipse binary packages seem to gain next to nothing from it. I assume this is because the source package contains over 38k files, while the binary packages "only" had 300-400ish files[3].

The numbers seem to hold even if I remove the tmpfs (within +/- 3 seconds). All timing was done with "time" (thus all numbers are only precise to the second).

~Niels

[1] eclipse 3.7~exp-2.dsc
    Reason for choosing it: it was big and it was available!

[2] The machine did have plenty of cores and RAM to spare.

[3] As determined by "tar vjtf $file" and "dpkg --contents $file" piped
    through "wc -l". I only checked the largest source tarball and the
    largest binary package.

From d3864e610edba18b25dbff2d8dc836b4cfc62fba Mon Sep 17 00:00:00 2001
From: Niels Thykier <[email protected]>
Date: Wed, 31 Aug 2011 15:14:37 +0200
Subject: [PATCH] Parallelize file-info with up to 4 invocations

---
 collection/file-info |   35 ++++++++++++++++++++++++++---------
 1 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/collection/file-info b/collection/file-info
index e61acb4..2a90959 100755
--- a/collection/file-info
+++ b/collection/file-info
@@ -22,7 +22,7 @@
 use strict;
 use warnings;
 
-use Cwd qw(realpath);
+use Cwd qw(cwd realpath);
 use FileHandle;
 use lib "$ENV{'LINTIAN_ROOT'}/lib";
 use Util;
@@ -35,6 +35,7 @@ my $last = '';
 
 my $helper = realpath("$0-helper");
 my $outfile = realpath('./file-info');
+my $dir = cwd;
 
 unlink($outfile);
 
@@ -48,28 +49,44 @@ open(INDEX, '<', 'index')
 chdir('unpacked')
     or fail("cannot chdir to unpacked directory: $!");
 
+my $i = 0;
+my @jobs;
+for ( ; $i < 4 ; $i++) {
 # We ignore failures from file because sometimes file returns a non-zero exit
 # status when it can't parse a file. So far, the resulting output still
 # appears to be usable (although will contain "ERROR" strings, which Lintian
 # doesn't care about), and the only problem was the exit status.
-my %opts = ( pipe_in => FileHandle->new,
-             out => $outfile,
-             fail => 'never' );
-spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
-$opts{pipe_in}->blocking(1);
+    my %opts = ( pipe_in => FileHandle->new,
+                 out => "$outfile.$i",
+                 fail => 'never' );
+    spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
+    $opts{pipe_in}->blocking(1);
+    push @jobs, \%opts;
+}
+
+$i = 0;
 
 while (<INDEX>) {
     chomp;
+    # skip directories as the output is uninteresting and not used anyway.
+    # (index has a type which is easier to check as well)
+    next if /^d/o;
     $_ = (split(' ', $_, 6))[5];
     s/ link to .*//;
     s/ -> .*//;
     s/(\G|[^\\](?:\\\\)*)\\(\d{3})/"$1" . chr(oct $2)/ge;
     s/\\\\/\\/;
-    printf {$opts{pipe_in}} "%s\0", $_;
+    printf {$jobs[$i]->{pipe_in}} "%s\0", $_;
+    $i = ($i + 1) & 3;
 }
 
+
 close(INDEX) or fail("cannot close index file: $!");
 
-close $opts{pipe_in};
-reap(\%opts);
+foreach my $opts (@jobs) {
+    close $opts->{pipe_in};
+    reap($opts);
+}
 
+system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
+
-- 
1.7.5.4
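
As a rough illustration of the scheme in the patch (this is not lintian code): the sketch below deals names out round-robin to four background workers, waits for all of them, and then concatenates the per-worker outputs, much like the final "cat file-info.*" step. The worker function, the fileNN names and the in.N/out.N file names are made up for the example. The real collection feeds NUL-separated names to "xargs ... file" piped into the helper, whereas the sketch uses one name per line to stay short.

```shell
#!/bin/sh
# Sketch only: round-robin a list of names over 4 parallel workers, then
# merge the per-worker outputs. Everything named here is hypothetical.
set -eu

tmp=$(mktemp -d)

# Stand-in for the "xargs ... file | helper" pipeline: reads names on
# stdin and writes one result line per name.
worker() {
    while IFS= read -r name; do
        printf '%s: processed\n' "$name"
    done
}

# Hypothetical input: ten names, one per line.
i=1
while [ "$i" -le 10 ]; do
    printf 'file%02d\n' "$i"
    i=$((i + 1))
done > "$tmp/index"

# Make sure every worker has an input file, even if it ends up empty.
for j in 0 1 2 3; do
    : > "$tmp/in.$j"
done

# Deal the names out round-robin: line 1 to in.0, line 2 to in.1, ...
awk -v dir="$tmp" '{ print > (dir "/in." ((NR - 1) % 4)) }' "$tmp/index"

# Start all workers in the background, then wait for every one of them.
for j in 0 1 2 3; do
    worker < "$tmp/in.$j" > "$tmp/out.$j" &
done
wait

# Merge the partial results; they are grouped by worker, not in index order.
cat "$tmp"/out.* > "$tmp/merged"
wc -l < "$tmp/merged"
```

The merged file here starts with the first worker's share (file01, file05, file09) rather than following the order of the index. I would expect the same to hold for the file-info produced by the patch, so a consumer should not rely on the order of entries.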

