On 2011-08-31 15:52, Niels Thykier wrote:
> [...]

I have looked at this some more, and the original patch can be improved.
xargs has two options of interest: --max-args, which makes it run the
command in smaller batches, and --max-procs, which makes xargs handle the
parallelization itself. I strongly suspect that xargs does a much better
job here than my previous patch.

All in all, the file-info script (with --max-procs=4) is now down to
~15 seconds (from ~24), and the total unpack time for the eclipse source
is about 27 seconds[0]. The attached patch goes on top of my previous
patch[1].

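As a rough illustration of how the two xargs options interact (a sketch
only, not part of the attached patch): the snippet below feeds ten
made-up, NUL-terminated names to xargs and lets each invocation report how
many arguments it received. The names f1..f10 and the limits 4 and 2 are
assumptions chosen for the demo; the patch itself uses --max-procs=4 and
--max-args=65. It assumes an xargs that supports -0 and -P, such as GNU
xargs.

```shell
#!/bin/sh
# Hypothetical input: ten NUL-terminated names standing in for index entries.
# -n 4 is the short form of --max-args=4: at most 4 names per invocation.
# -P 2 is the short form of --max-procs=2: up to 2 invocations run at once.
# Each invocation prints only its argument count; the lines are sorted
# because parallel batches may finish in any order.
batches=$(
    printf '%s\0' f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 |
        xargs -0 -n 4 -P 2 sh -c 'echo "$#"' sh |
        sort -n |
        paste -s -d ' ' -
)
echo "batch sizes: $batches"   # prints "batch sizes: 2 4 4"
```

Ten names with a batch limit of four give two full batches and one batch
of two; the process limit only affects how many of them run concurrently.
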
The downside of --max-procs is that the output from the sub-processes
becomes garbled, so we have to write to separate files manually and merge
the output afterwards. This is not difficult; my first approach is simply
to use the pid to give each process a unique file (opened in append mode
to avoid truncating an existing file in case a pid is reused by a later
process). This works great, except that those files add up: in my test,
the lab ends up with 500+ small "file-info" parts.

Merging them is fairly trivial, but I do not like all the "parts". Do you
have an idea for keeping the number of parts down to a reasonable level?
My best alternative is to make a "merging daemon" and have the
file-info-helper processes feed it their output. That would remove all
the file parts, but at the price of added complexity and an extra process.

~Niels

[0] Only tested on tmpfs this time.

[1] The missing 0002 is my "poor man's benchmark" code in
    frontend/lintian.

From 4488929c17957cc76cc35dbb6dca7b182e4396a1 Mon Sep 17 00:00:00 2001
From: Niels Thykier <[email protected]>
Date: Sat, 3 Sep 2011 10:14:42 +0200
Subject: [PATCH 3/3] Use xargs's parallelization in coll/file-info with --max-args

This is a couple of seconds faster on huge packages; xargs can do a
better job at scheduling the individual runs.

The downside is that the output becomes garbled (unless written to
"per-process files" or similar).  For huge packages the number of
files easily exceeds 100.
---
 collection/file-info        |   27 ++++++++-------------------
 collection/file-info-helper |   12 ++++++++++--
 2 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/collection/file-info b/collection/file-info
index 2a90959..313e29a 100755
--- a/collection/file-info
+++ b/collection/file-info
@@ -49,22 +49,14 @@ open(INDEX, '<', 'index')
 chdir('unpacked')
     or fail("cannot chdir to unpacked directory: $!");
 
-my $i = 0;
-my @jobs;
-for ( ; $i < 4 ; $i++) {
 # We ignore failures from file because sometimes file returns a non-zero exit
 # status when it can't parse a file.  So far, the resulting output still
 # appears to be usable (although will contain "ERROR" strings, which Lintian
 # doesn't care about), and the only problem was the exit status.
-    my %opts = ( pipe_in => FileHandle->new,
-                 out => "$outfile.$i",
-                 fail => 'never' );
-    spawn(\%opts, ['xargs', '-0r', 'file', '-F', '', '--print0', '--'], '|', [$helper]);
-    $opts{pipe_in}->blocking(1);
-    push @jobs, \%opts;
-}
-
-$i = 0;
+my %opts = ( pipe_in => FileHandle->new,
+             fail => 'never' );
+spawn(\%opts, ['xargs', '-0r', '--max-procs=4', '--max-args=65', $helper, $outfile]);
+$opts{pipe_in}->blocking(1);
 
 while (<INDEX>) {
     chomp;
@@ -76,17 +68,14 @@ while (<INDEX>) {
     s/ -> .*//;
     s/(\G|[^\\](?:\\\\)*)\\(\d{3})/"$1" . chr(oct $2)/ge;
     s/\\\\/\\/;
-    printf {$jobs[$i]->{pipe_in}} "%s\0", $_;
-    $i = ($i + 1) & 3;
+    printf {$opts{pipe_in}} "%s\0", $_;
 }
 close(INDEX) or fail("cannot close index file: $!");
 
-foreach my $opts (@jobs) {
-    close $opts->{pipe_in};
-    reap($opts);
-}
-system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
+close $opts{pipe_in};
+reap(\%opts);
+system("cd \"$dir\" && cat file-info.* > file-info") == 0 or fail "cannot create $outfile";
diff --git a/collection/file-info-helper b/collection/file-info-helper
index 3c7bde0..f6583da 100755
--- a/collection/file-info-helper
+++ b/collection/file-info-helper
@@ -3,7 +3,13 @@
 use strict;
 use warnings;
 
-while ( my $line = <> ) {
+my $ofile = shift;
+$ofile .= ".$$";
+open my $out, '>>', $ofile or die "opening $ofile: $!";
+
+open my $cmd, '-|', 'file', '-F', '', '--print0', '--', @ARGV or die "file: $!";
+
+while ( my $line = <$cmd> ) {
     my ($file, $type) = $line =~ (m/^(.*?)\x00(.*)$/o);
     if ($file =~ m/\.gz$/o && -e $file && ! -l $file && $type !~ m/compressed/o){
         # While file could be right, it is unfortunately
@@ -30,6 +36,8 @@ while ( my $line = <> ) {
         }
         $type = "$type, $text" if $text;
     }
-    printf "%s%c%s\n", $file , 0, $type;
+    printf $out "%s%c%s\n", $file , 0, $type;
 }
 
+close $cmd;
+close $out or die "closing $ofile: $!";
-- 
1.7.5.4

