On 16 September 2010 10:20, Brian Moore <[email protected]> wrote:
>
> Well, to be fair, the PHP script that created the index page had totally
> different HTML than the Apache index. So it may be that the regex you
> mentioned before passed, but it then got hung up somewhere else.
>
> In any case, I switched it back to the default lighttpd format.
>
Thanks -- I've patched the scanner and your mirror is now being scanned
properly. We'll see how it goes; hopefully my patch is correct :)
Next: Hiawatha :-/
Cheers,
~p
--- scanner.ORIG 2010-09-16 11:02:56.000000000 +1000
+++ scanner.NEW 2010-09-16 12:03:27.000000000 +1000
@@ -607,7 +607,58 @@
     if($do_transaction) {
       $dbh->commit or die "$DBI::errstr";
     }
-  }
+  # lighttpd: <thead><tr><th class="n">Name</th><th class="m">Last Modified</th><th class="s">Size</th><th class="t">Type</th></tr></thead>
+  } elsif($contents =~ s{^.*<thead>.*>Name<.*<tbody>}{}s) {
+    ## Oh look, it's a lighttpd directory index!
+    $contents =~ s{</tbody>.*$}{}s;
+    for my $line (split "\n", $contents) {
+      $line =~ s/<\/*t[rd].*?>/ /g;
+      print "$identifier: line: $line\n" if $verbose > 2;
+      if($line =~ m{^(.*)[Hh][Rr][Ee][Ff]="([^"]+)">([^<]+)</[Aa]>.+([\w\s:-]+)\s+(-|[\d\.]+[KMG]?)}) {
+        my ($pre, $name1, $name2, $date, $size) = ($1, $2, $3, $4, $5);
+        next if $name1 =~ m{^/} or $name1 =~ m{^\.\.};
+        if($verbose > 2) {
+          print "$identifier: pre $pre\n";
+          print "$identifier: name1 $name1\n";
+          print "$identifier: name2 $name2\n";
+          print "$identifier: date $date\n";
+          print "$identifier: size $size\n";
+        }
+        $name1 =~ s{%([\da-fA-F]{2})}{pack 'C', hex $1}ge;
+        $name1 =~ s{^\./}{};
+        my $dir = ($pre =~ m{>Directory<}) ? 1 : 0;
+        my $t = length($name) ? "$name/$name1" : $name1;
+        if($size eq '-' and ($dir or $name1 =~ m{/$})) {
+          ## we must be really sure it is a directory, when we come here.
+          ## otherwise, we'll retrieve the contents of a file!
+          sleep($recursion_delay) if $recursion_delay;
+          push @r, http_readdir($identifier, $id, $urlraw, $t, 0);
+        }
+        else {
+          ## it is a file.
+          my $time = $date;
+          my $len = byte_size($size);
+
+          # str2time returns undef in some rare cases causing KILL! FIXME
+          # workaround: don't store files with broken times
+          if(not defined($time)) {
+            print "$identifier: Error: str2time returns undef on parsing \"$date\". Skipping file $name1\n";
+            print "$identifier: current line was:\n$line\nat url $url/$name\nname= $name1\n" if $verbose > 1;
+          }
+          elsif(largefile_check($identifier, $id, $t, $len)) {
+            #save timestamp and file in database
+            if(save_file($t, $identifier, $id, $time, $re)) {
+              push @r, [ $t , $time ];
+            }
+          }
+        }
+      }
+    }
+    print "$identifier: committing http dir $name\n" if $verbose > 2;
+    if($do_transaction) {
+      $dbh->commit or die "$DBI::errstr";
+    }
+  }
   else {
     ## we come here, whenever we stumble into an automatic index.html
     $contents = substr($contents, 0, 500);
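
For anyone who wants to check the row regex from the patch outside the
scanner, here is a minimal standalone sketch. The sample row is made up to
resemble a lighttpd index row and is not taken from Brian's mirror, and the
helper names (strip_cells, parse_row) and the "lazy" pattern are
hypothetical, not part of the patch. On this sample the greedy ".+" in
front of the date capture seems to leave only whitespace in $date, while a
non-greedy ".+?" picks up the full date string. Treat it as something to
verify against real listings, not as a confirmed fix.

```perl
use strict;
use warnings;

# Hypothetical sample row, modelled on a lighttpd directory index table row;
# it is not taken from the mirror discussed in this thread.
my $row = '<tr><td class="n"><a href="foo.iso">foo.iso</a></td>'
        . '<td class="m">2010-Sep-16 10:20:00</td>'
        . '<td class="s">1.2K</td>'
        . '<td class="t">application/octet-stream</td></tr>';

# Same tag stripping as the patch: each <tr>/<td> tag becomes one space.
sub strip_cells {
    my ($html) = @_;
    $html =~ s/<\/*t[rd].*?>/ /g;
    return $html;
}

# Row pattern exactly as posted (greedy ".+" in front of the date capture).
my $posted = qr{^(.*)[Hh][Rr][Ee][Ff]="([^"]+)">([^<]+)</[Aa]>.+([\w\s:-]+)\s+(-|[\d\.]+[KMG]?)};

# Hypothetical variant: identical except for a non-greedy ".+?".
my $lazy = qr{^(.*)[Hh][Rr][Ee][Ff]="([^"]+)">([^<]+)</[Aa]>.+?([\w\s:-]+)\s+(-|[\d\.]+[KMG]?)};

# Return (href, date, size) for a stripped line, or an empty list on no match.
sub parse_row {
    my ($re, $line) = @_;
    my @m = ($line =~ $re);
    return unless @m;
    return @m[1, 3, 4];
}

my $line = strip_cells($row);
for my $case ([posted => $posted], [lazy => $lazy]) {
    my ($label, $re) = @$case;
    my ($href, $date, $size) = parse_row($re, $line);
    next unless defined $href;
    # Brackets make leading/trailing whitespace in the captures visible.
    printf "%-6s href=[%s] date=[%s] size=[%s]\n", $label, $href, $date, $size;
}
```

Both patterns agree on the name and size for this row; only the date
capture differs, so the timestamp passed on to save_file() is the part
worth checking.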
_______________________________________________
ArchServer Project General Mail List
Post messages to: [email protected]
Administer your subscription: http://lists.archserver.org/listinfo/general