On 16 September 2010 10:20, Brian Moore <[email protected]> wrote:

>
> Well to be fair, the php script that created the index page had totally
> different HTML than the apache index.  So it may be that the regex you
> mentioned before passed, but it then got hung up somewhere else.
>
> In any case, I switched it back to the default lighttpd formate.
>

Thanks -- I've patched it (I hope) and your mirror is being scanned properly
now. We'll see how it goes, and hopefully my patching is correct :)

Next: Hiawatha :-/

Cheers,
~p
--- scanner.ORIG	2010-09-16 11:02:56.000000000 +1000
+++ scanner.NEW	2010-09-16 12:03:27.000000000 +1000
@@ -607,7 +607,58 @@
     if($do_transaction) {
       $dbh->commit or die "$DBI::errstr";
     }
-  }
+  # lighttpd:	<thead><tr><th class="n">Name</th><th class="m">Last Modified</th><th class="s">Size</th><th class="t">Type</th></tr></thead>
+  } elsif($contents =~ s{^.*<thead>.*>Name<.*<tbody>}{}s) {
+    ## Oh look, it's a lighttpd directory index!
+    $contents =~ s{</tbody>.*$}{}s;
+    for my $line (split "\n", $contents) {
+      $line =~ s/<\/*t[rd].*?>/ /g;
+      print "$identifier: line: $line\n" if $verbose > 2;
+      if($line =~ m{^(.*)[Hh][Rr][Ee][Ff]="([^"]+)">([^<]+)</[Aa]>.+([\w\s:-]+)\s+(-|[\d\.]+[KMG]?)}) {
+        my ($pre, $name1, $name2, $date, $size) = ($1, $2, $3, $4, $5);
+        next if $name1 =~ m{^/} or $name1 =~ m{^\.\.};
+        if($verbose > 2) {
+          print "$identifier: pre $pre\n";
+          print "$identifier: name1 $name1\n";
+          print "$identifier: name2 $name2\n";
+          print "$identifier: date $date\n";
+          print "$identifier: size $size\n";
+        }
+        $name1 =~ s{%([\da-fA-F]{2})}{pack 'c', hex $1}ge;
+        $name1 =~ s{^\./}{};
+        my $dir = 1 if $pre =~ m{>Directory<};
+        my $t = length($name) ? "$name/$name1" : $name1;
+        if($size eq '-' and ($dir or $name1 =~ m{/$})) {
+          ## we must be really sure it is a directory, when we come here.
+          ## otherwise, we'll retrieve the contents of a file!
+          sleep($recursion_delay) if $recursion_delay;
+          push @r, http_readdir($identifier, $id, $urlraw, $t, 0);
+        }
+        else {
+          ## it is a file.
+          my $time = $date;
+          my $len = byte_size($size);
+
+          # str2time returns undef in some rare cases causing KILL! FIXME
+          # workaround: don't store files with broken times
+          if(not defined($time)) {
+            print "$identifier: Error: str2time returns undef on parsing \"$date\". Skipping file $name1\n";
+            print "$identifier: current line was:\n$line\nat url $url/$name\nname= $name1\n" if $verbose > 1;
+          }
+          elsif(largefile_check($identifier, $id, $t, $len)) {
+            #save timestamp and file in database
+            if(save_file($t, $identifier, $id, $time, $re)) {
+              push @r, [ $t , $time ];
+            }
+          }
+        }
+      }
+    }
+    print "$identifier: committing http dir $name\n" if $verbose > 2;
+    if($do_transaction) {
+      $dbh->commit or die "$DBI::errstr";
+    }
+ }
   else {
     ## we come here, whenever we stumble into an automatic index.html 
     $contents = substr($contents, 0, 500);
_______________________________________________
ArchServer Project General Mail List
Post messages to: [email protected]
Administer your subscription: http://lists.archserver.org/listinfo/general

Reply via email to