Hey guys - I've got an idea for a new feature I'd like to add to wget.
I'd like a way to specify a program to be run that can filter the URLs
just before they are fetched.  I'd like this so that I could use wget to
do recursive retrievals against Google's web cache.  This would be useful
for restoring deleted web sites, reading sites under heavy load, etc.
My first shot was something like this:

   wget -r "http://www.google.com/search?q=cache:www.tregar.com/"

That works fine for the first page, but the page that comes back contains
links that refer to www.tregar.com, not Google's cache.  My solution,
given the proposed feature, would be something like:

   wget -r --url-filter=google.pl \
      "http://www.google.com/search?q=cache:www.tregar.com/"

Where google.pl would be something like this (assuming the URL comes in
on STDIN and goes back out on STDOUT, minus error checking):

   #!/usr/bin/perl
   # read one URL per line on STDIN, print the rewritten URL on STDOUT
   while (<STDIN>) {
        chomp;           # drop the trailing newline before rewriting
        s!^http://!!;    # strip the scheme, then wrap in a cache query
        print "http://www.google.com/search?q=cache:$_\n";
   }

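Just so the interface is concrete: the filter should be testable by hand,
one URL in per line, one rewritten URL out.  For example (about.html is
just a made-up page on my site):

   echo "http://www.tregar.com/about.html" | perl google.pl

which should print http://www.google.com/search?q=cache:www.tregar.com/about.html.
The idea is that wget would run every URL it is about to fetch through the
filter in the same way.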
Another possible implementation would be to include a regex engine in wget
and allow the user to specify the filter as a regex.  This obviously makes
for less powerful filters but might be more UNIXy.

Reactions?

-sam

