I almost never use the script below, but I may start
using modified copies of it.  The broken link referred to
in the comment section of the script is attached to the text
"Spatial Evacuation Analysis Project"
<http://www.ncgia.ucsb.edu/%7Ecova/seap.html> on the page
http://www.ncgia.ucsb.edu/about/sitemap.php

The program apparently skips the broken link, and
probably a lot of other links as well.  Maybe that is
because they are relative links?  I could probably figure
this out, but I just haven't worked on it much yet.  (A
rough sketch of one way to resolve relative links before
checking them appears after the script.)

I found this script.  I did not create it.

#!/usr/local/bin/perl
#
# This program crawls sites listed in URLS and checks
# all links.  But it does not crawl outside the base
# site listed in FOLLOW_REGEX.  It lists all the links
# followed, including the broken links.  All output goes
# to the terminal window.
#
# I say this does not work because the link http://www.ncgia.ucsb.edu/~cova/seap.html
# is broken on this page: http://www.ncgia.ucsb.edu/about/sitemap.php
# but this script does not report it.
#
#
use strict;
use warnings;
use WWW::SimpleRobot;
my $robot = WWW::SimpleRobot->new(
    URLS            => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
    FOLLOW_REGEX    => "^http://www.ncgia.ucsb.edu/",
    DEPTH           => 1,
    TRAVERSAL       => 'depth',
    VISIT_CALLBACK  =>
        sub {
            my ( $url, $depth, $html, $links ) = @_;
            print STDERR "\nVisiting $url\n\n";
            foreach my $link (@$links){
                print STDERR "@{$link}\n"; # This derefereces the links
            }
        },
    BROKEN_LINK_CALLBACK  =>
        sub {
            my ( $url, $linked_from, $depth ) = @_;
print STDERR "$url looks like a broken link on $linked_from\n";
            print STDERR "Depth = $depth\n";
        }
);
$robot->traverse;
my @urls  = @{$robot->urls};
my @pages = @{$robot->pages};
# Each element of @pages is a hash ref describing one visited page.
# This loop only pulls the fields out; nothing is printed yet.
for my $page ( @pages )
{
    my $url               = $page->{url};
    my $depth             = $page->{depth};
    my $modification_time = $page->{modification_time};
}

print "\nAll done.\n";


__END__
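
Here is a rough sketch of what I have in mind for the
relative-link question above.  It is not part of the
original script, and the approach (fetch the sitemap page,
resolve every link against the page URL with URI, then
HEAD-check each one with LWP::UserAgent) is just my guess
at the missing step.  LWP::UserAgent, HTML::LinkExtor, and
URI are standard CPAN modules, but the rest is an untested
assumption, not anything WWW::SimpleRobot itself does.

#!/usr/local/bin/perl
#
# Sketch only: fetch one page, resolve every link (relative or
# absolute) against the page URL, and HEAD-check each one.
# This is a guess at the missing step, not the original author's code.
#
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $page = 'http://www.ncgia.ucsb.edu/about/sitemap.php';
my $ua   = LWP::UserAgent->new( timeout => 15 );

my $response = $ua->get($page);
die "Could not fetch $page: ", $response->status_line, "\n"
    unless $response->is_success;

# Collect the href/src attributes from every link-carrying tag.
my @links;
my $extor = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attrs ) = @_;
        push @links, values %attrs;
    }
);
$extor->parse( $response->decoded_content );

# Resolve each link against the page URL, then HEAD-check it.
for my $link (@links) {
    my $abs = URI->new_abs( $link, $page );
    next unless $abs->scheme && $abs->scheme =~ /^https?$/;
    my $check = $ua->head($abs);
    printf "%s %s (linked from %s)\n",
        $check->is_success ? "OK    " : "BROKEN",
        $abs, $page;
}

If the seap.html link really does return an error, this
should print a BROKEN line for it, which would also show
whether the sitemap page links to it with a relative href
that WWW::SimpleRobot is not resolving.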




On 10/23/2013 1:48 PM, G. Wade Johnson wrote:
Hi Mike, Thanks for the input. I'm glad you have been able to get input from other resources. I hope this list and the hangout will become more useful to you as well.
We have had some of these topics covered in the past, so the talks
pages may have some information that will help. My goal with this is
really to help people get unstuck and see how to proceed, rather than
teaching.

For example, if you had a particular task you wanted to perform with
LWP (even if it's an example problem), we could walk through where you
are stuck and get you moving again. Also, we could answer questions on
the modules that we know.

It sounds like what I have in mind could be useful to you as well.

G. Wade


_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
