That was fun, and it gave me a good excuse to play with Devel::hdb.
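(If you want to try Devel::hdb yourself, it's just a matter of running
the script under the debugger and pointing a browser at the address it
announces when it starts up; something like

    perl -d:hdb check_links.pl

where "check_links.pl" is just a stand-in for whatever you've called
your script.)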
There's a bug in the way that WWW::SimpleRobot handles broken links.
If the link is in the original array that you pass, it recognizes the
broken link and calls the callback routine. But when it's traversing a
page and building its list of links, it discards any link that fails a
"head" request. So every broken link it discovers during traversal is
silently dropped. That's probably worth a bug report to the author.

More Detail
-----------

To troubleshoot this, I first ran it the way you did. Then I looked at
the docs for WWW::SimpleRobot and didn't see anything useful there.
Next, I looked at the source (nicely formatted by metacpan:
https://metacpan.org/source/AWRIGLEY/WWW-SimpleRobot-0.07/SimpleRobot.pm).

On line 35, I noticed there's a VERBOSE mode. Looking down the code a
little ways (lines 119-124), you can see that VERBOSE is used to print
a "get $url" line before the BROKEN_LINK_CALLBACK is called. Running
that way showed that the code never prints
"get http://www.ncgia.ucsb.edu/%7Ecova/seap.html". Looking a little
further shows lines 140-142, which discard the link if head() fails.

The hdb debugging interface was really nice for this. (Unfortunately,
I spent a fair amount of time playing with the debugger. <shrug/>)

I can see a couple of ways of fixing this:

1. Easiest: report the bug through RT and hope the author takes care
   of it soon.
2. Patch your copy of the WWW::SimpleRobot code to call the callback
   on the head() failure, or simply not discard the link when head()
   fails.
3. Copy the WWW::SimpleRobot traversal code into your script and fix
   it there.

The first approach is probably the best.
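If you'd like a stopgap in the meantime that doesn't require touching
the module, you could also do the head() check yourself inside
VISIT_CALLBACK. Here's a rough, untested sketch of that idea. It
assumes the entries in $links are HTML::LinkExtor-style arrayrefs of
( tag, attribute => value, ... ) pairs, which is what your "@{$link}"
print suggests, and it assumes the VERBOSE flag on line 35 can be set
from the constructor.

#!/usr/local/bin/perl
use strict;
use warnings;
use WWW::SimpleRobot;
use LWP::Simple qw( head );
use URI;

my $robot = WWW::SimpleRobot->new(
    URLS           => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
    FOLLOW_REGEX   => "^http://www.ncgia.ucsb.edu/",
    DEPTH          => 1,
    TRAVERSAL      => 'depth',
    VERBOSE        => 1,    # assumption: the line 35 switch is settable here
    VISIT_CALLBACK => sub {
        my ( $url, $depth, $html, $links ) = @_;
        print STDERR "\nVisiting $url\n\n";
        for my $link ( @$links ) {
            # Assumed shape: ( 'a', href => '...' ), per HTML::LinkExtor
            my ( $tag, %attr ) = @$link;
            next unless defined $attr{href};
            # Resolve relative hrefs against the page they came from
            my $abs = URI->new_abs( $attr{href}, $url )->as_string;
            next unless $abs =~ m{^https?://};    # skip mailto:, etc.
            print STDERR "$abs looks like a broken link on $url\n"
                unless head( $abs );
        }
    },
    BROKEN_LINK_CALLBACK => sub {
        my ( $url, $linked_from, $depth ) = @_;
        print STDERR "$url looks like a broken link on $linked_from\n";
    },
);
$robot->traverse;

print "\nAll done.\n";

That repeats some head() requests the module already makes, but it
reports the failures instead of silently skipping them, and
URI->new_abs() takes care of the relative links you were wondering
about.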
G. Wade

On Thu, 24 Oct 2013 05:03:40 -0500
Mike Flannigan <[email protected]> wrote:

> I almost never use this script below, but I think
> I may start using modified copies of it. The
> broken link referred to in the comment section below
> is linked to the text "Spatial Evacuation Analysis Project
> <http://www.ncgia.ucsb.edu/%7Ecova/seap.html>"
> on the webpage
> http://www.ncgia.ucsb.edu/about/sitemap.php
>
> The program apparently skips the broken link and
> probably a lot of other links. Maybe because they are
> relative links?? I could probably figure this out,
> but just haven't worked on it much yet.
>
> I found this script. I did not create it.
>
> #!/usr/local/bin/perl
> #
> # This program crawls sites listed in URLS and checks
> # all links. But it does not crawl outside the base
> # site listed in FOLLOW_REGEX. It lists all the links
> # followed, including the broken links. All output goes
> # to the terminal window.
> #
> # I say this does not work, because the link
> # http://www.ncgia.ucsb.edu/~cova/seap.html
> # is broken on this page: http://www.ncgia.ucsb.edu/about/sitemap.php
> # but this script does not point that out.
> #
> #
> use strict;
> use warnings;
> use WWW::SimpleRobot;
> my $robot = WWW::SimpleRobot->new(
>     URLS                 => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
>     FOLLOW_REGEX         => "^http://www.ncgia.ucsb.edu/",
>     DEPTH                => 1,
>     TRAVERSAL            => 'depth',
>     VISIT_CALLBACK       =>
>         sub {
>             my ( $url, $depth, $html, $links ) = @_;
>             print STDERR "\nVisiting $url\n\n";
>             foreach my $link (@$links){
>                 print STDERR "@{$link}\n";    # This dereferences the links
>             }
>         },
>     BROKEN_LINK_CALLBACK =>
>         sub {
>             my ( $url, $linked_from, $depth ) = @_;
>             print STDERR "$url looks like a broken link on $linked_from\n";
>             print STDERR "Depth = $depth\n";
>         }
> );
> $robot->traverse;
> my @urls  = @{$robot->urls};
> my @pages = @{$robot->pages};
> for my $page ( @pages )
> {
>     my $url               = $page->{url};
>     my $depth             = $page->{depth};
>     my $modification_time = $page->{modification_time};
> }
>
> print "\nAll done.\n";
>
>
> __END__
>
>
>
> On 10/23/2013 1:48 PM, G. Wade Johnson wrote:
> > Hi Mike, Thanks for the input. I'm glad you have been able to get
> > input from other resources. I hope this list and the hangout will
> > become more useful to you as well.
> > We have had some of these topics covered in the past, so the talks
> > pages may have some information that will help. My goal with this is
> > really to help people get unstuck and see how to proceed, rather
> > than teaching.
> >
> > For example, if you had a particular task you wanted to perform with
> > LWP (even if it's an example problem), we could walk through where
> > you are stuck and get you moving again. Also, we could answer
> > questions on the modules that we know.
> >
> > It sounds like what I have in mind could be useful to you as well.
> >
> > G. Wade
>

-- 
We've all heard that a million monkeys banging on a million typewriters
will eventually reproduce the works of Shakespeare. Now, thanks to the
Internet, we know this is not true.
                                          -- Robert Wilensky, UCB

_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
