Does anybody have any thoughts or recommendations on which of the various WWW robot modules around would allow 'realtime saving of state'? That is, if the Perl script crashes and I start it over, it will resume from the point at which it died.

I need to populate a database with the link information ('from link' to 'to link') so that I can then run a reporting tool on the database. Basically, I need to create a database map of which links go to which page, for every page on an individual internet site.

Speed is important, so whilst the robot must be 'nice', it doesn't have to have the manners of a prince! Also, multiple robots can be operating on the site at any given time: for example, one robot would stay within XXX.domain.com and another might map YYY.domain.com, but I may want more than one robot to work on www.domain.com.

I have a Linux box and database access at my disposal.

Regards,
Marty
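(One way to get the 'realtime saving of state' you describe, independent of which robot module you pick, is to write every discovered link and every queue change to the database the moment it happens, so a restarted crawler just reads the queue back. The sketch below shows the idea in Python with SQLite; all table, column, and function names here are illustrative, not the API of any robot module.)

```python
import sqlite3

# Use a file path instead of ":memory:" for real crash-survivable persistence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS links (from_url TEXT, to_url TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

def enqueue(url):
    # INSERT OR IGNORE makes re-adding a known URL harmless, so several
    # robots can feed the same queue table without duplicating work.
    conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    conn.commit()

def next_url():
    # After a restart, this simply picks up the first unfetched URL.
    row = conn.execute("SELECT url FROM queue WHERE done = 0 LIMIT 1").fetchone()
    return row[0] if row else None

def record_links(from_url, to_urls):
    # Store the 'from link' -> 'to link' rows and mark the page fetched,
    # then commit: once this returns, a crash loses nothing.
    conn.executemany("INSERT INTO links VALUES (?, ?)",
                     [(from_url, t) for t in to_urls])
    conn.execute("UPDATE queue SET done = 1 WHERE url = ?", (from_url,))
    conn.commit()

enqueue("http://www.domain.com/")
url = next_url()
record_links(url, ["http://www.domain.com/a", "http://www.domain.com/b"])
enqueue("http://www.domain.com/a")
```

The same pattern works from Perl via DBI; the point is only that the queue lives in the database rather than in the process's memory.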
Hi,
Don't know if this is _exactly_ what you are looking for, but I found it at perlmonks.org.
"Works for me..."
sG
#!/usr/local/bin/perl -w
use strict;
use LWP::Simple;

my $page = "http://url.to.search/";
my @urls;    # queue of URLs discovered so far
my @visit;   # URLs already fetched
my ($doc, $url, $pattern, $print);

open(OUT,   ">>LOG.borders")         or die "Could not open file: $!";
open(VISIT, ">>LOG.visited.borders") or die "Could not open file: $!";
open(LOG,   ">>LOG.urls.borders")    or die "Could not open file: $!";

&get_urls;

## fetch and parse each discovered page; get_urls() pushes new
## URLs onto @urls, and foreach picks them up as the list grows
foreach $url (@urls) {
    my $visited = join(' ', @visit);
    $visited =~ tr/\?/Q/;
    next if $visited =~ /\Q$url\E/i;   # skip pages already fetched
    $url =~ tr/Q/\?/;                  # restore the '?' masked earlier
    push(@visit, $url);
    print VISIT "$url\n";
    $page  = $url;
    $print = get($url);
    print "Getting $url...\n";
    &get_urls;
    ## log each page matching one of the search terms
    foreach $pattern ("I_want_this_term", "THING B", "THING C", "THING D") {
        if (defined $print && $print =~ /($pattern)/i) {
            print OUT "$1, $url\n";
        }
    }
}

close(LOG);
close(VISIT);
close(OUT);
print "\nDone!!!\n";

sub get_urls {
    ## find all links within the current page
    $doc = get($page);
    return unless defined $doc;
    foreach my $word (split /\s/, $doc) {
        if ($word =~ /href="(http:\/\/[^"]+)"/i) {
            # I needed the script to skip certain URLs (to avoid
            # unproductive spidering, among other things).
            next if $1 =~ /WebGate|webspawner\.com|webring\.ne\.jp|netscape\.com|yahoo\.com|#/i;
            my $new  = $1;
            my $seen = join(' ', @urls);
            # mask '?' so it is not treated as a regex metacharacter
            $new  =~ tr/\?/Q/;
            $seen =~ tr/\?/Q/;
            unless ($seen =~ /\Q$new\E/i) {
                push(@urls, $new);
                print LOG "$new\n";
            }
        }
    }
}
"Martin Moss" <[EMAIL PROTECTED]> on 2000/05/17 02:44:58
Please reply to "Martin Moss" <[EMAIL PROTECTED]>
To: "Perl-Win32-Users Mailing List" <[EMAIL PROTECTED]>
cc: (bcc: Atlas21 Shawn/OIPQA/Canon Inc/JP)
Subject: Web HTTP Robots