Does anybody have any thoughts or recommendations on which of the various WWW robot modules around would allow 'realtime saving of state'? That is, if the Perl script crashes and I start it over, it will resume from the point at which it died.

I need to populate a database with the link information ('from link' to 'to link') so that I can then run a reporting tool on the database. Basically, I need to create a database map of which links go to which page, for every page on an individual internet site.

Speed is important, so whilst the robot must be 'nice', it doesn't have to have the manners of a prince! Also, multiple robots can be operating on the site at any given time: for example, one robot would stay within XXX.domain.com and another might map YYY.domain.com, but I may want more than one robot to work on www.domain.com.

I have a Linux box and database access at my disposal.

Regards,
Marty
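(One way to get the 'realtime saving of state' you describe, independent of which robot module you pick, is to write every discovered link and every queue change to the database the moment it happens, so a restarted crawler just reads the queue back. The sketch below shows the idea in Python with SQLite; all table, column, and function names here are illustrative, not the API of any robot module.)

```python
import sqlite3

# Use a file path instead of ":memory:" for real crash-survivable persistence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS links (from_url TEXT, to_url TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

def enqueue(url):
    # INSERT OR IGNORE makes re-adding a known URL harmless, so several
    # robots can feed the same queue table without duplicating work.
    conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    conn.commit()

def next_url():
    # After a restart, this simply picks up the first unfetched URL.
    row = conn.execute("SELECT url FROM queue WHERE done = 0 LIMIT 1").fetchone()
    return row[0] if row else None

def record_links(from_url, to_urls):
    # Store the 'from link' -> 'to link' rows and mark the page fetched,
    # then commit: once this returns, a crash loses nothing.
    conn.executemany("INSERT INTO links VALUES (?, ?)",
                     [(from_url, t) for t in to_urls])
    conn.execute("UPDATE queue SET done = 1 WHERE url = ?", (from_url,))
    conn.commit()

enqueue("http://www.domain.com/")
url = next_url()
record_links(url, ["http://www.domain.com/a", "http://www.domain.com/b"])
enqueue("http://www.domain.com/a")
```

The same pattern works from Perl via DBI; the point is only that the queue lives in the database rather than in the process's memory.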
Hi,
Don't know if this is _exactly_ what you are looking for, but I found it at perlmonks.org.
"Works for me..."
sG
#!/usr/local/bin/perl -w
use strict;
use LWP::Simple;

my $page = "http://url.to.search/";
my @urls;    # queue of URLs discovered so far
my @visit;   # URLs already fetched
my ($doc, $url, $pattern, $print);

open(OUT,   ">>LOG.borders")         or die "Could not open file: $!";
open(VISIT, ">>LOG.visited.borders") or die "Could not open file: $!";
open(LOG,   ">>LOG.urls.borders")    or die "Could not open file: $!";

&get_urls;

## fetch and parse each discovered page; get_urls() pushes new
## URLs onto @urls, and foreach picks them up as the list grows
foreach $url (@urls) {
    my $visited = join(' ', @visit);
    $visited =~ tr/\?/Q/;
    next if $visited =~ /\Q$url\E/i;   # skip pages already fetched
    $url =~ tr/Q/\?/;                  # restore the '?' masked earlier
    push(@visit, $url);
    print VISIT "$url\n";
    $page  = $url;
    $print = get($url);
    print "Getting $url...\n";
    &get_urls;
    ## log each page matching one of the search terms
    foreach $pattern ("I_want_this_term", "THING B", "THING C", "THING D") {
        if (defined $print && $print =~ /($pattern)/i) {
            print OUT "$1, $url\n";
        }
    }
}

close(LOG);
close(VISIT);
close(OUT);
print "\nDone!!!\n";

sub get_urls {
    ## find all links within the current page
    $doc = get($page);
    return unless defined $doc;
    foreach my $word (split /\s/, $doc) {
        if ($word =~ /href="(http:\/\/[^"]+)"/i) {
            # I needed the script to skip certain URLs (to avoid
            # unproductive spidering, among other things).
            next if $1 =~ /WebGate|webspawner\.com|webring\.ne\.jp|netscape\.com|yahoo\.com|#/i;
            my $new  = $1;
            my $seen = join(' ', @urls);
            # mask '?' so it is not treated as a regex metacharacter
            $new  =~ tr/\?/Q/;
            $seen =~ tr/\?/Q/;
            unless ($seen =~ /\Q$new\E/i) {
                push(@urls, $new);
                print LOG "$new\n";
            }
        }
    }
}
"Martin Moss" <[EMAIL PROTECTED]> on 2000/05/17 02:44:58
Please reply to "Martin Moss" <[EMAIL PROTECTED]>
To: "Perl-Win32-Users Mailing List" <[EMAIL PROTECTED]>
cc: (bcc: Atlas21 Shawn/OIPQA/Canon Inc/JP)
Subject: Web HTTP Robots