I see from researching the archives of this list that people have
succeeded in getting HTTP::Cookies to work with a login, along with
HTML::Form.
Maybe someone can suggest some methods to me here.
I will be grateful for any help I can get.
Thanks in advance,
Mike Clark
[EMAIL PROTECTED]
Toll Free 888 999 2181
Here is the project:
We have developed a simple perl spider which executes from a command prompt,
and it accepts a list of urls, then downloads the web pages to a directory.
I want to adapt it to spider password-protected ASP pages. We have
purchased a membership to this site, but the download is too slow.
When I log in manually with a browser, it sets a cookie, and then at any time
during that browser session any url entered separately in the location field
will work -- that is, the site does not require a specific referer page, only
the cookie. When I disable cookies in Netscape, the login does not work.
When I have a browser session open in Internet Explorer, any url pasted into
the location field works, but if I open a new browser window, the pasted
url does not work in the new window. However, when I log in with
user/password in a second browser window (either Netscape or Explorer), then
urls pasted into the second browser window do work.
Conclusion: the site is reading the cookie from the specific browser session.
Project: first the script has to log in and get the cookie set, then it has to
download a list of urls from the site.
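To make the session-cookie idea concrete, here is a rough offline sketch of what I understand an HTTP::Cookies jar does when it is attached to the user agent. The host www.example.com, the cookie ASPSESSIONID=abc123 and the page path are made up for illustration; the real site's cookie name and urls are not known here. Nothing is sent over the network: the login reply is faked so the jar's bookkeeping can be seen.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# An in-memory jar lasts as long as the script runs, like one browser session.
my $jar = HTTP::Cookies->new;
my $ua  = LWP::UserAgent->new(agent => 'OurBot/1.0', cookie_jar => $jar);

# Made-up values: pretend the login POST came back with a session cookie.
# During a real run $ua calls these jar methods itself on every request.
my $login = HTTP::Request->new(POST => 'http://www.example.com/main.asp');
my $reply = HTTP::Response->new(200, 'OK');
$reply->header('Set-Cookie' => 'ASPSESSIONID=abc123; path=/');
$reply->request($login);          # the jar needs to know which url set the cookie
$jar->extract_cookies($reply);    # store the cookie from the reply

# Any later request to the same host gets the stored cookie attached.
my $page = HTTP::Request->new(GET => 'http://www.example.com/members/page1.asp');
$jar->add_cookie_header($page);
print $page->header('Cookie'), "\n";
```

If this is right, the spider would only need to create $ua once with cookie_jar set, log in through it, and then reuse the same $ua for every url in the list.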
This is the login form:
<form name="login" method="post" action="main.asp" onSubmit="validate();" >
Enter Email ID <input type="text" name="email" size="25" maxlength="50"><br>
Enter Password <input type="password" name="password" size="25"
maxlength="30" ><br>
<input type ="hidden" name ="Browser" value ="">
<input type="hidden" name="submitted" value="Y">
<input type="submit" name="Login" value="Submit">
<SCRIPT language="JavaScript">
if (document.location.search == "?message=Y"){
document.write("ID/Password not found. Please register or try again.");
}
</script>
</form>
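As a starting point, the form above could probably be submitted with HTTP::Request::Common by sending the same field names the browser would send. This is only a sketch: the login page address, email and password below are placeholders, and I am assuming the site accepts an empty Browser field, since the validate() JavaScript (which LWP does not run) may normally fill it in.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common qw(POST);

# Placeholder values; substitute the real login page address and account.
my $login_page = 'http://www.example.com/login.asp';
my $email      = 'user@example.com';
my $password   = 'secret';

# action="main.asp" is relative, so resolve it against the login page url.
my $action = URI->new_abs('main.asp', $login_page);

# One entry per <input> in the form, using the names from the HTML.
my $request = POST $action, [
    email     => $email,
    password  => $password,
    Browser   => '',         # hidden field, empty in the HTML (assumed acceptable)
    submitted => 'Y',        # hidden field
    Login     => 'Submit',   # the submit button itself
];

# Show what would be sent, without contacting any server.
print $request->as_string;
```

Passing this request to $ua->request($request) on a user agent that has a cookie_jar should let the cookie from the reply be stored for the later GETs.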
Here is the existing script:
#!/usr/bin/perl
require LWP::UserAgent;
require HTTP::Request;
require HTTP::Response;
use HTTP::Request::Common;
# Expect exactly two arguments: the url list file and the output directory.
die "Usage: $0 inputfile outdir\n" unless @ARGV == 2;
($inputfile, $outdir) = @ARGV;
print "Welcome\n";
print "Opening inputfile... ";
open (LINKFILE,"$inputfile") or die "Couldn't open the inputfile, $!";
@links = <LINKFILE>;
close(LINKFILE);
print "Success!\n";
# unless (-e $outdir){
# print "Directory doesn't exist... Creating\n";
# mkdir "$outdir", 755 or die "Couldn't make directory, $!";
# }
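# Create the output directory if it is missing (the mode must be octal: 0755).
if (-d $outdir){
print "Output directory exists!\n";
}
else{
mkdir $outdir, 0755 or die "Couldn't make directory, $!";
print "Output directory created!\n";
}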
print "Changing directory... ";
chdir "$outdir" or die "Couldn't change directory, $!";
print "Success!\n";
# Check to see if we hung up last time
# this doesn't resume, just warns you that it stopped somewhere
# in earlier versions of the program i had problems with the
# program hanging, but I don't know why.
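if (-e "spiderlog.txt"){
open (LOG,"spiderlog.txt") or die "Couldn't open spiderlog.txt, $!";
@spiderlog = reverse <LOG>;
close(LOG);
# chomp returns the number of characters removed, not the line, so copy first
chomp($lastline = $spiderlog[0]);
if ($lastline ne "Done"){
print "Spider not finished... Last line in log says: $lastline\n";
}
}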
$filenum = 1;
$ua = LWP::UserAgent->new;
$ua->agent('OurBot/1.0');
print "Start spidering process...\n\n";
$total = @links;
$start = time();
open (LOG,">>spiderlog.txt") or die "Couldn't open spiderlog.txt, $!";
print LOG "Started at: $start\n\n";
foreach $line (@links){
chomp $line; # strip the newline so it is not sent as part of the url
next unless length $line; # skip blank lines in the input file
print "Getting $line\n";
$response = $ua->request(GET $line);
if ($response->is_success) {
$name = sprintf("%04d", $filenum); # zero-pad to four digits: 0001, 0002, ...
open (NEWPAGE,">$name.html") or die "Couldn't write $name.html, $!";
print NEWPAGE $response->content;
close (NEWPAGE);
print "$name.html generated\n\n";
print LOG "$name - $line\n";
$filenum++;
} else { print $response->error_as_HTML; }
}
$end = time();
$parse = $end - $start;
$parse = 1 unless($parse);
$lps = int($total/$parse);
print "$total lines in $parse seconds ($lps lines/sec)\n";
print LOG "$total lines in $parse seconds\nFinished at $end\nDone\n";
close (LOG);
print "clumping files... \n";
system "cat *.html > masterfile.htm";
print "Done!\n";