minor changes in a parser

Martin Kaspar Fri, 10 Nov 2017 12:02:10 -0800

Hello dear Perl experts,


I'm pretty new to programming, and to OO programming especially.
Nonetheless, I'm trying to write a very simple spider for web crawling.

The script below is what I have working so far.

It runs nicely. Now I want to modify the script a bit, since tailoring and
tinkering is the way to learn. I want to fetch URLs that have a certain
string in them, for example:

http://www.foo.com/bar


In other words, I need to find all the URLs that contain the term "/bar"
and then strip the "/bar" from each one, so that only the base URL remains:
http://www.foo.com


Is this doable?

Love to hear from you,
Martin




#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
# Append to links.txt and stop right away if it cannot be opened
open my $file1, '>>', 'links.txt'
  or die "Cannot open links.txt: $!";

# The URL I want it to start at
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('https://the url goes in here');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before; otherwise mark it as visited now, so
  # that a failing URL is not requested again
  next if $visited{$url}++;

  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting. Report the error on STDERR (not in
  # links.txt) and move on, since there is nothing to parse
  if ($response->is_error()) {
    warn "$url: ", $response->status_line, "\n";
    next;
  }
  my $contents = $response->decoded_content();
  next unless defined $contents;

  my $page_parser = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  # Each entry is [tag, attribute => URL, ...], so element 2
  # is the first URL found in that tag
  foreach my $link (@links) {
    print {$file1} "$link->[2]\n";
    push @urls, "$link->[2]";
  }
  sleep 60;  # be polite: wait a minute between requests
}

close $file1;
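
The filtering asked about above could look roughly like the following.
This is only a minimal sketch under stated assumptions: strip_bar is a
hypothetical helper name, the sample links other than
http://www.foo.com/bar are made up for illustration, and "contains /bar"
is read as "the path is exactly /bar" (optionally followed by a slash,
query string or fragment). If "/bar" may appear elsewhere in the URL,
the pattern would need to change.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the URL with its "/bar" path removed,
# or nothing (undef in scalar context) if the path is not "/bar".
sub strip_bar {
    my ($url) = @_;
    # Capture scheme://host, then require "/bar", an optional
    # trailing slash, and an optional query string or fragment.
    if ($url =~ m{^(https?://[^/?#]+)/bar/?(?:[?#].*)?$}) {
        return $1;
    }
    return;
}

# Made-up sample links, standing in for what the crawler extracts
my @links = (
    'http://www.foo.com/bar',
    'http://www.foo.com/baz',
    'https://example.org/bar/',
    'http://www.foo.com/barn',
);

# In list context a non-matching link contributes nothing, so
# @bases ends up holding only the stripped matches.
my @bases = map { strip_bar($_) } @links;
print "$_\n" for @bases;
```

In the crawler itself, the same helper could be called on each extracted
link (stringified first) inside the foreach loop, printing the result
only when it is defined.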



