Re: minor changes in a parser

jerry Fri, 10 Nov 2017 11:59:52 -0800

This kind of stuff is trivial in Perl.  You've chosen a good language.


$url =~ s|/bar$||;

...which means: "Find any occurrence of "/bar" at the very end of the URL and replace it with nothing." This is called a "regex" (short for "regular expression"). We usually write regexes with forward slashes as delimiters, but you can use other characters (like "|") when the target string contains forward slashes.
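As a small illustration of the delimiter point, here is a sketch that runs the same substitution twice, once with "|" and once with escaped forward slashes. The variable names $with_pipes and $with_slashes are made up for this example; the URL is the one from the question.

```perl
use strict;
use warnings;

my $url = "http://www.foo.com/bar";   # example URL from the question

# Copy $url, then strip a trailing "/bar" from the copy.
# With "|" as the delimiter the slash needs no escaping.
(my $with_pipes = $url) =~ s|/bar$||;

# The same substitution with "/" delimiters: the slash must be escaped.
(my $with_slashes = $url) =~ s/\/bar$//;

print "$with_pipes\n";
print "$with_slashes\n";
```

Both forms give the same result; the "|" version is just easier to read when the pattern itself contains slashes.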


  Do a web search for "Perl regex".
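To keep only the links that end in "/bar" and strip that part off, something along these lines might work. It is only a sketch: the @links list is hypothetical sample data standing in for what the crawler collects, @stripped is a name invented here, and it assumes "/bar" sits at the very end of the URL.

```perl
use strict;
use warnings;

# Hypothetical sample links standing in for what the crawler collects.
my @links = (
    "http://www.foo.com/bar",
    "http://www.foo.com/baz",
    "http://www.example.org/bar",
);

my @stripped;
foreach my $link (@links) {
    # Skip anything that does not end in "/bar".
    next unless $link =~ m|/bar$|;

    # Work on a copy so the original link is left untouched.
    (my $base = $link) =~ s|/bar$||;
    push @stripped, $base;
}

print "$_\n" for @stripped;
```

If "/bar" can also appear in the middle of a URL, drop the "$" anchor from both patterns.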

                - Jerry Kaidor





On 11/10/2017 06:00, Martin Kaspar wrote:

hello dear perl-experts,

I'm pretty new to Programming and OO programming especially.
Nonetheless, I'm trying to get done a very simple Spider for web
crawling.

The script below is what I got to work.

It runs nicely. Now I want to modify the script a bit; tailoring and
tinkering is the way to learn. I want to fetch URLs that have a certain
string in them, for example:

"http://www.foo.com/bar";

In other words, I need to fetch all the URLs that contain the term "/bar",
and then strip the "/bar" so that only the base URL remains:
http://www.foo.com

is this doable?

love to hear from you
Martin

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
select($file1);

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL

my @urls = ('https://the url goes in here');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};

  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}

On Wed, Oct 4, 2017 at 10:49 PM, Dan Book <gri...@gmail.com> wrote:

How can we proceed from here?
-Dan

On Mon, Sep 18, 2017 at 1:17 PM, Patrick M. Galbraith
<p...@patg.net> wrote:

Pali,

Great! Now we can start moving forward.

Sorry if my responses have been intermittent - first week at new
job.

Regards,

Patrick

On 9/16/17 4:35 AM, p...@cpan.org wrote:

I prepared branch master-new, which is based on current DBD-mysql
master
branch and revert state to pre-4.043 version, including all changes
done
after 4.043 release to master branch. I have this master-new branch
in
my fork. If you want you can use it...

https://github.com/pali/DBD-mysql/tree/master-new [1]


--

  [2]  [3]

Links:
------
[1] https://github.com/pali/DBD-mysql/tree/master-new
[2] http://www.facebook.com/martin.kaspar.547
[3] https://plus.google.com/u/0/104428351748591530426
