In article <[EMAIL PROTECTED]>, Lorne Easton wrote:
> I need to write some code that extracts hyperlinks from a
> scalar ($data) and puts them into an array.
> 
> I imagine that grep can do this, but my mastery of it and of
> regular expressions is not brilliant.
> 
> Can you please provide some example code, or at least point me in the right
> direction?

 If you only need the URLs of the hyperlinks, then HTML::LinkExtor is
 just what you need, and it ships with the HTML::Parser distribution.
 HTML::SimpleLinkExtor might be worth a try too.

 http://search.cpan.org/search?dist=HTML-SimpleLinkExtor
 http://search.cpan.org/search?dist=HTML-Parser
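
 For the URL-only case, a minimal sketch with HTML::LinkExtor could look
 like the following. The extract_urls helper name and the sample markup
 are made up for illustration; the callback receives the tag name
 followed by attribute/URL pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: returns the href of every <a> tag found in $html
sub extract_urls {
    my $html = shift;
    my @urls;

    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            # LinkExtor also reports <img>, <link>, etc.; keep only <a href>
            return unless $tag eq 'a' && defined $attr{href};
            push @urls, $attr{href};
        }
    );
    $extor->parse($html);
    $extor->eof;
    return @urls;
}

my $data = '<p><a href="http://foo">bar</a> <a href="http://baz">quux</a></p>';
print "$_\n" for extract_urls($data);
```

 Filtering on the tag name inside the callback keeps the result limited
 to hyperlinks; without it, image and stylesheet URLs would be collected
 as well.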

 Otherwise, if you want the URLs and the text inside, something like the
 following might work:

#!/usr/bin/perl -w
use strict;
use HTML::Parser 3;

my $data = <<'_HTML_';
        <p><a href="http://foo">bar</a><br>
        foo text baz
        <a href="http://baz">quux</a></p>
_HTML_

my @links = parse_links($data);

# We now print the links we found
my $count = 0;
foreach (@links) {
    print ++$count . ". Description: $_->[1]\n   URL: $_->[0]\n\n";
}


sub parse_links {
    my $data = shift;

    # $current holds the [ URL, text ] pair of the <a> element being parsed
    my ( @links, $current );

    # Preparing the parser
    my $linkparser = HTML::Parser->new(
        report_tags   => ['a'],    # Only dealing with <A> tags
        unbroken_text => 1,        # Avoid text delivered in several chunks

        # Called each time an <A ...> is found
        start_h => [
            sub {
                # Start a new pair with the HREF attribute and empty text
                $current = [ shift->{href}, '' ];
            },
            'attr'
        ],

        # Called when </A> is found
        end_h => [
            sub {
                # Keep the pair only if the <a> actually had an HREF
                push @links, $current if $current && defined $current->[0];
                undef $current;
            },
            ''
        ],

        # Called when text is found
        text_h => [
            sub {
                # We're only interested in text inside <a>...</a>
                return unless $current;
                # Append the text to the description of the current link
                $current->[1] .= shift;
            },
            'dtext'
        ],
    );

    # Launch the parser
    $linkparser->parse($data)->eof();
    return wantarray ? @links : \@links;
}

__END__

-- 
briac
        A flying swallow. 
        A fox stalks under a she-oak. 
        A nesting dove.
