Re: trying to understand HTML::TreeBuilder::XPath

Octavian Rasnita Sat, 26 Jan 2013 22:35:05 -0800

From: "Jeswin" <phillyj...@gmail.com>

Hi,
I'm trying to parse out the email addresses from a webpage and I'm
using the HTML::TreeBuilder::XPath module. I don't really understand
XML and it's been a while since I worked with perl*. So far I have
mashed together some code by looking through past examples online. The
HTML portion for the email is like:

<li class="ii">Email: <a href="mailto:n...@place.edu">n...@place.yyy</a></li>


The code I put together is:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $html  = HTML::TreeBuilder::XPath->new;
my $root  = $html->parse_file( 'file.htm' );

my @email = $root ->findnodes(q{//a} );

for my $email(@email) {

print $email->attr('href');
}

The problem is that it also outputs the link found in another portion
of the HTML ( <a href="http://sites.place.yyy/name">). So I get a list
of websites and emails, one after another. How can I just output the
email section?

I also don't understand how the path for "findnodes(q{//a} )" works.
What's the "q" for? How do I understand the structure of nodes?

Thanks for any advice,
JJ

You should use HTML::TreeBuilder::XPath only if you know XPath. Otherwise it is easier to just use HTML::TreeBuilder.


The code should be:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $html  = HTML::TreeBuilder::XPath->new;
my $root  = $html->parse_file( 'file.htm' );

#my @email = $root ->findnodes(q{//a} );

# Keep only <a> elements whose href attribute starts with "mailto:"
my @email = $root->look_down( _tag => 'a', sub {
   $_[0]->attr('href') && $_[0]->attr('href') =~ /^mailto:/;
} );

for my $email (@email) {
   print $email->attr('href'), "\n";
}

There is no difference between ordinary links and mailto links, except that the value of the href attribute starts with mailto: in the case of mailto links, so you should check whether that string starts with mailto: using a regular expression.

(The code above is not tested, but it should work if HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.)
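
If you would rather stay with findnodes, the same mailto: check can probably be written as an XPath predicate. The following is only a minimal sketch: the sample markup and the example.edu addresses are made up for illustration (they stand in for file.htm), and it assumes the XPath engine behind HTML::TreeBuilder::XPath supports the standard XPath 1.0 starts-with() function. It also touches on the "q" question: q{...} is just Perl's single-quote operator, so q{//a} is the same string as '//a'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for file.htm.
my $sample = <<'HTML';
<ul>
  <li class="ii">Email: <a href="mailto:someone@example.edu">someone@example.edu</a></li>
  <li class="ii">Web: <a href="http://sites.example.edu/name">home page</a></li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($sample);

# q{...} quotes like '...', so the XPath can contain double quotes
# (and an @) without any escaping or interpolation.
# //a            : every <a> element anywhere in the document
# [starts-with]  : keep only those whose href begins with "mailto:"
my @links = $tree->findnodes( q{//a[starts-with(@href, "mailto:")]} );

# Each node is an HTML::Element, so attr() works as in the code above.
my @hrefs = map { $_->attr('href') } @links;
print "$_\n" for @hrefs;

# Free the tree once the plain strings have been extracted.
$tree->delete;
```

With this sample, only the mailto: link should be printed and the http:// link should be skipped. To run it against the real page, replace parse_content($sample) with parse_file('file.htm').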


Octavian


--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
