From: "Jeswin" <phillyj...@gmail.com>
Hi,
I'm trying to parse out the email addresses from a webpage and I'm
using the HTML::TreeBuilder::XPath module. I don't really understand
XML and it's been a while since I worked with Perl. So far I have cobbled
together some code by looking through past examples online. The HTML portion
for the email is like:
<li class="ii">Email: <a
href="mailto:n...@place.edu">n...@place.yyy</a></li>
The code I put together is:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $html = HTML::TreeBuilder::XPath->new;
my $root = $html->parse_file( 'file.htm' );
my @email = $root->findnodes( q{//a} );
for my $email (@email) {
    print $email->attr('href');
}
The problem is that it also outputs the link found in another portion
of the HTML ( <a href="http://sites.place.yyy/name">). So I get a list
of websites and emails, one after another. How can I just output the
email section?
I also don't understand how the path for "findnodes(q{//a} )" works.
What's the "q" for? How do I understand the structure of nodes?
Thanks for any advice,
JJ
You should use HTML::TreeBuilder::XPath only if you know XPath. Otherwise it
is easier to just use HTML::TreeBuilder.
The code should be:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $html = HTML::TreeBuilder::XPath->new;
my $root = $html->parse_file( 'file.htm' );
#my @email = $root->findnodes( q{//a} );
# Keep only the <a> elements whose href starts with "mailto:".
my @email = $root->look_down( _tag => 'a', sub {
    my $href = $_[0]->attr('href');
    defined $href && $href =~ /^mailto:/;
} );
for my $email (@email) {
    print $email->attr('href'), "\n";
}
Mailto links are ordinary links. The only difference is that the value of
the href attribute starts with mailto:, so you should check whether that
string starts with mailto: using a regular expression.
(The code above is not tested, but it should work if
HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.)
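If you would rather stay with XPath, the same mailto: check can probably be written as a predicate inside the findnodes path. The following is only a sketch: the helper name mailto_hrefs and the sample HTML (with example.org addresses) are made up for illustration, and it has not been run against the real page. It also touches on the "q" question: q{...} is just another way of writing a single-quoted string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Return the href of every <a> whose href starts with "mailto:".
# mailto_hrefs() is a hypothetical helper name, not part of the module.
sub mailto_hrefs {
    my ($html_text) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html_text);
    $tree->eof;    # tell the parser the input is complete

    # q{...} is Perl's single-quote operator: q{//a} is the same
    # string as '//a'. "//a" selects every <a> element anywhere in the
    # document; the [...] predicate keeps only those whose href
    # attribute starts with "mailto:".
    my @links = $tree->findnodes(q{//a[starts-with(@href, "mailto:")]});
    my @hrefs = map { $_->attr('href') } @links;

    $tree->delete;    # free the memory used by the tree
    return @hrefs;
}

# Made-up sample input, modelled on the snippet in the question.
my $sample = <<'HTML';
<ul>
  <li class="ii">Email: <a href="mailto:someone@example.org">someone@example.org</a></li>
  <li class="ii">Web: <a href="http://www.example.org/name">home page</a></li>
</ul>
HTML

print "$_\n" for mailto_hrefs($sample);
```

With the sample above this should print only the mailto: href and skip the http link. To get just the address, strip the prefix with something like s/^mailto://.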
Octavian