trying to understand HTML::TreeBuilder::XPath

Jeswin Sat, 26 Jan 2013 12:44:48 -0800

Hi,
I'm trying to extract the email addresses from a web page using the
HTML::TreeBuilder::XPath module. I don't really understand XML, and
it's been a while since I worked with Perl*. So far I have cobbled
together some code by looking through past examples online. The HTML
portion for the email looks like this:


<li class="ii">Email: <a href="mailto:n...@place.edu">n...@place.yyy</a></li>

The code I put together is:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $html  = HTML::TreeBuilder::XPath->new;
my $root  = $html->parse_file( 'file.htm' );

my @email = $root->findnodes( q{//a} );

for my $email (@email) {
    print $email->attr('href');
}

The problem is that it also outputs the link found in another part
of the HTML (<a href="http://sites.place.yyy/name">), so I get a list
of websites and emails, one after another. How can I output just the
email addresses?
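
For reference, one possible way to narrow the match is an XPath
predicate that keeps only the links whose href starts with "mailto:".
The following is a minimal sketch, not tested against the real page:
the sample markup, the address someone@example.edu, and the variable
names are placeholders made up for illustration, and it parses a string
instead of 'file.htm' so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical sample markup; the address and URL are placeholders.
my $html = <<'HTML';
<ul>
  <li class="ii">Email: <a href="mailto:someone@example.edu">someone@example.edu</a></li>
  <li class="ii">Web: <a href="http://sites.example.edu/name">home page</a></li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# The predicate in [...] keeps only <a> elements whose href
# attribute starts with "mailto:".
my @links = $tree->findnodes( q{//a[starts-with(@href, "mailto:")]} );

my @addresses;
for my $link (@links) {
    my $href = $link->attr('href');
    $href =~ s/^mailto://;    # drop the scheme, leaving the bare address
    push @addresses, $href;
}

print "$_\n" for @addresses;

$tree->delete;    # release the parse tree
```

With the sample markup above, this should print the email address and
skip the web link.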

I also don't understand how the path for "findnodes(q{//a} )" works.
What's the "q" for? How do I understand the structure of nodes?
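
As far as I can tell from the Perl documentation, q{...} is just
another way of writing a single-quoted string, so q{//a} is the same
as '//a'. In XPath, a leading // means "anywhere below the root", so
//a selects every <a> element in the document. A small sketch using
core Perl only (the variable names are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# q{...} and '...' produce the same string; nothing is interpolated.
my $with_q      = q{//a};
my $with_quotes = '//a';

print "same string\n" if $with_q eq $with_quotes;

# q{} is convenient when the XPath itself contains quote characters
# or an @, which a double-quoted string would try to interpolate.
my $nested = q{//li[@class="ii"]/a};

# Read left to right: any <li> whose class attribute is "ii",
# then each <a> that is a direct child of that <li>.
print "$nested\n";
```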

Thanks for any advice,
JJ

*I'm not a programmer; I have a list to compile for work and thought I
might automate it to make my life easier.

