Re: trying to understand HTML::TreeBuilder::XPath

Jim Gibson Sat, 26 Jan 2013 15:52:50 -0800

On Jan 26, 2013, at 12:44 PM, Jeswin wrote:

> Hi,
> I'm trying to parse out the email addresses from a webpage and I'm
> using the HTML::TreeBuilder::XPath module. I don't really understand
> XML and it's been a while since I worked with Perl. So far I have put
> together some code by looking through past examples online. The HTML
> portion for the email is like:
> 
> <li class="ii">Email: <a href="mailto:n...@place.edu">n...@place.yyy</a></li>
> 
> The code I put together is:
> 
> #!/usr/bin/perl
> use strict;
> use warnings;
> 
> use HTML::TreeBuilder::XPath;
> 
> my $html  = HTML::TreeBuilder::XPath->new;
> my $root  = $html->parse_file( 'file.htm' );
> 
> my @email = $root ->findnodes(q{//a} );
> 
> for my $email(@email) {
> 
> print $email->attr('href');
> }
> 
> The problem is that it also outputs the link found in another portion
> of the HTML ( <a href="http://sites.place.yyy/name">). So I get a list
> of websites and emails, one after another. How can I just output the
> email section?
> 
> I also don't understand how the path for "findnodes(q{//a} )" works.
> What's the "q" for? How do I understand the structure of nodes?


I have not used the HTML::TreeBuilder::XPath module, and I am not familiar with 
XPath in general, so I can't explain what you are getting from the findnodes 
method or describe the structure of nodes. You may have to learn XPath to 
understand what the module is doing.

Perhaps you should be using a simpler module. I have been using 
HTML::TokeParser for a similar task.
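
For illustration, here is a rough sketch of how the same task might look with HTML::TokeParser. The helper name mailto_addresses and the sample HTML (including the addresses in it) are made up for this example; in a real script you would pass a file name to new() instead of a reference to a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper: return the addresses from all mailto: links
# found in an HTML string.
sub mailto_addresses {
    my ($html) = @_;

    # A reference to a scalar is treated as the literal document to parse.
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";

    my @found;

    # get_tag('a') skips ahead to the next <a> start tag and
    # returns undef once the input is exhausted.
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};    # element 1 is the attribute hash
        next unless defined $href;
        push @found, $1 if $href =~ /^mailto:([^?\s]+)/i;
    }
    return @found;
}

# Made-up sample input, modelled on the snippet in the question.
my $sample = <<'HTML';
<ul>
<li class="ii">Email: <a href="mailto:someone@example.edu">someone@example.edu</a></li>
<li class="ii">Web: <a href="http://sites.example.edu/name">home page</a></li>
</ul>
HTML

print "$_\n" for mailto_addresses($sample);
```

The token-based approach never builds a tree, so there is no node structure to understand: you just walk the <a> tags in document order and look at their attributes.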

However, if your program is successfully finding all of the <a> tag sections of 
the web page, and your only problem is distinguishing between email links and 
other types of links, you can use regular expressions to detect mailto links:

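my $link = $email->attr('href');
# <a> tags without an href return undef; the class allows dots so the domain is captured
if( defined $link && $link =~ /^mailto:([\w.@+-]+)/ ) {
  print "Email address is '$1'\n";
}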
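
To try the matching on its own, without parsing any HTML, something like the following sketch may help. It uses only core Perl. The helper name mailto_address and the sample href values are invented for the example, and the pattern is a slightly looser variant that takes everything up to an optional "?subject=..." part rather than validating the address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the address from a mailto: href,
# or undef for any other kind of link.
sub mailto_address {
    my ($href) = @_;
    return undef unless defined $href;

    # Capture everything after "mailto:" up to a "?" or whitespace.
    return $href =~ /^mailto:([^?\s]+)/i ? $1 : undef;
}

# Made-up href values for illustration.
my @hrefs = (
    'mailto:someone@example.edu',
    'http://sites.example.edu/name',
    'mailto:other.person@example.edu?subject=Hello',
);

for my $href (@hrefs) {
    my $address = mailto_address($href);
    print "Email address is '$address'\n" if defined $address;
}
```

Anchoring the pattern with ^ keeps it from matching a URL that merely contains the text "mailto:" somewhere in the middle.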

As for the q{//a} construct: q is Perl's single-quote operator, which defines 
a literal string with no variable interpolation, so q{//a} is equivalent to 
'//a'. The string itself is an XPath expression: "//" is shorthand for the 
"descendant-or-self" axis, so //a selects every <a> element anywhere in the 
document. Beyond that, as I said, I do not know XPath.
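
As a hedged sketch of both points: the script below first checks that q{//a} and '//a' are the same string, then narrows the XPath expression with a predicate so that only mailto links are returned. It assumes the starts-with() function from XPath 1.0 is available through the module's XPath engine, and the sample HTML and its addresses are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# q{...} and '...' build the same literal string.
print "q{//a} and '//a' are the same string\n" if q{//a} eq '//a';

# Made-up sample input, modelled on the snippet in the question.
my $sample = <<'HTML';
<ul>
<li class="ii">Email: <a href="mailto:someone@example.edu">someone@example.edu</a></li>
<li class="ii">Web: <a href="http://sites.example.edu/name">home page</a></li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($sample);

# //a                        every <a> element anywhere in the document
# [starts-with(@href, ...)]  keep only those whose href begins with "mailto:"
my @links = $tree->findnodes(q{//a[starts-with(@href, "mailto:")]});

my @hrefs = map { $_->attr('href') } @links;
print "$_\n" for @hrefs;

# Free the tree when done with it.
$tree->delete;
```

Filtering in the XPath expression and filtering afterwards with a regular expression should give the same set of links here; which one reads better is mostly a matter of taste.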



--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/
