On Jan 26, 2013, at 12:44 PM, Jeswin wrote:

> Hi,
> I'm trying to parse out the emails addresses from a webpage and I'm
> using the HTML::TreeBuilder::XPath module. I don't really understand
> XML and it's been a while since I worked with perl*. So far I mashed
> up a code by looking through past examples online. The HTML portion
> for the email is like:
>
> <li class="ii">Email: <a href="mailto:n...@place.edu">n...@place.yyy</a></li>
>
> The code I put together is:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use HTML::TreeBuilder::XPath;
>
> my $html = HTML::TreeBuilder::XPath->new;
> my $root = $html->parse_file( 'file.htm' );
>
> my @email = $root ->findnodes(q{//a} );
>
> for my $email(@email) {
>
>     print $email->attr('href');
> }
>
> The problem is that it also outputs the link found in another portion
> of the HTML ( <a href="http://sites.place.yyy/name">). So I get a list
> of websites and emails, one after another. How can I just output the
> email section?
>
> I also don't understand how the path for "findnodes(q{//a} )" works.
> What's the "q" for? How do I understand the structure of nodes?
I have not used the HTML::TreeBuilder::XPath module, and I am not familiar with XPath in general, so I can't explain in detail what you are getting from the findnodes method or describe the structure of nodes. You may have to learn some XPath to understand what the module is doing. Alternatively, a simpler module may suit you better; I have been using HTML::TokeParser for a similar task.

However, if your program is successfully finding all of the <a> elements on the page, and your only problem is telling email links apart from other links, you can test each href with a regular expression and print only the mailto links:

    my $link = $email->attr('href');
    if ( defined $link && $link =~ /^mailto:([^?\s]+)/ ) {
        print "Email address is '$1'\n";
    }

The pattern captures everything after "mailto:" up to any "?subject=..." suffix, so addresses containing dots or hyphens are matched in full.

As for the q{//a} construct: q is Perl's generic single-quote operator, which defines a literal string without interpolation. q{//a} is equivalent to '//a'. The string itself is an XPath expression. As far as I can tell, the leading // is XPath shorthand for "descendant-or-self", so //a selects every <a> element anywhere in the document, which is why you get the website links as well as the email links.
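To make the filtering step concrete, here is a minimal sketch that uses only core Perl. The helper name mailto_address and the sample href values are made up for illustration; in your program the strings would come from $email->attr('href') instead of a hard-coded list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): return the address part
# of a mailto: link, or undef for any other kind of link.
sub mailto_address {
    my ($href) = @_;
    return undef unless defined $href;
    # Capture everything after "mailto:" up to an optional "?subject=..." part.
    return $href =~ /^mailto:([^?\s]+)/i ? $1 : undef;
}

# Made-up href values standing in for what $email->attr('href') returns.
my @hrefs = (
    'http://sites.place.example/name',
    'mailto:name@place.example',
    'mailto:first.last@dept.place.example?subject=Hello',
);

# Keep only the mailto: links and strip them down to the bare address.
my @addresses = grep { defined } map { mailto_address($_) } @hrefs;

print "Email address is '$_'\n" for @addresses;
```

Inside your existing loop, the same helper could be called as mailto_address( $email->attr('href') ), printing only when the result is defined. Keeping the test in one small sub means the pattern can be tightened later without touching the loop.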