On 26/01/2013 20:44, Jeswin wrote:
> Hi,
> I'm trying to parse out the emails addresses from a webpage and I'm
> using the HTML::TreeBuilder::XPath module. I don't really understand
> XML and it's been a while since I worked with perl*. So far I mashed
> up a code by looking through past examples online. The HTML portion
> for the email is like:
>
> <li class="ii">Email: <a href="mailto:n...@place.edu">n...@place.yyy</a></li>
>
> The code I put together is:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use HTML::TreeBuilder::XPath;
>
> my $html = HTML::TreeBuilder::XPath->new;
> my $root = $html->parse_file( 'file.htm' );
>
> my @email = $root ->findnodes(q{//a} );
>
> for my $email(@email) {
>
>     print $email->attr('href');
> }
>
> The problem is that it also outputs the link found in another portion
> of the HTML ( <a href="http://sites.place.yyy/name">). So I get a list
> of websites and emails, one after another. How can I just output the
> email section?
>
> I also don't understand how the path for "findnodes(q{//a} )" works.
> What's the "q" for? How do I understand the structure of nodes?
>
> Thanks for any advice,
> JJ
>
> *I'm not a programmer; I have a list to compile for work and thought I
> might automate it to make my life easier.
Hi Jeswin

q{...} is just another way of writing single quotes. '//a' will do just
fine. The // means descendant, so '//a' finds any <a> element beneath
the root of the document.

I'm not sure what you mean by "How do I understand the structure of
nodes?" Do you know any HTML? If not then this is going to be very
difficult.

Since you're using HTML::TreeBuilder::XPath there are some easier
options open to you. You can write

    my $html = HTML::TreeBuilder::XPath->new_from_file('file.htm');
    my @links = $html->findnodes_as_strings('//@href[starts-with(., "mailto:")]');

which will find all the href="..." attributes that start with 'mailto:'
and put their values into @links. Then you can print them all out using

    print "$_\n" for @links;

If you want to go further and remove the 'mailto:' from the beginning,
then it's just

    for (@links) {
        my $mail = s/^mailto://r;
        print "$mail\n";
    }

HTH,

Rob
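For anyone who wants to try the two Perl points from this reply without an HTML file to hand, here is a minimal sketch using core Perl only. It checks that q{//a} and '//a' are the same string, and strips the 'mailto:' prefix with s///r. The addresses are made-up sample values standing in for whatever findnodes_as_strings would return, and strip_mailto is a hypothetical helper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;    # the /r flag on s/// needs Perl 5.14 or later

# Hypothetical helper: returns copies of its arguments with a leading
# 'mailto:' removed. Strings without that prefix come back unchanged.
sub strip_mailto {
    return map { s/^mailto://r } @_;
}

# q{...} is only an alternative delimiter for a single-quoted string.
my $verdict = q{//a} eq '//a' ? 'the same string' : 'different strings';
print "q{//a} and '//a' are $verdict\n";

# Sample data (an assumption), in place of real href values from a page.
my @links = ( 'mailto:alice@example.com', 'mailto:bob@example.com' );

print "$_\n" for strip_mailto(@links);
```

Because /r returns a modified copy instead of changing its operand, @links still holds the original 'mailto:' values afterwards.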