[Nutch-dev] problem parsing HTML

Ian Holsman Thu, 12 Apr 2007 17:05:41 -0700

Hi.

I'm trying to figure out how Nutch actually extracts the links from
a piece of HTML.

I'm confused about what roles TagSoup, NekoHTML, and parse-html
play in all this.

From what I can see, the regular expression it uses to extract the
link is slightly off, but I'm not sure where it actually does this.

The fragment in question is this:

<a href="#|"
onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
":NewsMaker: National, Political, World, Breaking News and More :" +
nm_cur["newsmaker80631"] + " of 8";
t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);
s_lnk=s_co(this);s_gs(s_account2);return false;' id="newsmaker80631.pre">
<img border="0" src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" width="25"
height="21" alt="Prev"/></a>

and it is attempting to fetch ;s_account2=(t[0].indexOf( as if it were a link.
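
As a quick sanity check outside Nutch, the sketch below feeds an abbreviated
copy of the fragment to Python's standard html.parser and lists the href, src,
and onclick attributes it sees. This is only an illustration: it is not the
parser Nutch uses, the class name LinkAttrs and the variable names are made up
for the example, and the onclick body is shortened. It suggests that the odd
string does not come from href or src, but sits inside the onclick JavaScript,
so whatever step scans script text for URLs is the likely place to look.

```python
from html.parser import HTMLParser

# Abbreviated copy of the fragment from the post (onclick body shortened).
FRAGMENT = (
    '<a href="#|" '
    "onclick='t=s_account.split(\",\");s_account2="
    "(t[0].indexOf(\"aolsvc\")==-1?t[0]:t[1]);return false;' "
    'id="newsmaker80631.pre">'
    '<img border="0" src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" '
    'width="25" height="21" alt="Prev"/></a>'
)

# The string the crawler reportedly treats as a link.
SUSPECT = ';s_account2=(t[0].indexOf('


class LinkAttrs(HTMLParser):
    """Hypothetical helper: records href, src, and onclick attributes."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img .../>, because the
        # default handle_startendtag() delegates to handle_starttag().
        for name, value in attrs:
            if name in ("href", "src", "onclick"):
                self.found.append((tag, name, value))


collector = LinkAttrs()
collector.feed(FRAGMENT)
collector.close()

# Report, per attribute, whether it contains the suspect string.
for tag, name, value in collector.found:
    print(tag, name, SUSPECT in value)
```

If only the onclick attribute contains the suspect string, the anchor
extraction itself (href and src) is probably fine, and the problem is in the
handling of inline JavaScript.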



TIA
Ian

--
Ian Holsman
[EMAIL PROTECTED]

