Hi. I'm trying to figure out how Nutch actually extracts the links from a piece of HTML.
I'm confused about what roles TagSoup, NekoHTML, and parse-html each play in all this. From what I can see, the regular expression it uses to extract the link is slightly off, but I'm not sure where in the code it actually does this.

The fragment in question is this:

<a href="#|" onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID + ":NewsMaker: National, Political, World, Breaking News and More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2= (t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs (s_account2);return false;' id="newsmaker80631.pre"><img border="0" src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21" alt="Prev"/></a>

and it is attempting to find

;s_account2=(t[0].indexOf(

TIA
Ian

--
Ian Holsman
[EMAIL PROTECTED]

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers