It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below, I get no outlink for this. What version of Nutch are you running?
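
To illustrate the difference between walking a parsed DOM and running a regular expression over raw HTML, here is a minimal sketch. It is not Nutch's code: it uses the JDK's XML parser as a stand-in for NekoHTML/TagSoup, and the class name OutlinkSketch, the method collectHrefs, the shortened onclick value, and the example.com URL are all made up for illustration. It only collects raw href values and does not attempt the URL resolution or filtering that Nutch applies afterwards.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch.
public class OutlinkSketch {

    // Recursively visit the DOM and record the href of every <a> element.
    // Other attributes, such as onclick, are never inspected.
    static void walk(Node node, List<String> hrefs) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            Element anchor = (Element) node;
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), hrefs);
        }
    }

    // Assumption: the input is well-formed markup, so the JDK XML parser
    // can stand in for a tolerant HTML parser.
    static List<String> collectHrefs(String wellFormedMarkup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Node root = builder
                .parse(new InputSource(new StringReader(wellFormedMarkup)))
                .getDocumentElement();
        List<String> hrefs = new ArrayList<>();
        walk(root, hrefs);
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the fragment from the original message.
        String fragment =
                "<a href=\"#|\" onclick='s_account2=(t[0].indexOf(\"aolsvc\")"
                + "==-1?t[0]:t[1]);return false;'>"
                + "<img src=\"http://example.com/backbtn\" alt=\"Prev\"/></a>";
        System.out.println(collectHrefs(fragment)); // prints [#|]
    }
}
```

With this approach the JavaScript inside onclick is just an attribute value that the walk never reads, so text such as ;s_account2=(t[0].indexOf( cannot be picked up as a link at this stage.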

Dennis Kubes

Ian Holsman wrote:
> Hi.
>
> I'm trying to figure out how nutch actually extracts the links out of a
> piece of HTML.
>
> I'm getting confused in what parts TagSoup, NekoHTML, and parse-html
> play in all this.
>
> from what I can see the regular expression it is using to extract the
> link is slightly off, but i'm not sure
> where it actually does this bit.
>
> the fragment in question is this:
>
> <a href="#|"
> onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
> ":NewsMaker: National, Political, World, Breaking News and More :" +
> nm_cur["newsmaker80631"] + " of
> 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return
>
> false;' id="newsmaker80631.pre"><img border="0"
> src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21"
> alt="Prev"/></a>
>
> and it is attempting to find ;s_account2=(t[0].indexOf(
>
>
>
> TIA
> Ian
>
> --
> Ian Holsman
> [EMAIL PROTECTED]

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers