The link extraction happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(),
which is called from org.apache.nutch.parse.html.HtmlParser. Running
some simple tests on your fragment below, I get no outlink for it.
What version of Nutch are you running?
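
To illustrate the general idea of pulling links out of a parsed DOM
rather than matching a regular expression against the raw HTML, here is
a minimal sketch. It is not Nutch's code: the class AnchorWalkSketch and
its methods are made up for illustration, it uses the JDK's built-in XML
parser instead of NekoHTML or TagSoup (so the input has to be
well-formed), and it only looks at the href attribute of <a> elements.
The fragment in main() is a shortened stand-in for yours.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
class AnchorWalkSketch {

    // Visit the node and all of its descendants; record the href of every <a>.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("a".equalsIgnoreCase(el.getTagName()) && el.hasAttribute("href")) {
                out.add(el.getAttribute("href"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Parse a well-formed fragment and return the href values in document order.
    static List<String> hrefs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the fragment in the original message.
        String fragment =
                "<div><a href=\"#|\" onclick='s_linkType=\"o\";return false;'>"
                + "<img src=\"http://example.com/backbtn\" alt=\"Prev\"/></a></div>";
        System.out.println(hrefs(fragment));
    }
}
```

In a walk like this the onclick text is just an attribute value on the
<a> element and is never inspected, so only the href would be a
candidate link.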
Dennis Kubes
Ian Holsman wrote:
> Hi.
>
> I'm trying to figure out how nutch actually extracts the links out of a
> piece of HTML.
>
> I'm getting confused in what parts TagSoup, NekoHTML, and parse-html
> play in all this.
>
> from what I can see the regular expression it is using to extract the
> link is slightly off, but i'm not sure
> where it actually does this bit.
>
> the fragment in question is this:
>
> <a href="#|"
> onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
> ":NewsMaker: National, Political, World, Breaking News and More :" +
> nm_cur["newsmaker80631"] + " of
> 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return
>
> false;' id="newsmaker80631.pre"><img border="0"
> src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21"
> alt="Prev"/></a>
>
> and it is attempting to find ;s_account2=(t[0].indexOf(
>
>
>
> TIA
> Ian
>
> --
> Ian Holsman
> [EMAIL PROTECTED]