Hi Dennis,
thanks for the fast response.

I'm running the SVN head.
I'll try narrowing it down a bit further.
What led me to suspect the link extraction was looking at what the fetcher was fetching. It could be that we have some bad HTML on our servers, but this is a standard header area.

regards
Ian

On 13/04/2007, at 11:17 AM, Dennis Kubes wrote:

It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below, I get no outlink for this. What version of Nutch are you running?
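
As a rough illustration of what DOM-based link extraction looks like, here is a minimal sketch. It is not the actual DOMContentUtils code: the names OutlinkSketch, collect, and extractHrefs are hypothetical, and it assumes the JDK's XML parser instead of NekoHTML or TagSoup, so it only accepts well-formed markup. The point it shows is that a DOM walk of this kind reads the href attribute of anchor elements and never looks inside onclick.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not Nutch code: collects href values by walking a DOM tree.
public class OutlinkSketch {

    // Recursively visit every node; record the href of each <a> element.
    // Other attributes (for example onclick) are never inspected.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("a".equalsIgnoreCase(el.getTagName()) && el.hasAttribute("href")) {
                out.add(el.getAttribute("href"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Assumption: the input is well-formed markup, since the JDK XML parser
    // is used here in place of a tolerant HTML parser.
    static List<String> extractHrefs(String wellFormedMarkup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the fragment under discussion.
        String fragment =
            "<a href=\"#|\" onclick='s_account2=(t[0].indexOf(\"aolsvc\")==-1?t[0]:t[1]);return false;'>"
            + "<img src=\"backbtn\" alt=\"Prev\"/></a>";
        System.out.println(extractHrefs(fragment));
    }
}
```

If a walk like this only ever yields the href value for the fragment, then a link built from the onclick text would have to come from some other step in the parse chain.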

Dennis Kubes

Ian Holsman wrote:
Hi.
I'm trying to figure out how Nutch actually extracts the links out of a piece of HTML. I'm getting confused about what parts TagSoup, NekoHTML, and parse-html play in all this. From what I can see, the regular expression it is using to extract the link is slightly off, but I'm not sure where it actually does this bit.
the fragment in question is this:
<a href="#|" onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID + ":NewsMaker: National, Political, World, Breaking News and More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split (",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co (this);s_gs(s_account2);return false;' id="newsmaker80631.pre"><img border="0" src="http:// cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21" alt="Prev"/></a>
and it is attempting to find ;s_account2=(t[0].indexOf(
TIA
Ian
--
Ian Holsman
[EMAIL PROTECTED]

