Hi Dennis,
Thanks for the fast response.
I'm running the SVN head.
I'll try narrowing it down a bit further.
What led me to believe this was the cause was looking at what the
fetcher was fetching. It could be that we have some bad HTML on our
servers, but it's a standard header area.
Regards,
Ian
On 13/04/2007, at 11:17 AM, Dennis Kubes wrote:
It happens in
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks() which is
called from org.apache.nutch.parse.html.HtmlParser. Running some
simple tests on your fragment below I get no outlink for this.
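To illustrate the general idea, here is a minimal sketch of DOM-based link extraction. It is not the Nutch source: the class AnchorHrefSketch and its two helper methods are hypothetical, the JDK XML parser stands in for whichever HTML parser is configured, and the onclick script is shortened from the fragment below. It assumes the markup is well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not org.apache.nutch code.
public class AnchorHrefSketch {

    // Collect the href attribute of every <a> element in a parsed DOM tree.
    static List<String> extractHrefs(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    // Parse well-formed markup with the JDK XML parser (assumption: a
    // stand-in for a real HTML parser, which would tolerate broken markup).
    static Document parse(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the fragment from the original mail.
        String fragment =
                "<a href=\"#|\" onclick='s_account2=(t[0].indexOf(\"aolsvc\")"
                + "==-1?t[0]:t[1]);return false;'>"
                + "<img src=\"backbtn\" alt=\"Prev\"/></a>";
        System.out.println(extractHrefs(parse(fragment)));
    }
}
```

In a walk like this only the href attribute value is collected; the text inside onclick is just another attribute and is never looked at as a link.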
What version of Nutch are you running?
Dennis Kubes
Ian Holsman wrote:
Hi.
I'm trying to figure out how Nutch actually extracts the links out
of a piece of HTML.
I'm getting confused about what parts TagSoup, NekoHTML, and
parse-html play in all this.
From what I can see, the regular expression it is using to extract
the link is slightly off, but I'm not sure
where it actually does this bit.
The fragment in question is this:
<a href="#|"
onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID
+ ":NewsMaker: National, Political, World, Breaking News and
More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split
(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co
(this);s_gs(s_account2);return false;'
id="newsmaker80631.pre"><img border="0" src="http://
cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21"
alt="Prev"/></a>
and it is attempting to find ;s_account2=(t[0].indexOf(
TIA
Ian
--
Ian Holsman
[EMAIL PROTECTED]