Re: problem parsing HTML

Ian Holsman Thu, 12 Apr 2007 18:24:03 -0700

Hi Dennis,
thanks for the fast response.


I'm running the SVN head.
I'll try narrowing it down a bit further.

What led me to believe this was the cause was looking at what the fetcher was fetching. It could be that we had some bad HTML on our servers, but it's a standard header area.


regards
Ian

On 13/04/2007, at 11:17 AM, Dennis Kubes wrote:

It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below, I get no outlink for this. What version of Nutch are you running?
Dennis Kubes

Ian Holsman wrote:
Hi.
I'm trying to figure out how Nutch actually extracts the links out of a piece of HTML. I'm getting confused about what parts TagSoup, NekoHTML, and parse-html play in all this. From what I can see, the regular expression it is using to extract the link is slightly off, but I'm not sure where it actually does this bit.
The fragment in question is this:
<a href="#|" onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID+ ":NewsMaker: National, Political, World, Breaking News and More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return false;' id="newsmaker80631.pre"><img border="0" src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21" alt="Prev"/></a>
and it is attempting to find ;s_account2=(t[0].indexOf(
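For comparison, here is a minimal sketch of what a plain attribute-based extractor would report for this fragment. It uses Python's standard html.parser rather than anything from Nutch, so it only illustrates the expected result and says nothing about where Nutch goes wrong. The names LinkCollector and extract_links are made up for this example, and the onclick body is shortened.

```python
from html.parser import HTMLParser

# Shortened copy of the fragment above (onclick body trimmed for readability).
FRAGMENT = (
    '<a href="#|" '
    'onclick=\'s_linkTrackVars="None";t=s_account.split(",");'
    's_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);return false;\' '
    'id="newsmaker80631.pre">'
    '<img border="0" src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" '
    'width="25" height="21" alt="Prev"/></a>'
)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href of <a> and src of <img> only."""

    # Which attribute carries the link for each tag we care about.
    LINK_ATTRS = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        for name, value in attrs:
            # Event handlers such as onclick are never treated as links here.
            if name == wanted and value:
                self.links.append((tag, value))


def extract_links(html):
    """Return (tag, url) pairs found in link-bearing attributes."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for tag, url in extract_links(FRAGMENT):
        print(tag, url)
```

Under these assumptions only the href value and the img src come back, and nothing from inside the onclick handler is reported as a link.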
TIA
Ian
--
Ian Holsman
[EMAIL PROTECTED]
