Re: [Nutch-dev] problem parsing HTML

Dennis Kubes Thu, 12 Apr 2007 18:17:54 -0700

It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks() 
which is called from org.apache.nutch.parse.html.HtmlParser.  Running 
some simple tests on your fragment below, I get no outlink for this. 
What version of Nutch are you running?
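
For illustration only, here is a minimal sketch of the general idea behind DOM-based link extraction: walk the parsed tree and read the href attribute of each anchor. This is not the Nutch source. The class OutlinkSketch and its methods are hypothetical names, it uses the JDK's XML parser rather than NekoHTML or TagSoup, and it assumes the input is well-formed XHTML. In a walk like this, the onclick JavaScript is just an attribute value and is never scanned for links.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not Nutch code: collects raw href values from <a> elements.
public class OutlinkSketch {

    // Recursively walk the DOM; only the href attribute of anchors is read.
    static void collectLinks(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), out);
        }
    }

    // Parse a well-formed XHTML string (an assumption of this sketch) and
    // return the href values in document order.
    static List<String> extractLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> links = new ArrayList<>();
        collectLinks(doc.getDocumentElement(), links);
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the fragment in question; URLs are placeholders.
        String fragment = "<div>"
                + "<a href=\"#|\" onclick='t=s_account.split(\",\");return false;'>"
                + "<img src=\"http://example.com/backbtn\" alt=\"Prev\"/></a>"
                + "<a href=\"http://example.com/next\">Next</a>"
                + "</div>";
        System.out.println(extractLinks(fragment)); // prints [#|, http://example.com/next]
    }
}
```

If a string from inside the onclick handler shows up as a link, it would have to come from some step other than a plain anchor walk like this one.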


Dennis Kubes

Ian Holsman wrote:
> Hi.
> 
> I'm trying to figure out how nutch actually extracts the links out of a 
> piece of HTML.
> 
> I'm confused about what parts TagSoup, NekoHTML, and parse-html 
> play in all this.
> 
> From what I can see, the regular expression it uses to extract the 
> link is slightly off, but I'm not sure
> where it actually does this.
> 
> the fragment in question is this:
> 
> <a href="#|" 
> onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID + 
> ":NewsMaker: National, Political, World, Breaking News and More :" + 
> nm_cur["newsmaker80631"] + " of 
> 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return
>  
> false;' id="newsmaker80631.pre"><img border="0" 
> src="http://cdn.XXXX.XXXX.com/ch_news/backbtn"; width="25" height="21" 
> alt="Prev"/></a>
> 
> and it is attempting to find ;s_account2=(t[0].indexOf(
> 
> 
> 
> TIA
> Ian
> 
> -- 
> Ian Holsman
> [EMAIL PROTECTED]
