Svein,

It sounds like this should be added to JIRA, though I wonder if this is just a case of some bad/invalid JavaScript that confuses the JS parser. You'll want to include the URL where this problem happens and the page source. It's probably best to grab the source with something like curl or wget rather than your browser.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Svein Yngvar Willassen <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 17, 2008 10:32:07 AM
Subject: Re: Parser bug?

Hi again.

This problem disappeared when I removed the "parse-js" plugin. I did so after determining that the problem occurred when the HTML parser applies HtmlParseFilters. I'm not sure exactly what the problem with the JS parser is, but I can provide an unparsed example segment if anyone is interested in looking at it.

I have just installed and run Nutch 1.0-dev with the standard configuration, so I'm sure others will stumble across this. Should I report this as a bug in JIRA, by the way?

Regards,
Svein

2008/4/17, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
>
> A followup:
>
> The beginning of the outlinks is ok, but the end is junk. I've identified
> these junk parts as fragments of JavaScript code elsewhere in the original
> documents. They all have in common that they immediately follow a ' in the
> js code. I tried changing the parser.html.impl property from "neko" to
> "tagsoup", but that didn't change anything.
>
> I'd like to find out about this. Where do you think I should start adding
> debug output to identify the source of the problem? Would
> DOMContentUtils.getOutlinks() be the right place to start?
>
> Regards,
>
> Svein
>
>
> 2008/4/16, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
> >
> > Hello folks,
> >
> > I'm completely puzzled over some strange behaviour I'm observing. For
> > some reason, the parser seems to add a lot of text to the links so
> > that the observed outlinks are much longer than their appearance in the
> > html code.
> >
> > I just wanted to check if anyone else has seen this, if it is a known
> > issue, or perhaps if I'm missing something completely.
> >
> > The fetched content:
> >
> > <ul class="vertical-list">
> > <li class="explore active"><a
> > href="http://www.vox.com/explore/"><span>Explore Vox</span></a></li>
> > <li class="bright"><a
> > href="http://www.vox.com/culture/"><span>Culture</span></a></li>
> > <li class="bright"><a
> > href="http://www.vox.com/entertainment/
> > "><span>Entertainment</span></a></li>
> > <li class="bright"><a
> > href="http://www.vox.com/life/"><span>Life</span></a></li>
> > <li class="bright"><a
> > href="http://www.vox.com/music/"><span>Music</span></a></li>
> > <li class="bright"><a
> > href="http://www.vox.com/politics/"><span>News &
> > Politics</span></a></li>
> > <li class="bright last"><a
> > href="http://www.vox.com/technology/"><span>Technology</span></a></li>
> > </ul>
> >
> > Gives the following outlinks:
> >
> > outlink: toUrl:
> >
> > http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Explore Vox
> > outlink: toUrl:
> >
> > http://www.vox.com/culture/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Culture
> > outlink: toUrl:
> >
> > http://www.vox.com/entertainment/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Entertainment
> > outlink: toUrl:
> >
> > http://www.vox.com/life/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Life
> > outlink: toUrl:
> >
> > http://www.vox.com/music/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Music
> > outlink: toUrl:
> >
> > http://www.vox.com/politics/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: News & Politics
> > outlink: toUrl:
> >
> > http://www.vox.com/technology/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Technology
> >
> > Where does all that ";if(omd...." blah blah-stuff come from?
> >
> >
> > --
> > Best Regards,
> >
> > Svein Y. Willassen
> > http://willassen.blogspot.com/
> >
>
>
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/
>

--
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
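
For anyone trying to reproduce the symptom described in this thread, one way to spot affected outlinks is to compare each reported toUrl against the href values actually present in the fetched HTML. The following is only a rough sketch using the Python standard library; it is not part of Nutch, and the names HrefCollector and find_polluted_outlinks are made up for illustration. It assumes that a "polluted" outlink is one that begins with a real href from the page but has extra text appended.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not Nutch code)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Trim whitespace, since hrefs in the thread wrap across lines.
                    self.hrefs.append(value.strip())


def find_polluted_outlinks(html, outlinks):
    """Return (href, extra_text) pairs for outlinks that start with a real
    href from the page but carry additional trailing text."""
    collector = HrefCollector()
    collector.feed(html)
    hrefs = set(collector.hrefs)

    polluted = []
    for link in outlinks:
        if link in hrefs:
            continue  # exact match with a real href: nothing appended
        matches = [h for h in hrefs if link.startswith(h)]
        if matches:
            href = max(matches, key=len)  # the longest real href is the best guess
            polluted.append((href, link[len(href):]))
    return polluted


if __name__ == "__main__":
    page = '<li><a href="http://www.vox.com/explore/"><span>Explore Vox</span></a></li>'
    reported = ["http://www.vox.com/explore/;if(omd.indexOf("]
    for href, extra in find_polluted_outlinks(page, reported):
        print("real href:", href, "| appended junk:", extra)
```

The appended text it reports can then be searched for in the page's script blocks, which is how the junk was traced back to JavaScript fragments in the followup above.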
