Hi again. This problem disappeared when I removed the "parse-js" plugin. I did so after determining that the problem occurred when the HTML parser applies HtmlParseFilters. I'm not sure exactly what the problem with the JS parser is, but I can provide an unparsed example segment if anyone is interested in looking at it.
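
For anyone who wants to compare parse output with and without the plugin, here is a minimal sketch of a heuristic check for outlinks that have picked up JavaScript fragments. It is not Nutch code: the class `OutlinkSanityCheck`, its methods, and the regular expression are assumptions made for illustration, and the outlink strings would have to be copied out of a segment dump by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: flags URLs that look like they
// have JavaScript source appended to them.
public class OutlinkSanityCheck {

    // Heuristic only: a ';' followed by a JS keyword, an indexOf( call,
    // or curly braces. Legitimate URLs can contain ';', so a bare ';' is not flagged.
    private static final Pattern JS_DEBRIS =
            Pattern.compile(";\\s*(if|var|function)\\b|indexOf\\(|[{}]");

    public static boolean looksLikeJsDebris(String url) {
        return JS_DEBRIS.matcher(url).find();
    }

    // Returns only the URLs that match the heuristic, in input order.
    public static List<String> suspicious(List<String> urls) {
        List<String> flagged = new ArrayList<String>();
        for (String url : urls) {
            if (looksLikeJsDebris(url)) {
                flagged.add(url);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Sample values taken from the outlinks quoted later in this thread.
        List<String> outlinks = Arrays.asList(
                "http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/",
                "http://www.vox.com/culture/");
        for (String url : suspicious(outlinks)) {
            System.out.println("suspicious outlink: " + url);
        }
    }
}
```

Running the same check on outlinks parsed with and without "parse-js" enabled should show whether the junk only appears when the plugin is active. The pattern is deliberately narrow and will miss other kinds of debris.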
I have just installed and run Nutch 1.0-dev with the standard configuration, so I'm sure others will stumble across this. Should I report this as a bug in JIRA, by the way?

Regards,
Svein

2008/4/17, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
> A follow-up:
>
> The beginning of the outlinks is OK, but the end is junk. I've identified
> these junk parts as fragments of JavaScript code elsewhere in the original
> documents. They all have in common that they immediately follow a ' in the
> JS code. I tried changing the parser.html.impl property from "neko" to
> "tagsoup", but that didn't change anything.
>
> I'd like to find out about this. Where do you think I should start adding
> debug output to identify the source of the problem? Would
> DOMContentUtils.getOutlinks() be the right place to start?
>
> Regards,
>
> Svein
>
> 2008/4/16, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
> > Hello folks,
> >
> > I'm completely puzzled by some strange behaviour I'm observing. For
> > some reason, the parser seems to add a lot of text to the links, so
> > that the observed outlinks are much longer than their appearance in the
> > HTML code.
> >
> > I just wanted to check if anyone else has seen this, if it is a known
> > issue, or perhaps if I'm missing something completely.
> >
> > The fetched content:
> >
> > <ul class="vertical-list">
> > <li class="explore active"><a href="http://www.vox.com/explore/"><span>Explore Vox</span></a></li>
> > <li class="bright"><a href="http://www.vox.com/culture/"><span>Culture</span></a></li>
> > <li class="bright"><a href="http://www.vox.com/entertainment/"><span>Entertainment</span></a></li>
> > <li class="bright"><a href="http://www.vox.com/life/"><span>Life</span></a></li>
> > <li class="bright"><a href="http://www.vox.com/music/"><span>Music</span></a></li>
> > <li class="bright"><a href="http://www.vox.com/politics/"><span>News & Politics</span></a></li>
> > <li class="bright last"><a href="http://www.vox.com/technology/"><span>Technology</span></a></li>
> > </ul>
> >
> > Gives the following outlinks:
> >
> > outlink: toUrl: http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Explore Vox
> > outlink: toUrl: http://www.vox.com/culture/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Culture
> > outlink: toUrl: http://www.vox.com/entertainment/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Entertainment
> > outlink: toUrl: http://www.vox.com/life/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Life
> > outlink: toUrl: http://www.vox.com/music/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Music
> > outlink: toUrl: http://www.vox.com/politics/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: News & Politics
> > outlink: toUrl: http://www.vox.com/technology/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Technology
> >
> > Where does all that ";if(omd...." blah blah-stuff come from?

--
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
