A follow-up: the beginning of each outlink is fine, but the end is junk. I've identified the junk parts as fragments of JavaScript code from elsewhere in the original documents. What they have in common is that each fragment immediately follows a ' character in the JavaScript code. I tried changing the parser.html.impl property from "neko" to "tagsoup", but that made no difference.
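
To make the symptom concrete, here is a minimal sketch of a check that splits such an outlink into its plausible URL part and the trailing junk. It is plain Java and uses no Nutch classes; the class name OutlinkSanityCheck, its method names, and the set of "suspicious" characters are assumptions made purely for illustration. Since ';' is legal in URLs, treat it as a heuristic for debugging, not as a fix.

```java
// Hypothetical debugging helper, not part of Nutch.
final class OutlinkSanityCheck {

    // Assumption: characters that are rare in a plain href but common in JavaScript source.
    private static final String SUSPICIOUS = ";(){}'\"<>";

    private OutlinkSanityCheck() {}

    /** Returns the index of the first suspicious character in url, or -1 if there is none. */
    static int firstSuspiciousIndex(String url) {
        for (int i = 0; i < url.length(); i++) {
            if (SUSPICIOUS.indexOf(url.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the part of url before the first suspicious character (the whole url if none). */
    static String cleanPrefix(String url) {
        int i = firstSuspiciousIndex(url);
        return i < 0 ? url : url.substring(0, i);
    }

    /** Returns the part of url from the first suspicious character onward ("" if none). */
    static String junkSuffix(String url) {
        int i = firstSuspiciousIndex(url);
        return i < 0 ? "" : url.substring(i);
    }
}
```

Calling cleanPrefix on one of the outlinks from my first mail gives back the href as it appears in the HTML, and junkSuffix gives the part that can then be searched for in the page's script code.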

I'd like to get to the bottom of this problem. Where do you think I should start adding debug output to identify its source? Would DOMContentUtils.getOutlinks() be the right place to start?

Regards,
Svein

2008/4/16, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
>
> Hello folks,
>
> I'm completely puzzled by some strange behaviour I'm observing. For
> some reason, the parser seems to add a lot of text to the links, so
> that the observed outlinks are much longer than their appearance in
> the HTML code.
>
> I just wanted to check whether anyone else has seen this, whether it
> is a known issue, or perhaps whether I'm missing something completely.
>
> The fetched content:
>
> <ul class="vertical-list">
> <li class="explore active"><a href="http://www.vox.com/explore/"><span>Explore Vox</span></a></li>
> <li class="bright"><a href="http://www.vox.com/culture/"><span>Culture</span></a></li>
> <li class="bright"><a href="http://www.vox.com/entertainment/"><span>Entertainment</span></a></li>
> <li class="bright"><a href="http://www.vox.com/life/"><span>Life</span></a></li>
> <li class="bright"><a href="http://www.vox.com/music/"><span>Music</span></a></li>
> <li class="bright"><a href="http://www.vox.com/politics/"><span>News & Politics</span></a></li>
> <li class="bright last"><a href="http://www.vox.com/technology/"><span>Technology</span></a></li>
> </ul>
>
> Gives the following outlinks:
>
> outlink: toUrl: http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Explore Vox
> outlink: toUrl: http://www.vox.com/culture/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Culture
> outlink: toUrl: http://www.vox.com/entertainment/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Entertainment
> outlink: toUrl: http://www.vox.com/life/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Life
> outlink: toUrl: http://www.vox.com/music/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Music
> outlink: toUrl: http://www.vox.com/politics/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: News & Politics
> outlink: toUrl: http://www.vox.com/technology/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/ anchor: Technology
>
> Where does all that ";if(omd...." blah-blah stuff come from?
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/

--
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
