A followup:

The beginning of each outlink is OK, but the end is junk. I've identified
the junk parts as fragments of JavaScript code from elsewhere in the original
documents. What they have in common is that each fragment immediately follows
a single quote (') in the JS code. I tried changing the parser.html.impl
property from "neko" to "tagsoup", but that didn't change anything.

I'd like to get to the bottom of this. Where do you think I should start
adding debug output to identify the source of the problem? Would
DOMContentUtils.getOutlinks() be the right place to start?
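
For reference, here is a rough sketch of the kind of check that could be run on a fetched page and its reported outlinks before instrumenting the parser itself. It is plain Java and does not use any Nutch classes; the class name OutlinkCheck and both methods are hypothetical names made up for illustration. It pulls the double-quoted href values out of the raw HTML with a simple regex and reports whatever an outlink carries beyond the longest matching href, which should isolate the appended junk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging helper, not part of Nutch.
class OutlinkCheck {
    // Matches double-quoted href attribute values only (an assumption
    // that holds for the markup quoted in this thread).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collect the raw href values exactly as they appear in the page source.
    static List<String> rawHrefs(String html) {
        List<String> out = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    // Return the text an outlink carries beyond the longest raw href it
    // starts with, or null if no raw href is a prefix of the outlink.
    static String extraSuffix(String outlink, List<String> hrefs) {
        String best = null;
        for (String h : hrefs) {
            if (!h.isEmpty() && outlink.startsWith(h)
                    && (best == null || h.length() > best.length())) {
                best = h;
            }
        }
        return best == null ? null : outlink.substring(best.length());
    }

    public static void main(String[] args) {
        String html = "<li><a href=\"http://www.vox.com/life/\"><span>Life</span></a></li>";
        String outlink = "http://www.vox.com/life/;if(omd.indexOf(";
        // Prints the part of the outlink that is not in the page's href.
        System.out.println(extraSuffix(outlink, rawHrefs(html)));
    }
}
```

If the suffix is non-empty for every link on a page, searching the raw content for that suffix should show which script block it was taken from.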

Regards,

Svein


2008/4/16, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
>
> Hello folks,
>
> I'm completely puzzled by some strange behaviour I'm observing. For
> some reason, the parser seems to append a lot of text to the links, so
> that the observed outlinks are much longer than they appear in the
> HTML code.
>
> I just wanted to check whether anyone else has seen this, whether it is a
> known issue, or whether I'm missing something completely.
>
> The fetched content:
>
> <ul class="vertical-list">
>            <li class="explore active"><a
> href="http://www.vox.com/explore/";><span>Explore Vox</span></a></li>
>            <li class="bright"><a
> href="http://www.vox.com/culture/";><span>Culture</span></a></li>
>            <li class="bright"><a
> href="http://www.vox.com/entertainment/
> "><span>Entertainment</span></a></li>
>            <li class="bright"><a
> href="http://www.vox.com/life/";><span>Life</span></a></li>
>            <li class="bright"><a
> href="http://www.vox.com/music/";><span>Music</span></a></li>
>            <li class="bright"><a
> href="http://www.vox.com/politics/";><span>News &amp;
> Politics</span></a></li>
>            <li class="bright last"><a
> href="http://www.vox.com/technology/";><span>Technology</span></a></li>
>        </ul>
>
> Gives the following outlinks:
>
> outlink: toUrl:
>
> http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: Explore Vox
> outlink: toUrl:
>
> http://www.vox.com/culture/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: Culture
> outlink: toUrl:
>
> http://www.vox.com/entertainment/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: Entertainment
> outlink: toUrl:
>
> http://www.vox.com/life/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: Life
> outlink: toUrl:
>
> http://www.vox.com/music/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: Music
> outlink: toUrl:
>
> http://www.vox.com/politics/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: News & Politics
> outlink: toUrl:
>
> http://www.vox.com/technology/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> anchor: Technology
>
> Where does all that ";if(omd...." blah blah-stuff come from?
>
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/
>



-- 
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
