Hi again.

This problem disappeared when I removed the "parse-js" plugin. I did so
after determining that the problem occurred when the HTML parser applies
its HtmlParseFilters. I'm not sure exactly what the problem with the JS
parser is, but I can provide an unparsed example segment if anyone is
interested in looking at it.
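
For reference, a minimal sketch of how the plugin can be dropped, assuming
the active plugins are selected by the plugin.includes property and that it
is overridden in conf/nutch-site.xml. The plugin list below is an assumption
based on a typical default; copy the value from your own nutch-default.xml
and remove only the "js" alternative from the parse-(...) group.

```xml
<!-- conf/nutch-site.xml: override plugin.includes without parse-js.
     The plugin list is illustrative, not authoritative; start from the
     value in your nutch-default.xml and drop only "js". -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```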

I have just installed and run Nutch 1.0-dev with the standard configuration,
so I'm sure others will stumble across this.
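
In case it helps others check whether their own crawl is affected, here is a
rough sketch in Python that scans a text dump for outlinks that look polluted
by script code. It assumes the dump contains "outlink: toUrl: <url> anchor:
<text>" entries like the ones quoted below. The function name and the set of
suspicious characters are my own choices for illustration, not anything taken
from Nutch, and legitimate URLs containing parentheses would be flagged too.

```python
import re

# Characters that rarely appear unescaped in a real URL but are common in
# JavaScript source. This set is an illustrative guess, not a Nutch rule.
SUSPICIOUS = re.compile(r"[(){}<>\\']|;if\b")

# Matches "outlink: toUrl: <url> anchor: <text>", allowing the URL to sit on
# a later line than the "toUrl:" label, as in the dump quoted in this thread.
OUTLINK = re.compile(r"outlink:\s*toUrl:\s*(\S+)\s+anchor:[ \t]*([^\n]*)")


def find_suspicious_outlinks(dump_text):
    """Return (url, anchor) pairs whose URL looks polluted by script code."""
    hits = []
    for match in OUTLINK.finditer(dump_text):
        url, anchor = match.group(1), match.group(2).strip()
        if SUSPICIOUS.search(url):
            hits.append((url, anchor))
    return hits


if __name__ == "__main__":
    sample = (
        "outlink: toUrl: http://www.vox.com/life/;if(omd.indexOf(/+s._in+/ anchor: Life\n"
        "outlink: toUrl: http://www.vox.com/music/ anchor: Music\n"
    )
    for url, anchor in find_suspicious_outlinks(sample):
        print(anchor, "->", url)
```

Running it over a dump before and after changing the plugin list should show
whether the junk suffixes are gone.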

Should I report this as a bug in JIRA by the way?

Regards,

Svein

2008/4/17, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
>
> A followup:
>
> The beginning of the outlinks is ok, but the end is junk. I've identified
> these junk parts as fragments of JavaScript code elsewhere in the original
> documents. They all have in common that they immediately follow a ' in the
> js code. I tried changing the parser.html.impl property from "neko" to
> "tagsoup", but that didn't change anything.
>
> I'd like to find out about this. Where do you think I should start adding
> debug output to identify the source of the problem? Would
> DOMContentUtils.getOutlinks() be the right place to start?
>
> Regards,
>
> Svein
>
>
> 2008/4/16, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
> >
> > Hello folks,
> >
> > I'm completely puzzled by some strange behaviour I'm observing. For
> > some reason, the parser seems to add a lot of text to the links, so
> > that the observed outlinks are much longer than their appearance in the
> > HTML code.
> >
> > I just wanted to check if anyone else has seen this, if it is a known
> > issue, or perhaps if I'm missing something completely.
> >
> > The fetched content:
> >
> > <ul class="vertical-list">
> >            <li class="explore active"><a
> > href="http://www.vox.com/explore/"><span>Explore Vox</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/culture/"><span>Culture</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/entertainment/"><span>Entertainment</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/life/"><span>Life</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/music/"><span>Music</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/politics/"><span>News &amp;
> > Politics</span></a></li>
> >            <li class="bright last"><a
> > href="http://www.vox.com/technology/"><span>Technology</span></a></li>
> >        </ul>
> >
> > Gives the following outlinks:
> >
> > outlink: toUrl:
> >
> > http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Explore Vox
> > outlink: toUrl:
> >
> > http://www.vox.com/culture/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Culture
> > outlink: toUrl:
> >
> > http://www.vox.com/entertainment/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Entertainment
> > outlink: toUrl:
> >
> > http://www.vox.com/life/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Life
> > outlink: toUrl:
> >
> > http://www.vox.com/music/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Music
> > outlink: toUrl:
> >
> > http://www.vox.com/politics/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: News & Politics
> > outlink: toUrl:
> >
> > http://www.vox.com/technology/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Technology
> >
> > Where does all that ";if(omd...." blah blah-stuff come from?
> >
> >
> > --
> > Best Regards,
> >
> > Svein Y. Willassen
> > http://willassen.blogspot.com/
> >
>
>
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/
>



-- 
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
