Svein,

It sounds like this should be added to JIRA, though I wonder if this is just 
a case of some bad/invalid JavaScript that confuses the js parser.  You'll 
want to include the URL where this problem happens and its source.  It's 
probably best to grab the source with something like curl or wget rather 
than your browser.
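
For example, here is a minimal Python sketch of the same idea, used in place
of curl or wget. The function name save_raw_source and the output filename
are made up for illustration. The point is only that the bytes land on disk
exactly as the server sent them, without a browser rewriting the markup.

```python
import urllib.request


def save_raw_source(url, path):
    """Download url and write the response body to path, byte for byte.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()  # raw bytes: no decoding, no DOM cleanup
    with open(path, "wb") as out:
        out.write(body)
    return len(body)  # number of bytes saved
```

Calling save_raw_source("http://www.vox.com/explore/", "explore.html") should
then give a file that can be attached to the JIRA issue.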

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Svein Yngvar Willassen <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 17, 2008 10:32:07 AM
Subject: Re: Parser bug?

Hi again.

This problem disappeared when I removed the "parse-js" plugin. I did so
after determining that the problem occurred when the HTML parser applies
HtmlParseFilters. I'm not sure exactly what the problem with the js
parser is, but I can provide an unparsed example segment if anyone is
interested in looking at it.

I have just installed and run Nutch 1.0-dev with the standard configuration,
so I'm sure others will stumble across this.

Should I report this as a bug in JIRA by the way?

Regards,

Svein

2008/4/17, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
>
> A followup:
>
> The beginning of the outlinks is ok, but the end is junk. I've identified
> these junk parts as fragments of JavaScript code elsewhere in the original
> documents. They all have in common that they immediately follow a ' in the
> js code. I tried changing the parser.html.impl property from "neko" to
> "tagsoup", but that didn't change anything.
>
> I'd like to find out about this. Where do you think I should start adding
> debug output to identify the source of the problem? Would
> DOMContentUtils.getOutlinks() be the right place to start?
>
> Regards,
>
> Svein
>
>
> 2008/4/16, Svein Yngvar Willassen <[EMAIL PROTECTED]>:
> >
> > Hello folks,
> >
> > I'm completely puzzled over some strange behaviour I'm observing. For
> > some reason, the parser seems to add a lot of text to the links so
> > that the observed outlinks are much longer than their appearance in the
> > html code.
> >
> > I just wanted to check if anyone else has seen this, if it is a known
> > issue, or perhaps if I'm missing something completely.
> >
> > The fetched content:
> >
> > <ul class="vertical-list">
> >            <li class="explore active"><a
> > href="http://www.vox.com/explore/"><span>Explore Vox</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/culture/"><span>Culture</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/entertainment/
> > "><span>Entertainment</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/life/"><span>Life</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/music/"><span>Music</span></a></li>
> >            <li class="bright"><a
> > href="http://www.vox.com/politics/"><span>News &amp;
> > Politics</span></a></li>
> >            <li class="bright last"><a
> > href="http://www.vox.com/technology/"><span>Technology</span></a></li>
> >        </ul>
> >
> > Gives the following outlinks:
> >
> > outlink: toUrl:
> >
> > http://www.vox.com/explore/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Explore Vox
> > outlink: toUrl:
> >
> > http://www.vox.com/culture/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Culture
> > outlink: toUrl:
> >
> > http://www.vox.com/entertainment/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Entertainment
> > outlink: toUrl:
> >
> > http://www.vox.com/life/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Life
> > outlink: toUrl:
> >
> > http://www.vox.com/music/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Music
> > outlink: toUrl:
> >
> > http://www.vox.com/politics/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: News & Politics
> > outlink: toUrl:
> >
> > http://www.vox.com/technology/;if(omd.indexOf(/+s._in+/+s._in+/+s._in+/%3C%5C/span%3E/%3C%5C/span%3E/+s._in+/
> > anchor: Technology
> >
> > Where does all that ";if(omd...." blah blah-stuff come from?
> >
> >
> > --
> > Best Regards,
> >
> > Svein Y. Willassen
> > http://willassen.blogspot.com/
> >
>
>
>
> --
> Best Regards,
>
> Svein Y. Willassen
> http://willassen.blogspot.com/
>



-- 
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/


