It is quite beyond the capabilities of a JavaScript parser to reliably
produce the right URLs, especially if the URLs have been deliberately
obfuscated by the web designer.
So why not try other solutions, such as a web scraper or a browser automation
tool like Selenium or Watir? With those, the JavaScript is interpreted
directly by the browser, so the URLs do not get mangled.
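
To make the failure mode concrete, here is a minimal sketch of what static
extraction tends to do with concatenated strings. It is not Nutch's actual
parser code; the script snippet, the function names and the character
heuristic are all assumptions made up for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical script, loosely modelled on the junk URLs reported in this
# thread. The real target URL only exists once a browser has executed it.
SCRIPT = '''
var base = "/_js/";
var tracker = base + "count.php?ref=" + escape(document.referrer);
document.write("<scr" + "ipt src='" + tracker + "'></scr" + "ipt>");
'''

# Naive static extraction: every double-quoted literal is a link candidate.
STRING_LITERAL = re.compile(r'"([^"]*)"')


def naive_candidates(script, base_url):
    """Resolve each string literal against base_url, as a static parser might."""
    return [urljoin(base_url, lit) for lit in STRING_LITERAL.findall(script)]


# Crude heuristic (an assumption, not a standard): characters that suggest
# leftover code or markup rather than a real URL.
SUSPICIOUS = re.compile(r"""[<>(){}\[\]'"\s]""")


def looks_plausible(url):
    """Return True if the URL contains none of the suspicious characters."""
    return SUSPICIOUS.search(url) is None


if __name__ == "__main__":
    for url in naive_candidates(SCRIPT, "http://tv.bild.de/_js/"):
        print("keep  " if looks_plausible(url) else "reject", url)
```

A filter like this only drops the most obvious garbage, and it may also
reject legitimate URLs that happen to contain parentheses. Fragments such as
"count.php?ref=" still pass although they are incomplete, which is why only
actually executing the script in a browser gives the real URL.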

David Cai, cto of
v-search.51vip.biz

2008/10/11 Kevin MacDonald <[EMAIL PROTECTED]>

> I don't see how that's possible unless you improve the javascript parser. I
> actually think it's pretty much impossible to get links properly from
> javascript unless the script is actually interpreted and executed, which is
> a much different task than what the parser plugin does.
>
> On Fri, Oct 10, 2008 at 1:17 AM, Höchstötter Nadine <
> [EMAIL PROTECTED]> wrote:
>
> > Thank you for your answer. But, when I override the Javascript plugin,
> > there will be links missing. Is there any possibility to get those
> > javascript urls?
> > Thanks.
> >
> > -----Original Message-----
> > From: Kevin MacDonald [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, October 9, 2008 19:26
> > To: [email protected]
> > Subject: Re: db_gone/javascript/invalid URLs
> >
> > I encountered that error as well. I believe it's happening because the
> > javascript parser is trying to pull valid urls out of javascript, which is
> > highly optimistic considering that such urls may be getting pieced
> > together using string appends. I would override the 'plugin.includes'
> > config value in nutch-default.xml (by placing it in nutch-site.xml) and
> > turn off javascript parsing.
> >
> > On Thu, Oct 9, 2008 at 8:13 AM, Höchstötter Nadine <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi all,
> > > I have a problem with javascript. I tried to crawl bild.de and I got
> > > many links not having been fetched. I got the stats and they mostly say
> > > "Status 3: (db_gone)". With a look at those urls entitled "db_gone" you
> > > will see some weird things as listed below the email. I just listed a
> > > few. I do not think that this is only a javascript problem but probably
> > > also a url normalization problem. Does anybody know how to deal with
> > > it? Thanks,
> > > Nadine.
> > >
> > > http://software.bild.de/js/6M/x-6N-6Q-6T
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/js/;l(6.1f(7,
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/js/</22>
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/js/</4t></29></22>
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/js/a.1i
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/js/},4o:q(){6(7)[6(7).4E(
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/ratgeber-karriere/jobs/allgemein
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/text/javascript
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://software.bild.de/top.document.all.
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/+escape(document.referrer)+
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/+escape(document.referrer)+
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/...
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/1.5.1.1
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/</tbody></+escape(document.referrer)+
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/</tbody></bild//CP//+escape(document.referrer)+
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/</tbody></bild//CP//bild//CP//+escape(document.referrer)+
> > >
> > > Status: 3 (db_gone)
> > >
> > > http://tv.bild.de/_js/</tbody></bild//CP//bild//CP//entertainment/body/tv/tvprogramm/tvprogramm/home/+escape(document.referrer)+
> > >
> > > Status: 3 (db_gone)
> > >
> > >
> > >
> >
>
