I finally found out why this problem happens: there seems to be a problem
with the JS parser. I used this:

<name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>

instead of the default one, which includes the JS parser, and with that I
could fetch http://www.globalmedlaw.com/Canadam.html at depth 2. But when I
use

<name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>

the reduce fails at the end of fetching. I came up with this solution
because that page uses a JS redirect to serve some dynamic content, and
removing the JS plugin made it work fine. Now I am going to run a larger
crawl over 100,000 seed URLs to see if this really solved the problem.
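
In case it helps anyone reproduce this, here is roughly how the override
looks as a complete property block in my conf/nutch-site.xml (same value as
above, just with the surrounding tags; the description is only my own note):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Plugin list with parse-js left out to work around the JS parser problem.</description>
</property>

With that in place I just rerun the usual bin/nutch crawl with -depth 2
against a seed list containing that URL.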

Have you had any problems with the JS parser?

Mike.



On 1/31/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> After a couple of days of intensive searching over 100,000 URLs, I found
> the one that makes the Nutch crawl fail at the end of reduce when it is
> fetching in cycle 2. The bad URL is:
>
> http://www.globalmedlaw.com/Canadam.html
>
> Do you have any idea why it brings down Nutch at the end of the second
> fetch cycle with this error:
>
> 060131 162414  reduce 50%
> 060131 162432  reduce 100%
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:347)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:111)
>
> Thanks, Mike
>
>  On 1/30/06, Mike Smith <[EMAIL PROTECTED]> wrote:
> >
> >
> > I have a huge disk too, and the /tmp folder was fine with almost 200G of
> > free space on that partition, but it still fails. I am going to do the
> > same and look for the bad URL that causes the problem. But how can Nutch
> > be so sensitive to a particular URL that it fails?! It might be because
> > of the parser plugins.
> >
> > Mike.
> >
> >
> >  On 1/30/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED] > wrote:
> > >
> > > Hi,
> > > I don't think it's a problem of disk capacity, since I am working on a
> > > huge disk and only 10% is used.
> > >
> > > What I decided to do is split the seed into two parts and see if I
> > > still get this problem. One half ended successfully but the second had
> > > the same problem, so I continued with the splitting.
> > >
> > > I started with a group of 80,000 URLs and now I have a group of 5,000
> > > that has this problem when I run it. I will keep splitting until I find
> > > the smallest group that has the problem, and then I'll let you know
> > > about the seed.
> > >
> > > Thanks,
> > > Rafit
> > >
> > >
> > >
> > > >From: Ken Krugler < [EMAIL PROTECTED]>
> > > >Reply-To: [email protected]
> > > >To: [email protected]
> > > >Subject: Re: Problems with MapRed-
> > > >Date: Sun, 29 Jan 2006 16:42:15 -0800
> > > >
> > > >>This looks like the namenode has lost connection to one of the
> > > >>datanodes. The default number of replications in ndfs is 3 and it
> > > >>seems like the namenode has only 2 in its list so it logs this
> > > >>warning. As Stefan suggested, you should check the diskspace on your
> > > >>machines. If I recall correctly datanodes crash when they run out of
> > > >>diskspace.
> > > >>
> > > >>This could also explain your problem with the fetching. One datanode
> > > >>runs out of diskspace and crashes while one of the tasktrackers is
> > > >>writing data to it. You should also check if the partition with /tmp
> > > >>has enough free space.
> > > >
> > > >Yes, that also can happen.
> > > >
> > > >Especially if, like us, you accidentally configure Nutch to use a
> > > >directory on the root volume, but your servers have been configured
> > > >with a separate filesystem for /data, and that's where all the disk
> > > >capacity is located.
> > > >
> > > >-- Ken
> > > >
> > > >
> > > >>Stefan Groschupf wrote:
> > > >>>maybe the HDDs are full?
> > > >>>try:
> > > >>>bin/nutch ndfs -report
> > > >>>Nutch generates some temporary data during processing.
> > > >>>
> > > >>>On 30.01.2006 at 00:54, Mike Smith wrote:
> > > >>>
> > > >>>>I forgot to mention the namenode log file gives me thousands of
> > > >>>>these:
> > > >>>>
> > > >>>>060129 155553 Zero targets found,
> > > >>>>forbidden1.size=2allowSameHostTargets=false
> > > >>>>forbidden2.size()=0
> > > >>>>060129 155553 Zero targets found,
> > > >>>>forbidden1.size=2allowSameHostTargets=false
> > > >>>>forbidden2.size()=0
> > > >
> > > >
> > > >--
> > > >Ken Krugler
> > > >Krugle, Inc.
> > > >+1 530-470-9200
> > >
> > >
> > >
> >
>
