After a couple of days of intensive searching over 100,000 URLs, I found the one
that makes the Nutch crawl fail at the end of the reduce phase when it is
fetching in cycle 2. The bad URL is:

http://www.globalmedlaw.com/Canadam.html

Do you have any idea why this URL brings down Nutch at the end of the second
fetch cycle with this error:

060131 162414  reduce 50%
060131 162432  reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:347)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:111)
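
One way to see what the parser might be choking on is to fetch that page
standalone, outside Nutch, and check the status, content type, and actual body
size. A rough sketch using only plain java.net (nothing Nutch-specific; the
class name is made up):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Standalone check of the suspect page: print status, headers, and body size.
public class CheckUrl {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.globalmedlaw.com/Canadam.html");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        System.out.println("Status:         " + conn.getResponseCode());
        System.out.println("Content-Type:   " + conn.getContentType());
        System.out.println("Content-Length: " + conn.getContentLength());
        // Count the bytes actually returned; a huge or truncated body could
        // be what trips the parser.
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        in.close();
        System.out.println("Bytes read:     " + total);
    }
}

If the response has an odd content type or an unexpectedly large body, that
would point at the parser plugin handling it rather than at the fetcher itself.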

Thanks, Mike

On 1/30/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
>
> I have a huge disk too, and the /tmp folder was fine with almost 200G free
> on that partition, but it still fails. I am going to do the same and look
> for the bad URL that causes the problem. But how can Nutch be so sensitive
> to a particular URL that it fails!? It might be because of the parser plugins.
>
> Mike.
>
>
>  On 1/30/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > I don't think it's a problem of disk capacity, since I am working on a
> > huge disk and only 10% of it is used.
> >
> > What I decided to do is split the seed into two parts and see if I still
> > get this problem. One half finished successfully, but the second half had
> > the same problem, so I continued with the splitting.
> >
> > I started with a group of 80,000 URLs and now have a group of 5,000 that
> > shows this problem when I run it. I will keep narrowing it down until I
> > find the smallest group that still fails, and will let you know about the
> > seed.
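> >
> > In case it saves some manual work, the splitting itself can be scripted.
> > A rough sketch in plain Java (assuming one URL per line in the seed file;
> > the file names are just placeholders), which halves a seed list so each
> > half can be crawled separately:
> >
> > import java.io.*;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > // Halve a one-URL-per-line seed file so each half can be crawled alone.
> > public class SplitSeeds {
> >     public static void main(String[] args) throws IOException {
> >         List<String> urls = new ArrayList<String>();
> >         BufferedReader in = new BufferedReader(new FileReader("seeds.txt"));
> >         String line;
> >         while ((line = in.readLine()) != null) {
> >             if (line.trim().length() > 0) urls.add(line.trim());
> >         }
> >         in.close();
> >         int mid = urls.size() / 2;
> >         write(urls.subList(0, mid), "seeds-a.txt");
> >         write(urls.subList(mid, urls.size()), "seeds-b.txt");
> >     }
> >
> >     static void write(List<String> part, String file) throws IOException {
> >         PrintWriter out = new PrintWriter(new FileWriter(file));
> >         for (String u : part) out.println(u);
> >         out.close();
> >     }
> > }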
> >
> > Thanks,
> > Rafit
> >
> >
> >
> > >From: Ken Krugler <[EMAIL PROTECTED]>
> > >Reply-To: [email protected]
> > >To: [email protected]
> > >Subject: Re: Problems with MapRed-
> > >Date: Sun, 29 Jan 2006 16:42:15 -0800
> > >
> > >>This looks like the namenode has lost connection to one of the
> > >>datanodes. The default number of replications in ndfs is 3, and it
> > >>seems like the namenode has only 2 in its list, so it logs this
> > >>warning. As Stefan suggested, you should check the diskspace on your
> > >>machines. If I recall correctly, datanodes crash when they run out of
> > >>diskspace.
> > >>
> > >>This could also explain your problem with the fetching. One datanode
> > >>runs out of diskspace and crashes while one of the tasktrackers is
> > >>writing data to it. You should also check if the partition with /tmp
> > >>has enough free space.
> > >
> > >Yes, that also can happen.
> > >
> > >Especially if, like us, you accidentally configure Nutch to use a
> > >directory on the root volume, but your servers have been configured
> > >with a separate filesystem for /data, and that's where all the disk
> > >capacity is located.
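> > >
> > >A quick way to check all the relevant partitions at once from Java is to
> > >shell out to df. A rough sketch (the paths listed are only examples):
> > >
> > >import java.io.BufferedReader;
> > >import java.io.InputStreamReader;
> > >
> > >// Print "df -k" output for the partitions Nutch might be writing to
> > >// (paths are just examples; requires a Unix-like system with df).
> > >public class DiskReport {
> > >    public static void main(String[] args) throws Exception {
> > >        String[] paths = { "/tmp", "/data", "/" };
> > >        for (String p : paths) {
> > >            Process proc = Runtime.getRuntime().exec(new String[] { "df", "-k", p });
> > >            BufferedReader out = new BufferedReader(
> > >                new InputStreamReader(proc.getInputStream()));
> > >            String line;
> > >            while ((line = out.readLine()) != null) {
> > >                System.out.println(line);
> > >            }
> > >            proc.waitFor();
> > >        }
> > >    }
> > >}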
> > >
> > >-- Ken
> > >
> > >
> > >>Stefan Groschupf wrote:
> > >>>maybe the HDDs are full?
> > >>>try:
> > >>>bin/nutch ndfs -report
> > >>>Nutch generates some temporary data during processing.
> > >>>
> > >>>On 30.01.2006 at 00:54, Mike Smith wrote:
> > >>>
> > >>>>I forgot to mention the namenode log file gives me thousands of
> > >>>>these:
> > >>>>
> > >>>>060129 155553 Zero targets found,
> > >>>>forbidden1.size=2allowSameHostTargets=false
> > >>>>forbidden2.size()=0
> > >>>>060129 155553 Zero targets found,
> > >>>>forbidden1.size=2allowSameHostTargets=false
> > >>>>forbidden2.size()=0
> > >
> > >
> > >--
> > >Ken Krugler
> > >Krugle, Inc.
> > >+1 530-470-9200
> >
> >
> >
>
