After a couple of days of intensive searching over 100,000 URLs, I found the one that makes the Nutch crawl fail at the end of the reduce phase while fetching in cycle 2. The bad URL is:
http://www.globalmedlaw.com/Canadam.html

Do you have any idea why it brings Nutch down at the end of the second cycle of fetching with this error:

060131 162414 reduce 50%
060131 162432 reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:347)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:111)

Thanks,
Mike

On 1/30/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> I have a huge disk too, and the /tmp folder was fine with almost 200 GB free
> on that partition, but it still fails. I am going to do the same and look
> for the bad URL that causes the problem. But how can Nutch be so sensitive
> to a particular URL that it fails!? It might be because of the parser
> plugins.
>
> Mike.
>
> On 1/30/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > I don't think it's a problem of disk capacity, since I am working on a
> > huge disk and only 10% is used.
> >
> > What I decided to do is split the seed into two parts and see if I still
> > get this problem. One half finished successfully but the other half had
> > the same problem, so I continued with the splitting.
> >
> > I started with a group of 80,000 URLs and now I have a group of 5,000
> > that has this problem when I run it. I will keep splitting until I find
> > the smallest group that still fails, and then let you know about the
> > seed.
> >
> > Thanks,
> > Rafit
> >
> > >From: Ken Krugler <[EMAIL PROTECTED]>
> > >Reply-To: [email protected]
> > >To: [email protected]
> > >Subject: Re: Problems with MapRed-
> > >Date: Sun, 29 Jan 2006 16:42:15 -0800
> > >
> > >>This looks like the namenode has lost connection to one of the
> > >>datanodes. The default number of replications in NDFS is 3, and it
> > >>seems like the namenode has only 2 in its list, so it logs this
> > >>warning. As Stefan suggested, you should check the disk space on your
> > >>machines. If I recall correctly, datanodes crash when they run out of
> > >>disk space.
> > >>
> > >>This could also explain your problem with the fetching. One datanode
> > >>runs out of disk space and crashes while one of the tasktrackers is
> > >>writing data to it. You should also check whether the partition with
> > >>/tmp has enough free space.
> > >
> > >Yes, that can also happen.
> > >
> > >Especially if, like us, you accidentally configure Nutch to use a
> > >directory on the root volume, while your servers have been configured
> > >with a separate filesystem for /data, and that's where all the disk
> > >capacity is located.
> > >
> > >-- Ken
> > >
> > >>Stefan Groschupf wrote:
> > >>>maybe the HDDs are full?
> > >>>try:
> > >>>bin/nutch ndfs -report
> > >>>Nutch generates some temporary data during processing.
> > >>>
> > >>>On 30.01.2006 at 00:54, Mike Smith wrote:
> > >>>
> > >>>>I forgot to mention that the namenode log file gives me thousands of
> > >>>>these:
> > >>>>
> > >>>>060129 155553 Zero targets found, forbidden1.size=2allowSameHostTargets=false forbidden2.size()=0
> > >>>>060129 155553 Zero targets found, forbidden1.size=2allowSameHostTargets=false forbidden2.size()=0
> > >
> > >--
> > >Ken Krugler
> > >Krugle, Inc.
> > >+1 530-470-9200
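
For anyone who wants to script the seed-splitting that Rafit describes above, here is a minimal sketch. It assumes a one-URL-per-line seed file and the bin/nutch crawl entry point shown in the stack trace; the file names, depth, and output directories are placeholders, not Nutch defaults.

    #!/usr/bin/env bash
    # Bisect a seed list: split it in half, crawl each half separately,
    # and keep bisecting whichever half reproduces the "Job failed!" error.

    SEEDS=urls.txt                                   # one URL per line (placeholder name)
    HALF=$(( ( $(wc -l < "$SEEDS") + 1 ) / 2 ))

    # split(1) writes part-aa and part-ab, each with at most $HALF lines
    split -l "$HALF" "$SEEDS" part-

    for part in part-aa part-ab; do
        mkdir -p "seeds-$part"
        cp "$part" "seeds-$part/"
        # Same crawl invocation that fails on the full list; -depth 2 because
        # the failure above shows up in the second fetch cycle. Adjust options
        # to match your setup.
        if bin/nutch crawl "seeds-$part" -dir "crawl-$part" -depth 2; then
            echo "$part: OK"
        else
            echo "$part: reproduces the failure"
        fi
    done

Rerunning this on the failing half each time narrows 80,000 URLs down to a handful in roughly 14-15 rounds.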

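Along the same lines, the disk-space checks that Stefan and Ken suggest can be run on each datanode/tasktracker box roughly as below. The df paths and the mapred.local.dir property are assumptions, not confirmed for this Nutch version; check conf/nutch-default.xml for the exact property names.

    #!/usr/bin/env bash
    # Quick checks for the "datanode out of disk space" theory.

    # NDFS view of capacity across datanodes (command from Stefan's mail)
    bin/nutch ndfs -report

    # Local view: the partitions temporary job data actually lands on.
    # /data is the large filesystem Ken mentions; adjust for your machines.
    df -h /tmp /data

    # If temporary data is landing on the small root volume, point it at the
    # large filesystem instead, e.g. in conf/nutch-site.xml (property names
    # may differ between versions -- verify against conf/nutch-default.xml):
    #
    #   <property>
    #     <name>mapred.local.dir</name>
    #     <value>/data/nutch/mapred</value>
    #   </property>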