I finally found out why this problem happens; there seems to be a problem with the JS parser. When I used this:
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>

instead of the default one, which has the JS parser in it, I could fetch http://www.globalmedlaw.com/Canadam.html at depth 2. But when I use

  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>

the reduce fails at the end of fetching. (A sketch of the full property block is at the end of this message.) I came up with this solution because that page uses a redirected JS page for some of its dynamic content, and removing the JS plugin made it work fine. Now I am going to run a larger crawl over 100,000 seed URLs to see if this really solved the problem. Have you had any problems with the JS parser?

Mike.

On 1/31/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> After a couple of days of intensive searching over 100,000 URLs, I found
> the one that makes the Nutch crawl fail at the end of the reduce when it
> is fetching at cycle 2. The bad URL is:
>
> http://www.globalmedlaw.com/Canadam.html
>
> Do you have any idea why it brings down Nutch at the end of the second
> cycle of fetching with this error:
>
> 060131 162414 reduce 50%
> 060131 162432 reduce 100%
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:347)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:111)
>
> Thanks, Mike
>
> On 1/30/06, Mike Smith <[EMAIL PROTECTED]> wrote:
> >
> > I have a huge disk too, and the /tmp folder was fine with almost 200G of
> > free space on that partition, but it still fails. I am going to do the
> > same and look for the bad URL that causes the problem. But how come
> > Nutch is sensitive to a particular URL and fails!? It might be because
> > of the parser plugins.
> >
> > Mike.
> >
> > On 1/30/06, Rafit Izhak_Ratzin <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > > I don't think it's a problem of disk capacity, since I am working on
> > > a huge disk and only 10% is used.
> > >
> > > What I decided to do is split the seed into two parts and see if I
> > > still get this problem. One half ended successfully but the second
> > > had the same problem, so I continued with the splitting.
> > >
> > > I started with a group of 80,000 URLs and now I have a group of 5,000
> > > that has this problem when I run it. I will keep narrowing it down
> > > until I find the smallest group that has this problem and let you
> > > know about the seed.
> > >
> > > Thanks,
> > > Rafit
> > >
> > > >From: Ken Krugler <[EMAIL PROTECTED]>
> > > >Reply-To: [email protected]
> > > >To: [email protected]
> > > >Subject: Re: Problems with MapRed-
> > > >Date: Sun, 29 Jan 2006 16:42:15 -0800
> > > >
> > > >>This looks like the namenode has lost the connection to one of the
> > > >>datanodes. The default number of replications in NDFS is 3, and it
> > > >>seems like the namenode has only 2 in its list, so it logs this
> > > >>warning. As Stefan suggested, you should check the disk space on
> > > >>your machines. If I recall correctly, datanodes crash when they run
> > > >>out of disk space.
> > > >>
> > > >>This could also explain your problem with the fetching. One datanode
> > > >>runs out of disk space and crashes while one of the tasktrackers is
> > > >>writing data to it. You should also check if the partition with /tmp
> > > >>has enough free space.
> > > >
> > > >Yes, that can also happen.
> > > >
> > > >Especially if, like us, you accidentally configure Nutch to use a
> > > >directory on the root volume, but your servers have been configured
> > > >with a separate filesystem for /data, and that's where all the disk
> > > >capacity is located.
> > > >
> > > >-- Ken
> > > >
> > > >>Stefan Groschupf wrote:
> > > >>>Maybe the HDDs are full?
> > > >>>Try:
> > > >>>bin/nutch ndfs -report
> > > >>>Nutch generates some temporary data during processing.
> > > >>>
> > > >>>On 30.01.2006 at 00:54, Mike Smith wrote:
> > > >>>
> > > >>>>I forgot to mention that the namenode log file gives me thousands
> > > >>>>of these:
> > > >>>>
> > > >>>>060129 155553 Zero targets found,
> > > >>>>forbidden1.size=2allowSameHostTargets=false
> > > >>>>forbidden2.size()=0
> > > >>>>060129 155553 Zero targets found,
> > > >>>>forbidden1.size=2allowSameHostTargets=false
> > > >>>>forbidden2.size()=0
> > > >
> > > >--
> > > >Ken Krugler
> > > >Krugle, Inc.
> > > >+1 530-470-9200
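For concreteness, a minimal sketch of the property override described at the top of this message, assuming it is placed in conf/nutch-site.xml (the file location and the description text are assumptions; the property name and value are quoted verbatim above):

  <!-- Assumed location: conf/nutch-site.xml, which overrides nutch-default.xml.
       Leaving parse-js out of the list keeps the JS parser plugin from loading. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    <description>Plugins to load; parse-js is omitted to work around the failing parse.</description>
  </property>

Re-adding js to the parse-(text|html) group restores the default plugin set and, per the report above, the failure at the end of the fetch reduce.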

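As for the disk-space checks suggested in the quoted thread, they boil down to something like the following; bin/nutch ndfs -report is the command Stefan gives, and the df call is just a generic way to see how full the partition holding /tmp (or whatever directory Nutch is configured to write to) actually is:

  bin/nutch ndfs -report    # NDFS capacity/usage as reported via the namenode
  df -h /tmp                # free space on the partition holding temporary job data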