Apologies for filling the thread with troubleshooting. I tried the same configuration on an identical server, and I still get the exact same errors. I used the same configuration on a Windows system under Cygwin, and it works successfully. So now I'm wondering: is there some incompatibility with my OS or Java?
I'm running Nutch 1.0 on AIX 6.1.0.0, with:

java version "1.6.0"
Java(TM) SE Runtime Environment (build pap6460sr5-20090529_04(SR5))
IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 AIX ppc64-64 jvmap6460sr5-20090519_35743 (JIT enabled, AOT enabled)
J9VM - 20090519_035743_BHdSMr
JIT  - r9_20090518_2017
GC   - 20090417_AA)
JCL  - 20090529_01

It's the same OS as I was using to run Nutch 0.9, but with a different version of Java.

________________________________
From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/20/2010 09:01 AM
Subject: RE: Hadoop Disk Error

I am - I changed the location to a filesystem with lots of free space and watched disk utilization during a crawl. It's a relatively small crawl, and I have gigs and gigs free.

________________________________
From: <arkadi.kosmy...@csiro.au>
To: <nutch-user@lucene.apache.org>
Date: 04/19/2010 05:53 PM
Subject: RE: Hadoop Disk Error

Are you sure that you have enough space in the temporary directory used by Hadoop?

________________________________
From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Tuesday, 20 April 2010 6:42 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop Disk Error

Some more information, if anyone can help: if I turn fetcher.parse to "false", then it successfully fetches and crawls the site.
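For reference, parse-during-fetch is controlled by the fetcher.parse property in conf/nutch-site.xml. A minimal sketch of the override described above, with the enclosing <configuration> element omitted and the comment added for illustration:

<property>
  <name>fetcher.parse</name>
  <!-- false = fetch only; parsing runs as a separate step afterwards -->
  <value>false</value>
</property>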
With fetcher.parse off, the crawl then bombs out with a larger job ID:

2010-04-19 20:34:48,342 WARN mapred.LocalJobRunner - job_local_0010
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0010/attempt_local_0010_m_000000_0/output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

So it has to be a problem with the parsing? The pages should all be UTF-8, and I know there are multiple languages involved. I tried setting parser.character.encoding.default to match, but it made no difference. I'd appreciate any ideas.

________________________________
From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 03:05 PM
Subject: Re: Hadoop Disk Error

FWIW, the error does seem to be valid: in the taskTracker/jobcache directory, I only have entries for jobs 1 through 4.

ls -la
total 0
drwxr-xr-x 6 root system 256 Apr 16 19:01 .
drwxr-xr-x 3 root system 256 Apr 16 19:01 ..
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0001
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0002
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0003
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0004

________________________________
From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 09:00 AM
Subject: Hadoop Disk Error

We're just now moving from a Nutch 0.9 installation to 1.0, so I'm not entirely new to this. However, I can't even get past the first fetch now, due to a Hadoop error. Looking in the mailing list archives, this error is normally caused by either permissions or a full disk. I overrode the use of /tmp by setting hadoop.tmp.dir to a place with plenty of space, and I'm running the crawl as root, yet I'm still getting the error below. Any thoughts? I'm running on AIX with plenty of disk and RAM.
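For reference, the hadoop.tmp.dir override described here is an ordinary property in conf/hadoop-site.xml (Nutch 1.0 merges it into the same Hadoop configuration). A minimal sketch, where the path is only a placeholder:

<property>
  <name>hadoop.tmp.dir</name>
  <!-- placeholder path: any filesystem with ample free space -->
  <value>/bigfs/hadoop-tmp</value>
</property>

The spill file named in the stack traces is allocated under mapred.local.dir, which by default resolves to ${hadoop.tmp.dir}/mapred/local, so moving hadoop.tmp.dir should move the spill directory as well.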
2010-04-16 12:49:51,972 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-04-16 12:49:52,267 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-04-16 12:49:52,268 INFO fetcher.Fetcher - -activeThreads=0,
2010-04-16 12:49:52,270 WARN mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
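One quick way to test what the DiskChecker is complaining about is to probe the directory the spills actually go to. The sketch below is a hypothetical standalone check, not part of Nutch or Hadoop; pass it whatever mapred.local.dir resolves to on your system (by default ${hadoop.tmp.dir}/mapred/local):

import java.io.File;
import java.io.IOException;

// Hypothetical standalone probe: checks that a candidate spill directory
// exists, is writable, and has free space, then attempts a real write -
// roughly the conditions Hadoop's LocalDirAllocator/DiskChecker require.
public class CheckSpillDir {
    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0]
                : System.getProperty("java.io.tmpdir"));
        System.out.println("dir       = " + dir.getAbsolutePath());
        System.out.println("exists    = " + dir.exists());
        System.out.println("canWrite  = " + dir.canWrite());
        // getUsableSpace() requires Java 6, which matches the JVM above
        System.out.println("usable MB = " + dir.getUsableSpace() / (1024 * 1024));
        File probe = File.createTempFile("spill-probe", null, dir); // real write test
        System.out.println("write test ok: " + probe.getName());
        probe.delete();
    }
}

If this probe succeeds as root in the same directory while the crawl still fails, that would point away from plain space or permissions problems and toward how the directory is being reported to Hadoop on this particular OS/JVM combination.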