I get the same error on a filesystem with 10 GB (disk space is a commodity here). The final crawl, when it succeeds on my Windows machine, is 93 MB, so I really hope it doesn't need more than 10 GB just to pull down and parse the first URL. Is there something about threading that could cause a job to start before the successful completion of a job it depends on? This is running on the same machine where 0.9 ran successfully, so the only differences are the JDK and the code.
Thanks again for taking a look at this with me.

From: <arkadi.kosmy...@csiro.au>
To: <nutch-user@lucene.apache.org>
Date: 04/20/2010 06:30 PM
Subject: RE: Hadoop Disk Error

1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir to a place with, say, 50GB free? Your task may be successful on Windows just because the temp space limit is different there.

From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Wednesday, 21 April 2010 3:40 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop Disk Error

Yes - how much free space does it need? We ran 0.9 using /tmp, and that has ~ 1 GB.
After I first saw this error, I moved it to another filesystem where I have 2 GB free (maybe not "gigs and gigs", but more than I think I need to complete a small test crawl?).

From: Julien Nioche <lists.digitalpeb...@gmail.com>
To: nutch-user@lucene.apache.org
Date: 04/20/2010 12:36 PM
Subject: Re: Hadoop Disk Error
________________________________

Hi Joshua,

The error message you got definitely indicates that you are running out of space. Have you changed the value of hadoop.tmp.dir in the config file?

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 20 April 2010 14:00, Joshua J Pavel <jpa...@us.ibm.com> wrote:

> I am - I changed the location to a filesystem with lots of free space and
> watched disk utilization during a crawl. It'll be a relatively small crawl,
> and I have gigs and gigs free.
>
> From: <arkadi.kosmy...@csiro.au>
> To: <nutch-user@lucene.apache.org>
> Date: 04/19/2010 05:53 PM
> Subject: RE: Hadoop Disk Error
> ------------------------------
>
> Are you sure that you have enough space in the temporary directory used by
> Hadoop?
>
> From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
> Sent: Tuesday, 20 April 2010 6:42 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Hadoop Disk Error
>
> Some more information, if anyone can help:
>
> If I turn fetcher.parse to "false", then it successfully fetches and crawls
> the site.
> and then bombs out with a larger ID for the job:
>
> 2010-04-19 20:34:48,342 WARN mapred.LocalJobRunner - job_local_0010
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0010/attempt_local_0010_m_000000_0/output/spill0.out
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>     at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> So, it's gotta be a problem with the parsing? The pages should all be
> UTF-8, and I know there are multiple languages involved. I tried setting
> parser.character.encoding.default to match, but it made no difference. I'd
> appreciate any ideas.
>
> From: Joshua J Pavel/Raleigh/i...@ibmus
> To: nutch-user@lucene.apache.org
> Date: 04/16/2010 03:05 PM
> Subject: Re: Hadoop Disk Error
> ________________________________
>
> fwiw, the error does seem to be valid: from the taskTracker/jobcache
> directory, I only have something for job 1-4.
>
> ls -la
> total 0
> drwxr-xr-x 6 root system 256 Apr 16 19:01 .
> drwxr-xr-x 3 root system 256 Apr 16 19:01 ..
> drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0001
> drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0002
> drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0003
> drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0004
>
> From: Joshua J Pavel/Raleigh/i...@ibmus
> To: nutch-user@lucene.apache.org
> Date: 04/16/2010 09:00 AM
> Subject: Hadoop Disk Error
> ________________________________
>
> We're just now moving from a nutch .9 installation to 1.0, so I'm not
> entirely new to this. However, I can't even get past the first fetch now,
> due to a hadoop error.
>
> Looking in the mailing list archives, normally this error is caused from
> either permissions or a full disk. I overrode the use of /tmp by setting
> hadoop.tmp.dir to a place with plenty of space, and I'm running the crawl
> as root, yet I'm still getting the error below.
>
> Any thoughts?
>
> Running on AIX with plenty of disk and RAM.
>
> 2010-04-16 12:49:51,972 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2010-04-16 12:49:52,267 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2010-04-16 12:49:52,268 INFO fetcher.Fetcher - -activeThreads=0,
> 2010-04-16 12:49:52,270 WARN mapred.LocalJobRunner - job_local_0005
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/spill0.out
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>     at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
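
The thread names permissions and a full disk as the usual causes of this error. A quick way to rule both out for the directory that hadoop.tmp.dir points at might look like the sketch below. It is not part of Nutch or Hadoop: the function name check_tmp_dir and the minimum-free-space argument are made up for illustration, and it only reports what the operating system says about the directory, which may differ from how Hadoop itself decides that a local directory is valid.

```python
import os
import shutil
import tempfile


def check_tmp_dir(path, min_free_bytes):
    """Return a list of problems with a candidate hadoop.tmp.dir (empty if none).

    Hypothetical helper for illustration; the checks are generic OS-level ones.
    """
    if not os.path.isdir(path):
        return ["%s is not an existing directory" % path]

    problems = []

    # Permission bits as seen by the current user (always passes for root).
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("%s is not writable by the current user" % path)

    # Free space on the filesystem that holds the directory.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(
            "%s has %d bytes free, wanted at least %d" % (path, free, min_free_bytes)
        )

    # Actually writing a file catches read-only mounts and quotas
    # that the permission bits alone do not reveal.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"x")
            probe.flush()
    except OSError as exc:
        problems.append("could not write a probe file in %s: %s" % (path, exc))

    return problems
```

Calling check_tmp_dir with the configured directory and, say, 2 * 1024 ** 3 as the threshold returns an empty list when the directory exists, is writable, and has at least that much space free. An empty result with the error still occurring would suggest looking beyond plain disk space and permissions.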