I am sure - I changed the location to a filesystem with lots of free space and
watched disk utilization during a crawl. It'll be a relatively small
crawl, and I have gigs and gigs free.


From: <arkadi.kosmy...@csiro.au>
To: <nutch-user@lucene.apache.org>
Date: 04/19/2010 05:53 PM
Subject: RE: Hadoop Disk Error

Are you sure that you have enough space in the temporary directory used by
Hadoop?
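
That "Could not find any valid local directory" message is raised by Hadoop's LocalDirAllocator when none of its configured local directories has room for the file it wants to write. For the spill files in the trace, those directories come from mapred.local.dir, which defaults to ${hadoop.tmp.dir}/mapred/local, so both properties are worth checking. A minimal override in conf/hadoop-site.xml might look like this (the /bigfs/hadoop paths are placeholders - point them at whatever filesystem has the space):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/bigfs/hadoop/tmp</value>  <!-- placeholder path -->
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/bigfs/hadoop/mapred/local</value>  <!-- placeholder path -->
</property>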

From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Tuesday, 20 April 2010 6:42 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop Disk Error


Some more information, if anyone can help:

If I set fetcher.parse to "false", then it successfully fetches and crawls
the site, and then bombs out with a higher job ID:

2010-04-19 20:34:48,342 WARN mapred.LocalJobRunner - job_local_0010
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0010/attempt_local_0010_m_000000_0/output/spill0.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
      at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

So, it's gotta be a problem with the parsing? The pages should all be
UTF-8, and I know there are multiple languages involved. I tried setting
parser.character.encoding.default to match, but it made no difference. I'd
appreciate any ideas.
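
For reference, the override I tried in conf/nutch-site.xml was along these lines:

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>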


From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 03:05 PM
Subject: Re: Hadoop Disk Error

fwiw, the error does seem to be valid: in the taskTracker/jobcache
directory, I only have something for jobs 1-4.

ls -la
total 0
drwxr-xr-x 6 root system 256 Apr 16 19:01 .
drwxr-xr-x 3 root system 256 Apr 16 19:01 ..
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0001
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0002
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0003
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0004


From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 09:00 AM
Subject: Hadoop Disk Error

We're just now moving from a Nutch 0.9 installation to 1.0, so I'm not
entirely new to this.  However, I can't even get past the first fetch now,
due to a Hadoop error.

Looking in the mailing list archives, this error is normally caused by
either permissions or a full disk.  I overrode the use of /tmp by setting
hadoop.tmp.dir to a place with plenty of space, and I'm running the crawl
as root, yet I'm still getting the error below.
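
For reference, the override in conf/hadoop-site.xml is along these lines (the path here is a placeholder for the real one):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/bigfs/hadoop-tmp</value>  <!-- placeholder path -->
</property>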

Any thoughts?

Running on AIX with plenty of disk and RAM.

2010-04-16 12:49:51,972 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-04-16 12:49:52,267 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-04-16 12:49:52,268 INFO  fetcher.Fetcher - -activeThreads=0,
2010-04-16 12:49:52,270 WARN  mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/spill0.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
      at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

