Arkadi,

thanks a lot for these suggestions. Indeed, skipping the merge step seems perfectly acceptable, as I have a fairly reasonable number of segments. As an experiment I also attempted to perform my crawl with a single segment (which avoids the merge question entirely), but my cluster doesn't seem to be able to handle the amount of disk space required during the fetch's reduce tasks. Somewhat frustrating, as I only have about 1.2M URLs in this crawl.

At any rate, thanks again for your comments!

-yp

arkadi.kosmy...@csiro.au wrote:
There are two solutions:

1. Write a lightweight version of the segment merger, which should not be hard if you are familiar with Hadoop.
2. Don't merge segments. If you have a reasonable number of segments, even in the 100s, Nutch can still handle this.
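
To give a concrete idea of option 1, here is a rough, untested sketch (not Nutch's actual SegmentMerger): a single map/reduce job over the crawl_fetch part of each finished segment that keeps, for each URL, the entry with the latest fetch time. The class name and argument handling are placeholders, and a complete merger would also have to cover content, parse_data, parse_text and crawl_parse.

// Sketch only: merges the crawl_fetch (Text -> CrawlDatum) part of several
// finished segments, keeping the most recently fetched entry per URL.
// Uses the old org.apache.hadoop.mapred API that Nutch 1.x runs on.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class LightweightSegmentMerger {

  // Among all CrawlDatum entries seen for a URL, keep the newest one.
  public static class LatestDatumReducer extends MapReduceBase
      implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {
    public void reduce(Text url, Iterator<CrawlDatum> values,
        OutputCollector<Text, CrawlDatum> out, Reporter reporter)
        throws IOException {
      CrawlDatum latest = null;
      while (values.hasNext()) {
        CrawlDatum d = new CrawlDatum();
        d.set(values.next()); // copy, since Hadoop reuses the value object
        if (latest == null || d.getFetchTime() > latest.getFetchTime()) {
          latest = d;
        }
      }
      out.collect(url, latest);
    }
  }

  // Usage (hypothetical): hadoop jar ... LightweightSegmentMerger seg1 seg2 ... mergedSeg
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(NutchConfiguration.create(), LightweightSegmentMerger.class);
    job.setJobName("lightweight merge of crawl_fetch");

    // Inputs: the crawl_fetch subdirectory of each finished segment.
    for (int i = 0; i < args.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(args[i], CrawlDatum.FETCH_DIR_NAME));
    }
    // Output: the crawl_fetch subdirectory of the merged segment.
    FileOutputFormat.setOutputPath(job,
        new Path(args[args.length - 1], CrawlDatum.FETCH_DIR_NAME));

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class); // crawl_fetch is stored as MapFiles
    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(LatestDatumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);

    JobClient.runJob(job);
  }
}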

Regards,

Arkadi

-----Original Message-----
From: Yves Petinot [mailto:y...@snooth.com]
Sent: Saturday, 27 March 2010 2:21 AM
To: nutch-user@lucene.apache.org
Subject: Re: Running out of disk space during segment merger

Thanks a lot for your reply, Arkadi. It's good to know that this is
indeed a known problem. With the current version, is there anything one
can do other than throwing a huge amount of temp disk space at each
node? Based on my experience, and given a replication factor of N, would
a rule of thumb of reserving roughly N TB per 1M URLs make sense?

-y

arkadi.kosmy...@csiro.au wrote:
Hi Yves,

Yes, what you got is a "normal" result. This issue is discussed every
few months on this list. To my mind, the segment merger is too general:
it assumes that the segments are at arbitrary stages of completion and
works on that assumption. But this is not the common case at all.
Mostly, people just want to merge finished segments, and the algorithm
could be much cheaper in that case.

Regards,

Arkadi

-----Original Message-----
From: Yves Petinot [mailto:y...@snooth.com]
Sent: Friday, 26 March 2010 6:01 AM
To: nutch-user@lucene.apache.org
Subject: Running out of disk space during segment merger

Hi,

I was wondering if other people on the list have been facing issues
with the segment merge phase, with all the nodes on their Hadoop cluster
eventually running out of disk space? The type of error I'm getting
looks like this:

java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
FSError: java.io.IOException: No space left on device

To give a little background, I'm currently running this on a 3-node
cluster, each node having 500 GB drives that are mostly empty at the
beginning of the process (~400 GB available on each node). The
replication factor is set to 2 and I also enabled Hadoop block
compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7
segments to merge, one of them being 9 GB and the others ranging from 1
to 3 GB in size), so intuitively there should be plenty of space
available for the merge operation, but we still end up running out of
space during the reduce phase (7 reduce tasks). I'm currently trying to
increase the number of reduce tasks to limit the resource/disk
consumption of any given task, but I'm wondering if someone has
experienced this type of issue before and whether there is a better way
of approaching it. For instance, would using the multiple output
segments option help in decreasing the amount of temp disk space needed
at any given time?
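
For reference (and purely as an illustration, not a recommendation), the knobs being discussed here correspond to standard Hadoop job properties. A small Java sketch follows, with the equivalent property names noted in comments; the reducer count and local directory paths are made-up values.

// Illustrative settings only; in a normal Nutch deployment the equivalent
// properties would go into conf/mapred-site.xml or nutch-site.xml.
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.util.NutchConfiguration;

public class MergeJobSettings {
  public static JobConf configure() {
    JobConf job = new JobConf(NutchConfiguration.create());

    // More reducers means a smaller spill per task on any single node
    // (property: mapred.reduce.tasks). The value 21 is just an example.
    job.setNumReduceTasks(21);

    // Compress intermediate map output to cut local temp-disk usage
    // (property: mapred.compress.map.output).
    job.setCompressMapOutput(true);

    // Spread intermediate files over several local volumes, if the nodes have them
    // (property: mapred.local.dir, a comma-separated list; paths are hypothetical).
    job.set("mapred.local.dir", "/disk1/mapred/local,/disk2/mapred/local");

    return job;
  }
}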

many thanks in advance,

-yp

--
Yves Petinot
Senior Software Engineer
www.snooth.com
y...@snooth.com

