Sean Dean wrote:
> I think the general rule is that you will need about 2.5 to 3 times the size
> of the final product. This is because Hadoop creates the reduce-side files
> after the map outputs are produced, before the map outputs can be removed.
> 
> I'm not aware of any way to change this; I think it's just "normal"
> functionality.

The space consumption is at its worst in a single-machine configuration,
where you have to process all the data on one machine. If you have more
machines to spare, then the space required per machine can (obviously) be
divided roughly by the number of machines.

I think the only way to cut down your temp space requirements, apart from
compression (I believe it is possible to compress the temp data; see the
sketch below), is to do your work in smaller slices.
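
For what it's worth, here is a minimal sketch of what I mean by compressing
the temp data. It assumes the Hadoop bundled with your Nutch already knows the
mapred.compress.map.output property, so check your hadoop-default.xml before
relying on it:

  <!-- conf/hadoop-site.xml (sketch): compress the intermediate map output
       before it is spilled to the local temp directories and merged for the
       reduce. Only useful if your Hadoop version ships this property. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

I have not measured how much this actually saves during the reduce-side
sort/merge, so treat it as something to experiment with rather than a
guaranteed fix.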

--
 Sami Siren
> 
>  
> ----- Original Message ----
> From: qi wu <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 11, 2007 10:41:35 AM
> Subject: Re: How to reduce the tmp disk space usage during linkdb process?
> 
> 
> One more general question related to this issue: how do you estimate the
> temp space required by the overall process, which includes fetching,
> updating the crawldb, building the linkdb, and indexing?
> In my case, 20G of crawldb and segments needs more than 36G of temp space
> just for building the linkdb, which sounds unreasonable!
> 
> ----- Original Message ----- 
> From: "qi wu" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, April 11, 2007 10:15 PM
> Subject: Re: How to reduce the tmp disk space usage during linkdb process?
> 
> 
>> It's impossible for me to change to 0.9 now. Anyway, thank you!
>>
>> ----- Original Message ----- 
>> From: "Sean Dean" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Wednesday, April 11, 2007 9:33 PM
>> Subject: Re: How to reduce the tmp disk space usage during linkdb process?
>>
>>
>>> Nutch 0.9 can apply zlib or lzo2 compression on your linkdb (and crawldb) 
>>> to reduce overall space. The average compression ratio using zlib is about 
>>> 6:1 on those two databases and doesn't slow additions or segment creation 
>>> down.
>>>
>>> Keep in mind, this currently only works officially on Linux and 
>>> unofficially on FreeBSD.
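
I believe the setting behind the compression Sean describes is Hadoop's
SequenceFile block compression, since the crawldb and linkdb are stored as
MapFiles/SequenceFiles. A minimal sketch, assuming the
io.seqfile.compression.type property is present in the Hadoop bundled with
Nutch 0.9 (again, check hadoop-default.xml):

  <!-- conf/hadoop-site.xml (sketch): block-compress SequenceFile/MapFile
       data, the format the crawldb and linkdb are written in.
       Assumes io.seqfile.compression.type exists in your Hadoop version. -->
  <property>
    <name>io.seqfile.compression.type</name>
    <value>BLOCK</value>
  </property>

As far as I know the default codec is zlib through the native libraries,
which is presumably where the Linux-only caveat comes from.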
>>>
>>>
>>> ----- Original Message ----
>>> From: qi wu <[EMAIL PROTECTED]>
>>> To: [email protected]
>>> Sent: Wednesday, April 11, 2007 9:01:30 AM
>>> Subject: How to reduce the tmp disk space usage during linkdb process?
>>>
>>>
>>> Hi,
>>> I have crawled nearly 3 million pages, which are kept in 13 segments, and
>>> there are 10 million entries in the crawldb. I use Nutch 0.8.1 on a single
>>> Linux box. Currently the disk space occupied by the crawldb and segments is
>>> about 20G, and the machine still has 36G free. Building the linkdb always
>>> fails, with the error caused by running out of space during the reduce
>>> phase; the exception is listed below:
>>> job_f506pk
>>> org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>>>        at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
>>>        at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
>>>        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
>>>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
>>>        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
>>>        at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
>>>        at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
>>>        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
>>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
>>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
>>>
>>> I wonder why so much space is required by the linkdb reduce job. Can I
>>> configure some Nutch or Hadoop setting to reduce the disk space usage for
>>> the linkdb? Any hints to help me overcome the problem? //bow
>>>
>>> Thanks
>>> -Qi
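
On the question above about Nutch/Hadoop settings: besides compression and
smaller slices, if another partition on the box has more free space you can at
least avoid the "No space left on device" error by pointing Hadoop's local
working directories at it. A minimal sketch for conf/hadoop-site.xml; the
/big_disk paths below are only placeholders:

  <!-- conf/hadoop-site.xml (sketch): keep the job's local working files on a
       larger partition. Replace the placeholder paths with real ones. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/big_disk/hadoop-tmp</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/big_disk/mapred-local</value>
  </property>

This does not shrink the temp data at all; it only gives the reduce-side
sort/merge enough room to finish.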

