Doğacan Güney wrote:
> On Wed, Jul 29, 2009 at 13:11, reinhard schwab<reinhard.sch...@aon.at> wrote:
>> Doğacan Güney wrote:
>>> On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak<tpol...@gmail.com> wrote:
>>>> Hi,
>>>> thanks for your answers, I've configured compression:
>>>>
>>>> mapred.output.compress = true
>>>> mapred.compress.map.output = true
>>>> mapred.output.compression.type = BLOCK
>>>>
>>>> (in xml format in hadoop-site.xml)
>>>>
>>>> and it works (and uses less disk space, no more out-of-disk-space
>>>> exceptions), but merging now takes a really long time. My next question
>>>> is simple:
>>>> Is segment merging a necessary step (if I don't need everything in one
>>>> segment and don't need the optional filtering), or is it OK to proceed
>>>> with indexing? I ask because many tutorials and most re-crawl scripts
>>>> include this step.
>>>>
>>> Not really. But if you recrawl a lot, old versions of pages will stay
>>> on your disk, taking up unnecessary space.
>>>
>>> To improve compression speed, take a look at:
>>>
>>> http://code.google.com/p/hadoop-gpl-compression/
>>>
>>> Lzo (de)compression is *very* fast.
>>>
>> I also find that segment merging heavily uses resources such as CPU and
>> disk, although the document collection crawled so far is very small,
>> ~25,000 documents; the segments contain about 650 MB of data.
>> It's really a showstopper for me.
>> It would be very helpful to have a FAQ entry or some documentation
>> about how to improve the performance of the segment merge task.
>>
> You may be interested in:
>
> http://issues.apache.org/jira/browse/NUTCH-650
>
> With hbase integration, we completely do away with many things like
> segment merging.
>
> I intend to commit the initial hbase code to a nutch branch this week
> (and write a wiki guide about it). Many features are missing, but the
> code should be stable enough to test.
>
Sounds great. Seems to be a bigger patch! ;)) I will be happy to try and
test it... The more I/O operations are reduced, the better the
performance will be. My disk will also be happy with less stress.
>
>>>> Tomislav
>>>>
>>>> On Wed, 2009-07-15 at 21:04 +0300, Doğacan Güney wrote:
>>>>> On Wed, Jul 15, 2009 at 20:45, MilleBii<mille...@gmail.com> wrote:
>>>>>> Are you on a single-node conf?
>>>>>> If yes, I have the same problem, and some people have suggested
>>>>>> earlier to use the hadoop pseudo-distributed config on a single
>>>>>> server. Others have also suggested using hadoop's compressed mode.
>>>>>>
>>>>> Yes, that's a good point. Playing around with these options may help:
>>>>>
>>>>> mapred.output.compress
>>>>> mapred.output.compression.type (BLOCK may help a lot here)
>>>>> mapred.compress.map.output
>>>>>
>>>>>> But I have not been able to make it work on my PC because I get
>>>>>> bogged down by some windows/hadoop compatibility issues.
>>>>>> If you are on Linux you may have more luck; I'm interested in your
>>>>>> results, by the way, so I know whether moving to Linux would solve
>>>>>> those problems for me.
>>>>>>
>>>>>> 2009/7/15 Doğacan Güney <doga...@gmail.com>
>>>>>>> On Wed, Jul 15, 2009 at 19:31, Tomislav Poljak<tpol...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>> I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages,
>>>>>>>> contained in 10 segments, on one machine, using:
>>>>>>>>
>>>>>>>> bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
>>>>>>>>
>>>>>>>> but there is not enough space on a 500G disk to complete this merge
>>>>>>>> task (getting java.io.IOException: No space left on device in
>>>>>>>> hadoop.log).
>>>>>>>>
>>>>>>>> Shouldn't 500G be enough disk space for this merge? Is this a bug?
>>>>>>>> If this is not a bug, how much disk space is required for this
>>>>>>>> merge?
>>>>>>>>
>>>>>>> A lot :)
>>>>>>>
>>>>>>> Try deleting your hadoop temporary folders. If that doesn't help,
>>>>>>> you may try merging segment parts one by one. For example, move your
>>>>>>> content/ directories out and try merging again. If successful, you
>>>>>>> can then merge the contents later and move the resulting content/
>>>>>>> into your merge_seg dir.
>>>>>>>
>>>>>>>> Tomislav
>>>>>>>>
>>>>>>> --
>>>>>>> Doğacan Güney
>>>>>>>
>>>>>> --
>>>>>> -MilleBii-
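
[Editor's note] The compression settings discussed in this thread are ordinary
Hadoop properties, so they belong in conf/hadoop-site.xml of the Hadoop release
bundled with Nutch 1.0. The snippet below is a minimal illustrative sketch, not
copied from the thread; the property names are the ones quoted above, and BLOCK
compression applies to SequenceFile outputs (which is what Nutch segment data is
stored as):

    <configuration>
      <!-- compress job (reduce) output, e.g. segment data written by fetch/parse -->
      <property>
        <name>mapred.output.compress</name>
        <value>true</value>
      </property>
      <!-- compress intermediate map output to cut disk I/O during shuffle -->
      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
      <!-- BLOCK compression for SequenceFile outputs usually compresses best -->
      <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
      </property>
    </configuration>

If the hadoop-gpl-compression codecs linked above are installed, the same
mechanism can be pointed at the LZO codec for faster (de)compression.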