There are several reports on the mailing list about this; basically, merging
segments is very resource-intensive (disk & CPU).
A few things can be done to help:
+ use compressed output for the Hadoop file system (see the config sketch below)
+ use (pseudo) distributed mode, since it will map & reduce in parallel and
use less disk
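
For the compression part, on the Hadoop 0.19/0.20 line that Nutch was using
around that time this would go into conf/hadoop-site.xml; something along
these lines (property names are version-dependent, so treat it as a sketch
rather than a recipe):

  <!-- compress intermediate map output to cut down on spill/shuffle disk usage -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- compress the final job output (the merged segment data) as well -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <!-- BLOCK compression is usually the better trade-off for SequenceFiles -->
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>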

I just gave up on trying and only merge the indexes (roughly as in the
commands below); that means you need to delete segments when they get old.
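
For reference, index merging on Nutch 0.9/1.0 is just the 'merge' command
(IndexMerger); roughly like this, with hypothetical paths, and with the exact
usage to be double-checked against bin/nutch on your version:

  # merge the per-crawl Lucene indexes into a single index
  bin/nutch merge crawl/merged-index crawl/crawl-20090901/indexes crawl/crawl-20090902/indexes

  # once a crawl's pages have expired, its segment can simply be dropped
  rm -r crawl/crawl-20090901/segments    (or 'hadoop fs -rmr ...' on DFS)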

2009/9/29 Mina Azib <mina.a...@gmail.com>

> I am crawling daily and putting each new crawl into a new directory every
> day using the "nutch crawl" command.
>
> Let's say I get up to having a bunch (20+) of crawls (793 MB each). I can
> merge all of their segments into one large segment (3.4 GB) at the same
> time with no problem, in approx. 3 hours or so. The problem happens when I
> add an additional crawl. If I try to merge the segments of a new individual
> crawl (793 MB) into the existing merged segments, the merge takes 13+ hours
> and almost 500 GB, at which point I usually cancel because I believe
> something is wrong. I would appreciate it if anyone has some insight as to
> what is going wrong here, and also any tips on how to improve/fix this.
>
> Thanks,
> Mina
>



-- 
-MilleBii-
