There are several reports on the mailing list about this; basically, merging segments is very consuming in resources, both disk and CPU... To help, a few things can be done:

+ use compress mode for the Hadoop file system (see the config sketch below)
+ use (pseudo) distributed mode, since it will map & reduce in parallel while using less disk
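For the compression part, something along these lines in conf/hadoop-site.xml (or nutch-site.xml) should do it; the property names are from a Hadoop 0.19/0.20-era setup, so double-check them against your version:

  <!-- compress the intermediate map output -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- compress the job output written back to the file system -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>

For (pseudo) distributed mode, the usual Hadoop quickstart settings apply; the host/port values below are only the quickstart defaults, adjust them to your install:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>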
I just gave up on trying and only merge the indexes... meaning that you need to delete segments when they are getting old (rough commands at the end of this mail).

2009/9/29 Mina Azib <mina.a...@gmail.com>

> I am crawling daily and putting each new crawl into a new directory every
> day using the "nutch crawl" command.
>
> Let's say I get up to having a bunch (20+) of crawls (793 MB each) crawled.
> I can merge all of their segments into 1 large segment (3.4 GB) at the same
> time with no problem, in approx 3 hours or so. The problem happens when I
> add an additional crawl. If I try to merge the segments of a new individual
> crawl (793 MB) into the existing merged segments, the merge takes 13+ hours
> and almost 500 GB, at which point I usually cancel because I believe
> something is wrong. I would appreciate it if anyone has some insight as to
> what is going wrong here, and also any tips on how to improve/fix this.
>
> Thanks,
> Mina

--
-MilleBii-
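PS: a rough sketch of what I mean by merging only the indexes, using the stock Nutch tools; I am assuming the Nutch 1.0 style IndexMerger behind "bin/nutch merge", and the crawl directory names below are only examples, adapt them to your layout:

  # merge the per-crawl Lucene indexes into one new index
  bin/nutch merge merged-index crawl-20090928/indexes crawl-20090929/indexes

  # once a segment is too old to be of any use, just remove it
  # (use 'hadoop fs -rmr' instead of rm if it lives on HDFS)
  rm -rf crawl-20090801/segments/*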