I crawl daily with the "nutch crawl" command and put each day's crawl into
its own new directory.

Let's say I accumulate a bunch (20+) of crawls (793 MB each). I can merge
all of their segments into one large segment (3.4 GB) in a single pass with
no problem, in approximately 3 hours. The problem happens when I add an
additional crawl. If I try to merge the segments of one new crawl (793 MB)
into the existing merged segment, the merge runs for 13+ hours and uses
almost 500 GB, at which point I usually cancel it because I believe
something is wrong. I would appreciate any insight into what is going wrong
here, as well as any tips on how to improve or fix this.

Thanks,
Mina
