Hello,

I configured Nutch to crawl and index my intranet periodically, and
now I'm trying to find the ideal merge process. I looked in the
list archive and found a discussion about it (please see below), but I
still have one question: solution #2 seems to be the standard approach,
but my problem is that when I have lots of segment dirs, I start
getting "Too many open files" exceptions.
So I need to merge them. After doing that, do I need to index them
again? Indexing all the content is an expensive process, and I
already have it indexed in the segment dirs.
Can't I use the merged index created by the "./nutch merge" facility?
The problem I've found is that the merged index I created
(solution 2) still points to the old segments. Can't I "update" the
index to point to the newly merged segment?
Shouldn't "./nutch mergesegs" create a merged index? I'm kind of
confused by this. :-)
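
For reference, a "Too many open files" error usually means the process
has hit its limit on open file descriptors. Below is a minimal sketch,
assuming a Unix-like system, that prints that limit next to the number
of segment directories. The "segments" path and both helper names are
hypothetical, and nothing here calls Nutch itself.

```python
import os
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_segment_dirs(segments_root):
    """Count the immediate subdirectories of segments_root.

    Assumes (hypothetically) that each segment lives in its own
    subdirectory directly under segments_root.
    """
    return sum(1 for entry in os.scandir(segments_root) if entry.is_dir())


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open-file limit: soft={soft}, hard={hard}")

    root = "segments"  # hypothetical location of the segment dirs
    if os.path.isdir(root):
        print(f"{count_segment_dirs(root)} segment dirs under {root}")
```

If the number of segment dirs grows toward the soft limit, that would be
consistent with the exception showing up only once many segments exist.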

Best regards,
Leonardo Barbosa.

From [EMAIL PROTECTED]
Thu Mar 10 18:58:58 2005

> Should I :
> 
> 1) merge all the segments and then index them, or 
> 2) Should I index each segment individually and then merge the indexes,
> keeping the segments separate. Or 
> 3) Should I index each segment separately, and keep both segments and
> indexes separate, and search across multiple indexes (but I have heard
> there are issues with the ranking) 

Option #3 is not really that great.  You get better performance with a 
merged index.  Option #1 would be more work with having to merge the 
segments, and I'm not sure that there is a real advantage to doing that 
over option #2.  Option #2 is what most people do.

Luke
