Hello,

I configured nutch to crawl and index my intranet periodically, and now I'm trying to find the ideal merge process. I looked in the list archive and found a discussion about it (please see below), but I still have one question.

Solution #2 seems to be the standard one, as far as I can tell. My problem is that when I have lots of segment dirs, I start to get "Too many open files" exceptions, so I need to merge the segments. If I do that, do I need to index the merged segment again? Indexing all the content is an expensive process, and I already have it indexed in the segment dirs.

Can't I use the merged index created by the "./nutch merge" facility? The problem I've found is that the merged index I created (solution 2) still points to the old segments. Can't I "update" the index to point to the new, freshly merged segment? Shouldn't "./nutch mergesegs" create a merged index? I'm kind of confused by this.. :-)
Best regards,
Leonardo Barbosa.

>From [EMAIL PROTECTED] Thu Mar 10 18:58:58 2005

> Should I :
>
> 1) merge all the segments and then index them, or
> 2) Should I index each segment individually and then merge the indexes,
> keeping the segments separate. Or
> 3) Should I index each segment separately, and keep both segments and
> indexes separate, and search across multiple indexes (but I have heard
> there are issues with the ranking)

Option #3 is not really that great. You get better performance with a merged index.

Option #1 would be more work with having to merge the segments, and I'm not sure that there is a real advantage to doing that over option #2.

Option #2 is what most people do.

Luke
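
For anyone hitting the same "Too many open files" exception, here is a rough sketch of how one might check how close a segments directory is to the limit. It assumes the exception comes from the per-process open-file limit of the operating system (Unix only), and that each segment lives in its own subdirectory of a single root. The function names and the directory passed on the command line are hypothetical, and this does not call nutch at all; it only counts directories and files.

```python
import os
import resource  # Unix only
import sys


def count_segment_dirs(segments_root):
    """Count the immediate subdirectories of segments_root (assumed: one per segment)."""
    return sum(
        1
        for name in os.listdir(segments_root)
        if os.path.isdir(os.path.join(segments_root, name))
    )


def count_files(segments_root):
    """Count every regular file below segments_root.

    This is only an upper bound on what a process could hold open
    if it opened all of them at once.
    """
    return sum(len(files) for _dirpath, _dirs, files in os.walk(segments_root))


def open_file_limit():
    """Return the soft per-process limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # Hypothetical usage: pass the segments root as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print("segment dirs:", count_segment_dirs(root))
    print("files below root:", count_files(root))
    print("soft open-file limit:", open_file_limit())
```

If the file count is close to or above the soft limit, that would be consistent with the exception appearing as the number of segment dirs grows, although the number of files actually held open depends on how the searcher opens them.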
