Thanks Doug. This has been helpful. But now I have another question.

Since our corpus has a lot of large segments, is it possible to merge them incrementally? For example, if there are 30 segments, could they be merged in groups of five, and the resulting six indexes then merged again into one? One reason for doing this is that merging all of them at once takes a while, and it would be nice to test some along the way.

If that is possible, could someone give a few pointers?

matt

[EMAIL PROTECTED] wrote:
Message: 2
Date: Fri, 19 Mar 2004 08:31:02 -0800
From: Doug Cutting <[EMAIL PROTECTED]>
To:  [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Merging segemnts
Reply-To: [EMAIL PROTECTED]

Matt Craig wrote:

bin/nutch merge segments/index seg1/* seg2/*

I had thought that "segments" would be a directory that I could now use as my segments directory to search through Tomcat, but instead of the directory having "fetcher*" and "index*" in it, it looks like it is just the index directory.


This is a good question, and an issue that both needs documentation and should be fixed.

The 'merge' command just merges indexes. The other stuff in your segments directory is still needed. So if, for example, your segments are in a directory called 'segments', they've all been indexed, and you've run duplicate detection, then you're ready to merge.

You merge with something like:

bin/nutch merge . segments/*

This creates a merged index, containing the contents of all of the segments/*/index, in a new directory named after those segments, in your case '20030422113844-0_20030423144418-2'.

Here's the bug. NutchBean looks for a merged index in a directory named 'index'. So, to make things work, you currently have to manually rename the merged index directory to be just 'index':

mv 20030422113844-0_20030423144418-2 index
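A slightly more defensive version of that rename can be sketched as a small shell function. This is only an illustration, assuming a POSIX shell; the name rename_merged_index is hypothetical and not part of Nutch. It refuses to overwrite an 'index' directory that already exists:

```shell
# Hypothetical helper (not part of Nutch): rename a merged index
# directory to 'index' inside the same parent directory.
rename_merged_index() {
    merged=$1
    parent=$(dirname "$merged")

    # The merged index directory must exist.
    if [ ! -d "$merged" ]; then
        echo "no such directory: $merged" >&2
        return 1
    fi

    # Do not clobber an 'index' that is already there.
    if [ -e "$parent/index" ]; then
        echo "$parent/index already exists; move it aside first" >&2
        return 1
    fi

    mv "$merged" "$parent/index"
}
```

With the example above, it would be called as: rename_merged_index 20030422113844-0_20030423144418-2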

If you run Tomcat while connected to a directory with subdirectories named 'index' and 'segments', it will use the merged index data in 'index' and get the rest of the segment data from the 'segments' directory. Searches are much faster with a merged index.
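Before starting Tomcat, that layout can be checked with something like the following. It is a minimal sketch, assuming a POSIX shell; check_search_dir is a hypothetical name, not a Nutch command, and it only tests that the two subdirectories exist, not that their contents are valid:

```shell
# Hypothetical helper (not part of Nutch): succeed only if the given
# directory (default: current directory) has both 'index' and
# 'segments' subdirectories.
check_search_dir() {
    dir=${1:-.}
    for sub in index segments; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            return 1
        fi
    done
    return 0
}
```

Running check_search_dir with no argument in the directory you start Tomcat from reports the first missing subdirectory, if any, on standard error.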

The fix is that the merge code should not generate an index name. It should either use the name given on the command line, or better yet, not take a name on the command line and always use 'index'. Would that make it a bit less confusing? We also need better documentation for this kind of stuff...

Cheers,

Doug




_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
