Matt Craig wrote:
bin/nutch merge segments/index seg1/* seg2/*

I had thought that "segments" would be a directory that I could now use as my segments directory to search through Tomcat, but instead of the directory having "fetcher*" and "index*" in it, it looks like it is just the index directory.

This is a good question, and it points to an issue that both needs documentation and should be fixed.


The 'merge' command just merges indexes. The other stuff in your segments directory is still needed. So if, for example, your segments are in a directory called 'segments', they've all been indexed, and you've run duplicate detection, then you're ready to merge.

You merge with something like:

bin/nutch merge . segments/*

This creates a merged index, containing the contents of all of the segments/*/index, in a new directory named after those segments, in your case '20030422113844-0_20030423144418-2'.

Here's the bug. NutchBean looks for a merged index in a directory named 'index'. So, to make things work, you currently have to manually rename the merged index directory to be just 'index':

mv 20030422113844-0_20030423144418-2 index

If you start Tomcat from a directory that has subdirectories named 'index' and 'segments', it will use the merged index data in 'index' and get the rest of the segment data from the 'segments' directory. Searches are much faster with a merged index.
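Before starting Tomcat, it can help to confirm that the directory really has both subdirectories. The following is only a minimal sketch: the function name check_search_dir is made up for illustration and is not part of Nutch, and it checks nothing beyond the presence of the two directories.

```shell
# Hypothetical helper (not part of Nutch): verify that a directory has the
# 'index' and 'segments' subdirectories described above.
# Usage: check_search_dir [dir]    (dir defaults to the current directory)
check_search_dir() {
    dir=${1:-.}
    rc=0
    for sub in index segments; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            rc=1
        fi
    done
    return $rc
}
```

Calling check_search_dir with no argument in the directory you intend to start Tomcat from returns non-zero and names whichever directory is missing.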

The fix is that the merge code should not generate an index name. It should either use the name given on the command line, or better yet, not take a name on the command line and always use 'index'. Would that make it a bit less confusing? We also need better documentation for this kind of stuff...
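Until the merge code is changed, a small wrapper can do the rename for you. This is a rough sketch, not the proposed fix itself: the function name merge_to_index and the NUTCH variable are invented for illustration, and it assumes the merge creates exactly one new directory in the current directory, as described above.

```shell
# Hypothetical wrapper (not part of Nutch): run the merge, then rename the
# newly created merged-index directory to 'index'.
# Assumes it is run from the directory that contains 'segments'.
# NUTCH is an illustrative variable naming the nutch script to run.
merge_to_index() {
    nutch_cmd=${NUTCH:-bin/nutch}
    if [ -e index ]; then
        echo "'index' already exists; move it aside first" >&2
        return 1
    fi
    # Remember what exists before the merge so the new directory can be found.
    before=$(ls)
    "$nutch_cmd" merge . segments/* || return 1
    for d in */; do
        d=${d%/}
        [ -d "$d" ] || continue
        [ "$d" = segments ] && continue
        # A directory that was not listed before the merge is the merged index.
        if ! printf '%s\n' "$before" | grep -qxF -- "$d"; then
            mv "$d" index
            return 0
        fi
    done
    echo "no new merged index directory found" >&2
    return 1
}
```

The wrapper refuses to run if 'index' already exists, so an earlier merged index is never overwritten silently.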

Cheers,

Doug


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
