Isn't it better for Dan to skip the optimization phase before merging? I'm not sure, but it could save him some time (assuming he has enough file handles, of course).
It depends. If you have ten machines, each with a single disk, that you use for indexing in parallel, and you copy all of the indexes to a single machine for the final merge, then you're probably better off optimizing each index before copying it and merging it with the others, in order to maximize the amount of work done in parallel, using all disk spindles.

However, if instead you have one machine with ten processors and a filesystem striped across ten disks, then, in theory, optimizing before merging might not help much, since the single-threaded final merge could use all ten disks at once. Even then, though, the final merge would be doing serially some CPU work that would have been done in parallel in the first configuration. In general, I think it's best to do as much work as possible in parallel.
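The tradeoff above can be sketched with a toy wall-clock model. All numbers here are illustrative assumptions, not measurements, and the model ignores copy time and I/O contention:

```python
# Toy wall-clock model of the two setups described above.
# All figures are illustrative assumptions, not measurements.

N = 10         # number of indexes built in parallel
OPT = 5.0      # assumed minutes of CPU work to optimize one index
MERGE = 8.0    # assumed minutes for the final merge itself

# Setup 1: ten machines, one disk each. Each machine optimizes its
# own index concurrently, so optimization adds only OPT (not N * OPT)
# to the wall-clock time before the serial final merge.
setup1 = OPT + MERGE

# Setup 2: one machine, striped filesystem, merging unoptimized
# indexes. The optimization-equivalent CPU work now happens serially
# inside the single-threaded final merge.
setup2 = MERGE + N * OPT

print(setup1)  # 13.0
print(setup2)  # 58.0
```

Even with generous assumptions about striped-disk throughput, the serialized CPU work in the second setup dominates as N grows, which is why doing the optimization in parallel tends to win.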
> What strategy do you use in "nutch"?
Nutch builds an optimized index for each fetched "segment" (n.b., a Nutch segment is different from a Lucene segment) and only merges the segment indexes as the final step before deploying them for searching. Nutch keeps a rolling set of active segments: the oldest are periodically discarded and replaced with newly fetched segments. Before a new set of segments is deployed, duplicate elimination must run; it marks duplicate documents as deleted prior to merging the new production indexes.
Doug
