I recently migrated a legacy Lucene application from 2.3 to 3.5. The code is filled with custom filters, analyzers, similarities, and collectors. It took about a week to convert all the token streams to the new API and remove the deprecated classes. Most importantly, there is a collector that enables faceting, which I suspect was taken from Solr (I have never looked at the Solr source code).
The index is built as a batch process with no searchers using it. It contains 30+ million documents, for a total size of around 45 GB. The bulk of the indexing time is spent in database calls. The build time using Lucene 2.3 was around 10 hours.

The code has a collector similar to TimeLimitingCollector (sadly, there is a ton of custom-built code) which collects documents until it reaches a limit. Given the way the index is currently created, it is essential that the most important documents (based on business rules) sit at the beginning of the index (insertion order), to ensure that they appear even if the collector times out.

The first issue we noticed is that this distribution (which I admit is a hack) is no longer "correct" with the default TieredMergePolicy. We switched the merge policy back to the previous setup of LogByteSizeMergePolicy with a merge factor of 2. I am assuming the low merge factor is responsible for creating indices that respect the insertion order of documents.

Documents are now in the correct order, but an optimize (aka forceMerge(1)) takes around 5 hours, where previously there was no slowdown. If we remove the forceMerge, the commit takes just as long instead.

It is difficult to iterate through different settings, since waiting 14-15 hours between tests to see the results is too long. What is the best way to create an optimized index that preserves insertion order, so that the documents inserted first sit at the beginning of the index? The real answer should be to write better queries, but none of the authors of this jumbled legacy code base are around, and we want to avoid rocking the boat on the query side since the existing search results are satisfactory.

Cheers,

Ivan