Hello,

Over the last few weeks I've been working on upgrading an application from Lucene 3.x to Lucene 4.x in hopes of improving performance. Unfortunately, after going through the full migration process and trying all sorts of tweaks I found online and in the documentation, Lucene 4 is running significantly slower than Lucene 3 (~50% slower). I'm pretty much out of ideas at this point and was wondering if anyone has suggestions on how to close the gap. I'm not even looking for a big improvement over 3.x anymore; I'd be happy just to match it and stay on a current release of Lucene.
A bit of information on the system:
- We have one index broken into 20 shards - this provided the best performance in both Lucene 3.x and Lucene 4.x.
- The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized, so there are a lot of duplicate tokens. Only one field (an ID) is stored - the others are not retrievable.
- We have a fixed set of relatively simple queries that are populated with user input and executed - they are composed of multiple BooleanQueries, TermQueries, and TermRangeQueries. Some of them are nested, but only a single level right now.
- We're not doing anything too advanced with results aside from iterating through the scores and getting the ID fields.
- We're using MMapDirectories pointing to index files in a tmpfs.
- Our test machines have 94GB of RAM and 64 logical cores.

General flow:
1) Request received by socket listener
2) Up to 4 Query objects are generated and populated with normalized user input (all of the required input for a query must be present or it won't be executed)
3) Queries are executed in parallel using the Fork/Join framework
3a) Subqueries to each shard are executed in parallel using the IndexSearcher with an ExecutorService
4) Aggregation and other simple post-processing

Other relevant info:
- Indexes were recreated for the 4.x system, but the data is the same.
- We tried the normal Lucene42 codec as well as an extended one that didn't use compression (per a suggestion on the web).
- In 3.x we used a modified version of the ParallelMultiSearcher; in 4.x we're using the IndexSearcher with an ExecutorService and combining all of our readers in a MultiReader.
- In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests).

4.x Hot Spots:

Method | Self Time (%) | Self Time (ms) | Self Time (CPU, ms)
java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0  <- this is just from TCP threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886

If there's any other information you could use, please let me know.

Thanks,
Matt
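P.S. To make the dispatch in steps 2-3 of the flow concrete, here is a minimal pure-JDK sketch of how we fan out: up to 4 query tasks on a ForkJoinPool, each fanning subtasks out to the 20 shards and merging the ID lists. The searchShard() stub and its return values are placeholders for illustration only - in the real application that call is a Lucene IndexSearcher search against one shard's reader (or, equivalently, a single IndexSearcher over a MultiReader constructed with the ExecutorService, as described above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class ParallelSearch {
    static final int SHARDS = 20;

    // Stand-in for the real per-shard Lucene search (hypothetical stub):
    // returns the stored ID field of each hit.
    static List<String> searchShard(String query, int shard) {
        List<String> ids = new ArrayList<String>();
        ids.add(query + "-shard" + shard);
        return ids;
    }

    // One top-level query: fan out to every shard in parallel, then
    // merge the per-shard ID lists (step 3a + step 4 aggregation).
    static List<String> runQuery(ForkJoinPool pool, final String query) throws Exception {
        List<Callable<List<String>>> subtasks = new ArrayList<Callable<List<String>>>();
        for (int s = 0; s < SHARDS; s++) {
            final int shard = s;
            subtasks.add(new Callable<List<String>>() {
                public List<String> call() { return searchShard(query, shard); }
            });
        }
        List<String> merged = new ArrayList<String>();
        // invokeAll blocks until every shard subtask has completed.
        for (Future<List<String>> f : pool.invokeAll(subtasks)) {
            merged.addAll(f.get());
        }
        return merged;
    }

    public static void main(String[] args) throws Exception {
        final ForkJoinPool pool = new ForkJoinPool();
        // Step 2/3: up to 4 populated Query objects, executed in parallel.
        List<Future<List<String>>> queries = new ArrayList<Future<List<String>>>();
        for (final String q : new String[] { "q1", "q2", "q3", "q4" }) {
            queries.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() throws Exception { return runQuery(pool, q); }
            }));
        }
        int total = 0;
        for (Future<List<String>> f : queries) total += f.get().size();
        System.out.println(total); // prints 80 (4 queries x 20 shards)
        pool.shutdown();
    }
}
```

Nesting the per-shard fan-out inside ForkJoinPool tasks is safe here because the futures returned by invokeAll are ForkJoinTasks, so a worker thread that blocks on them can steal and help run pending subtasks instead of deadlocking.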