Hello,

Over the last few weeks I've been working on upgrading an application from Lucene 3.x to Lucene 4.x in hopes of improving performance. Unfortunately, after going through the full migration process and trying all sorts of tweaks I found online and in the documentation, Lucene 4 is running significantly slower than Lucene 3 (~50% slower). I'm pretty much out of ideas at this point and was wondering if anyone has suggestions on how to close the gap. I'm not even looking for a big improvement over 3.x anymore; I'd be happy just to match it and stay on a current release of Lucene.
A bit of information on the system:
- We have one index broken into 20 shards - this provided the best performance in both Lucene 3.x and Lucene 4.x.
- The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized, so there are a lot of duplicate tokens. Only one field (an ID) is stored - the others are not retrievable.
- We have a fixed set of relatively simple queries that are populated with user input and executed - they are composed of multiple BooleanQueries, TermQueries, and TermRangeQueries. Some of them are nested, but only a single level right now.
- We're not doing anything too advanced with results aside from iterating through the scores and getting the ID fields.
- We're using MMapDirectories pointing to index files in a tmpfs.
- Our test machines have 94GB of RAM and 64 logical cores.

General flow:
1) Request received by socket listener
2) Up to 4 Query objects are generated and populated with normalized user input (all of the required input for a query must be present or it won't be executed)
3) Queries are executed in parallel using the Fork/Join framework
3a) Subqueries to each shard are executed in parallel using the IndexSearcher with an ExecutorService
4) Aggregation and other simple post-processing

Other relevant info:
- Indexes were recreated for the 4.x system, but the data is the same.
- We tried the normal Lucene42 codec as well as an extended one that didn't use compression (per a suggestion on the web).
- In 3.x we used a modified version of the ParallelMultiSearcher; in 4.x we're using the IndexSearcher with an ExecutorService and combining all of our readers in a MultiReader.
- In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests).

4.x Hot Spots:

Method | Self Time (%) | Self Time (ms) | Self Time (CPU, ms)
java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0  <- this is just from TCP threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886

If there's any other information you could use, please let me know.

Thanks,
Matt
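P.S. To make the dispatch in steps 2-3 of the flow concrete, here is a minimal pure-JDK sketch of how we fan out: up to 4 query tasks on a ForkJoinPool, each fanning subtasks out to the 20 shards and merging the ID lists. The searchShard() stub and its return values are placeholders for illustration only - in the real application that call is a Lucene IndexSearcher search against one shard's reader (or, equivalently, a single IndexSearcher over a MultiReader constructed with the ExecutorService, as described above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class ParallelSearch {
    static final int SHARDS = 20;

    // Stand-in for the real per-shard Lucene search (hypothetical stub):
    // returns the stored ID field of each hit.
    static List<String> searchShard(String query, int shard) {
        List<String> ids = new ArrayList<String>();
        ids.add(query + "-shard" + shard);
        return ids;
    }

    // One top-level query: fan out to every shard in parallel, then
    // merge the per-shard ID lists (step 3a + step 4 aggregation).
    static List<String> runQuery(ForkJoinPool pool, final String query) throws Exception {
        List<Callable<List<String>>> subtasks = new ArrayList<Callable<List<String>>>();
        for (int s = 0; s < SHARDS; s++) {
            final int shard = s;
            subtasks.add(new Callable<List<String>>() {
                public List<String> call() { return searchShard(query, shard); }
            });
        }
        List<String> merged = new ArrayList<String>();
        // invokeAll blocks until every shard subtask has completed.
        for (Future<List<String>> f : pool.invokeAll(subtasks)) {
            merged.addAll(f.get());
        }
        return merged;
    }

    public static void main(String[] args) throws Exception {
        final ForkJoinPool pool = new ForkJoinPool();
        // Step 2/3: up to 4 populated Query objects, executed in parallel.
        List<Future<List<String>>> queries = new ArrayList<Future<List<String>>>();
        for (final String q : new String[] { "q1", "q2", "q3", "q4" }) {
            queries.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() throws Exception { return runQuery(pool, q); }
            }));
        }
        int total = 0;
        for (Future<List<String>> f : queries) total += f.get().size();
        System.out.println(total); // prints 80 (4 queries x 20 shards)
        pool.shutdown();
    }
}
```

Nesting the per-shard fan-out inside ForkJoinPool tasks is safe here because the futures returned by invokeAll are ForkJoinTasks, so a worker thread that blocks on them can steal and help run pending subtasks instead of deadlocking.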