I got inspired by Uwe Schindler's talk "Is your index reader really atomic or maybe slow?"<http://www.lucenerevolution.org/sites/default/files/Schindler%20-%20IsYourIndexReaderReallyAtomicOrMaybeSlow_0.pdf>to look into the state of parallelism when searching across segments. It seems there has been some effort to do this but we never went all the way.
IndexSearcher has an optional constructor arg for an ExecutorService, which gets used for searching in parallel for call paths where one of the TopDocCollector's is created internally. The per-segment search happens in parallel and then the TopDocs/TopFieldDocs results are merged with locking around the merge bit. What's the upside when performing parallel search? I benchmarked using luceneutil on WIKI_MEDIUM_10M with trunk vs trunk with IndexSearcher constructor hacked to initialize a ForkJoinPool<https://gist.github.com/anonymous/7048089> . Report after iter 19: TaskQPS baseline StdDev QPS patched StdDev Pct diff Respell 63.05 (3.0%) 46.37 (2.9%) -26.5% ( -31% - -21%) PKLookup 255.80 (2.6%) 194.11 (7.4%) -24.1% ( -33% - -14%) Fuzzy2 54.66 (3.8%) 42.28 (3.5%) -22.7% ( -28% - -16%) Fuzzy1 75.35 (2.9%) 59.23 (4.2%) -21.4% ( -27% - -14%) MedSloppyPhrase 141.74 (3.6%) 124.56 (9.7%) -12.1% ( -24% - 1%) AndHighLow 922.70 (3.1%) 908.18 (10.0%) -1.6% ( -14% - 11%) LowSloppyPhrase 88.50 (4.2%) 106.64 (13.0%) 20.5% ( 3% - 39%) LowTerm 694.97 (2.7%) 949.30 (21.6%) 36.6% ( 11% - 62%) MedSpanNear 102.47 (3.0%) 160.94 (12.7%) 57.1% ( 40% - 75%) OrNotHighLow 98.58 (4.9%) 155.22 (18.1%) 57.5% ( 32% - 84%) AndHighMed 260.31 (1.7%) 448.19 (23.5%) 72.2% ( 46% - 99%) OrHighLow 101.68 (6.7%) 179.91 (27.8%) 76.9% ( 39% - 119%) MedTerm 200.49 (3.1%) 365.70 (32.8%) 82.4% ( 45% - 122%) HighTerm 145.36 (3.3%) 268.51 (27.7%) 84.7% ( 51% - 119%) Prefix3 81.59 (2.8%) 153.52 (24.8%) 88.2% ( 58% - 119%) OrHighNotLow 50.67 (6.9%) 95.91 (22.5%) 89.3% ( 56% - 127%) OrHighMed 93.37 (6.3%) 182.37 (24.3%) 95.3% ( 60% - 134%) OrHighNotMed 78.19 (6.7%) 153.30 (30.7%) 96.1% ( 54% - 143%) OrHighHigh 46.59 (7.0%) 92.33 (31.7%) 98.2% ( 55% - 147%) OrHighNotHigh 51.01 (5.7%) 105.32 (25.1%) 106.5% ( 71% - 145%) Wildcard 78.53 (3.7%) 168.42 (40.8%) 114.5% ( 67% - 164%) OrNotHighHigh 40.42 (5.5%) 93.45 (29.5%) 131.2% ( 91% - 175%) OrNotHighMed 24.40 (4.8%) 57.00 (22.0%) 133.6% ( 102% - 168%) HighSpanNear 4.13 (3.6%) 10.09 (10.5%) 144.3% ( 125% - 164%) LowSpanNear 14.27 (2.3%) 35.24 (17.0%) 147.0% ( 124% - 170%) IntNRQ 8.73 (4.8%) 21.69 (23.6%) 148.6% ( 114% - 185%) LowPhrase 29.11 (2.9%) 72.87 (22.4%) 150.3% ( 121% - 180%) MedPhrase 33.19 (3.4%) 83.38 (40.4%) 151.3% ( 103% - 201%) HighSloppyPhrase 6.46 (5.8%) 16.37 (21.9%) 153.5% ( 119% - 192%) HighPhrase 10.00 (6.8%) 25.66 (14.4%) 156.7% ( 126% - 190%) AndHighHigh 44.82 (1.1%) 115.77 (39.6%) 158.3% ( 116% - 201%) That looks pretty awesome! Any feedback from existing users? * * *Problems* - (relatively minor) Solr can't take advantage of this yet. SolrIndexSearcher's disregard the possibility of propagating an ExecutorService up to IndexSearcher constructor. - Even if it did, as far as I can tell Solr does not make use of IndexSearcher's search() variants that create TopDocCollector's internally and parallelize in the presence of the executor. - If arbitary Collector args come into play, we don't/can't parallelize. Note that even if ultimately results are going to a TopDocCollector it may be wrapped inside e.g. a EarlyTerminatingCollector or TimeLimitingCollector or both. - The special-casing with parallelism layered on top as IndexSearcher does with TopDocCollector's does not scale, there are lots* *of Collector's that could be potentially lend themselves to parallelism. *Idea* * * I have been exploring a refactoring of Collector's that allows for parallelization at the level of the collection protocol. Some of the design decisions: - easy migration path for collectors that want to remain serial - the parallelization should be composable (when Collector's wrap other Collector's) - allow collector's to pick the optimal solution (e.g. there might be memory tradeoffs to be made) by advising the collector about whether a search will be parallelized - encourage use of lock-free parallelism by providing for a segment-level done() method that is guaranteed to execute in a single-threaded manner (so if you were to accumulate some state at the segment-level, it can be safely merged with the composite state without use of locking) The code is currently in a fork on github <https://github.com/shikhar/lucene-solr/compare/apache:trunk...trunk>. - Collector goes from being an abstract class to this interface <https://github.com/shikhar/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/Collector.java>. A new interface SubCollector <https://github.com/shikhar/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/SubCollector.java> is introduced. - The existing subclasses of Collector can choose to either implement the new interface, or subclass a SerialCollector <https://github.com/shikhar/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/SerialCollector.java> abstract class which maintains essentially the same API as before. - There is javadoc on the new interfaces, but the new collection protocol is probably clearer in connection with the usage from IndexSearcher <https://github.com/shikhar/lucene-solr/blob/b3098fc/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L571-L636>. - I tried to convert at least all of the collectors that wrap other collectors over to the new interface, rather than simply extending SerialCollector (also introduced a new WrappingCollector <https://github.com/shikhar/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/WrappingCollector.java> base class), so that the wrappers themselves aren't blockers from a search being parallelizable. There are probably a few cases I have missed. - An example of a collector that is very easily parallelizable in the divide-and-conquer style: TotalHitCountCollector <https://github.com/shikhar/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java> - An update of TopScoreDocCollector to being a parallelizable Collector <https://github.com/shikhar/lucene-solr/commit/b3098fcfdeb302481e93aab93eb77d0cabc72f9b?w=1>, and corresponding removal of the special-cased parallelism inside IndexSearcher. Benchmark results <https://gist.github.com/shikhar/7062026> *Next steps* * * I would love to get your feedback on the idea and implementation, and how this could be incorporated into trunk. Code review would be much appreciated. Some follow-up work that needs to be done (help needed!): - Making TopFieldCollector a parallelizable Collector, like was done for TopScoreDocCollector, and removal of the special-casing inside IndexSearcher. - Solr support -- perhaps making the Executor to be used by SolrIndexSearcher configurable from solrconfig.xml. - Making more collector's parallelizable - lots of opportunities here. I guess this will be an ongoing process. Looking forward to your thoughts! Thanks, Shikhar Bhushan