[ https://issues.apache.org/jira/browse/LUCENE-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679350#comment-13679350 ]
Uwe Schindler commented on LUCENE-5049:
---------------------------------------

Hi Mike,

I agree with Robert and Jack - this is like comparing apples and pies. We are back at the same place as 4 (!) years ago, when everybody added bulk APIs and you posted a highly optimized special case with all l☃☃ps-unr☃lled™ (LUCENE-1594).

This is comparing apples with pies: You use the specialized MMapDirectory, which is really a lot faster, so a lot of the improvements also come from there. From most customers I have seen, the "OR" case with pure term queries is not the most common one (although it should be in reality, but users want "and" - maybe because our default scoring is bad - other story?).

I am completely against the idea of having this anywhere in Lucene, same for NativeMMapDirectory (and I am not happy with NativeLinux/WindowsDirectory, either - although they are so special that they have some reason to exist). I would never advise anybody to actually use this in production; it is too risky.

If you want to release this code, it's easy: create a Google Code project and do it outside of Lucene. All interconnection points here are through reflection, so it can be completely separate.

I definitely will not post your results anywhere on Twitter, because doing this would create another shitstorm against Lucene, Java, HotSpot, and C++ - especially because the results here have nothing to do with Java vs. C++ - it's just specialization, nothing more. As Robert said, you can do the same with pure Java (see LUCENE-1594).

The only possible way to bring C code back into the game is to bring CLucene back to life!

bq. Seriously, a second question: What about alternative JVM-based languages? I mean, maybe Java does have excess baggage related to its quirky semantics, but could the raw JVM support a lower-level implementation of BQ, without leaving the JVM... "bubble"? OTOH, maybe different JVM's could have different performance characteristics.

I don't see any change in performance here, as other JVM-based languages produce the same bytecode as javac, just from different source code. Java bytecode is flexible but not too flexible. The optimizations are done by HotSpot, and the bytecode itself leaves little room for optimization; that's up to the runtime engine.

The only option I see is that we use ASM or Javassist to create specialized methods on-the-fly (like a just-in-time compiler). Instead of Python code that is residing in the JAR file, we use a bytecode generator that creates the packed int classes on the fly and loads them into the JVM using a private child classloader. This can do other code, too.

> Native (C++) implementation of "pure OR" BooleanQuery
> -----------------------------------------------------
>
>                 Key: LUCENE-5049
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5049
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5049.patch
>
>
> I've been playing with a C++ implementation of BooleanQuery containing
> only OR'd (SHOULD) TermQuery clauses, collecting top N hits by score.
> The results are impressive: ~3X speedup for BQ OR over two terms, and
> also good speedups (~38-78%) for Fuzzy1/2 as well since they rewrite
> to BQ OR over N terms:
> {noformat}
>             Task   QPS base   StdDev   QPS comp   StdDev  Pct diff
>          MedTerm      69.47  (15.8%)      68.61  (13.4%)    -1.2% ( -26% -  33%)
>         HighTerm      55.25  (16.2%)      54.63  (13.9%)    -1.1% ( -26% -  34%)
>          LowTerm     333.10   (9.6%)     329.43   (8.0%)    -1.1% ( -17% -  18%)
>           IntNRQ       3.37   (2.6%)       3.36   (4.6%)    -0.2% (  -7% -   7%)
>          Prefix3      18.91   (2.0%)      19.04   (3.5%)     0.7% (  -4% -   6%)
>         Wildcard      29.40   (1.7%)      29.70   (2.8%)     1.0% (  -3% -   5%)
>        MedPhrase     132.69   (6.2%)     134.66   (7.0%)     1.5% ( -11% -  15%)
> HighSloppyPhrase       0.82   (3.6%)       0.83   (3.5%)     1.9% (  -5% -   9%)
>      AndHighHigh      19.65   (0.6%)      20.02   (0.8%)     1.9% (   0% -   3%)
>       HighPhrase      11.74   (6.6%)      11.96   (7.1%)     1.9% ( -11% -  16%)
>  MedSloppyPhrase      29.09   (1.2%)      29.76   (1.9%)     2.3% (   0% -   5%)
>  LowSloppyPhrase      25.71   (1.4%)      26.98   (1.7%)     4.9% (   1% -   8%)
>          Respell     173.78   (3.0%)     182.41   (3.7%)     5.0% (  -1% -  12%)
>      MedSpanNear      27.67   (2.5%)      29.07   (2.4%)     5.1% (   0% -  10%)
>     HighSpanNear       2.95   (2.4%)       3.10   (2.8%)     5.4% (   0% -  10%)
>      LowSpanNear       8.29   (3.4%)       8.82   (3.3%)     6.4% (   0% -  13%)
>       AndHighMed      79.32   (1.6%)      84.44   (1.0%)     6.5% (   3% -   9%)
>        LowPhrase      23.20   (2.0%)      25.14   (1.6%)     8.4% (   4% -  12%)
>       AndHighLow     594.17   (3.4%)     660.32   (1.9%)    11.1% (   5% -  16%)
>           Fuzzy2      88.32   (6.4%)     121.44   (1.7%)    37.5% (  27% -  48%)
>           Fuzzy1      86.34   (6.0%)     153.49   (1.7%)    77.8% (  66% -  90%)
>       OrHighHigh      16.29   (2.5%)      48.29   (1.3%)   196.5% ( 188% - 205%)
>        OrHighMed      28.98   (2.7%)      87.81   (0.9%)   203.0% ( 194% - 212%)
>        OrHighLow      27.38   (2.6%)      84.94   (1.1%)   210.3% ( 201% - 219%)
> {noformat}
> This is essentially a scaled back attempt at LUCENE-1594 in that it's
> "hardwired" to "just" the "OR of TermQuery" case.

--
This message is automatically generated by JIRA.