[
https://issues.apache.org/jira/browse/LUCENE-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679350#comment-13679350
]
Uwe Schindler edited comment on LUCENE-5049 at 6/10/13 6:59 AM:
----------------------------------------------------------------
Hi Mike,
I agree with Robert and Jack - this is like comparing apples and pies. We are
back at the same place as 4 (!) years ago, when everybody added bulk APIs and
you posted a highly optimized special case with all l☃☃ps-unr☃lled™
(LUCENE-1594). On top of that, you use the specialized MMapDirectory, which is
really a lot faster, so a lot of the improvement also comes from there. Among
the customers I have seen, the "OR" case with pure term queries is not the most
common one (although in reality it should be, but users want "AND" - maybe
because our default scoring is bad - other story?).
I am completely against having this anywhere in Lucene, and the same goes for
NativeMMapDirectory (I am not happy with NativeLinux/WindowsDirectory either,
although they are so special that they have some reason to exist). I would
never suggest that anybody actually use this in production; it is too risky. If
you want to release this code, it's easy: create a Google Code project and do it
outside of Lucene. All interconnection points here go through reflection, so
it can be completely separate. I definitely will not post your results anywhere
on Twitter, because doing so would create another shitstorm against Lucene,
Java, HotSpot, and C++ - especially because the results here have nothing to do
with Java vs. C++ - it's just specialization, nothing more. As Robert said, you
can do the same with pure Java (see LUCENE-1594).
The only possible way to bring C code back into the game is to bring CLucene
back to life!
bq. Seriously, a second question: What about alternative JVM-based languages? I
mean, maybe Java does have excess baggage related to its quirky semantics, but
could the raw JVM support a lower-level implementation of BQ, without leaving
the JVM... "bubble"? OTOH, maybe different JVM's could have different
performance characteristics.
I don't see any change in performance here, as other JVM-based languages
produce the same bytecode as javac, just from different source code. Java
bytecode is flexible, but not too flexible. The optimizations are done by
HotSpot; the bytecode itself leaves little room for optimization, that's up to
the runtime engine.
The only thing I see is: we could use ASM or Javassist to create specialized
methods on the fly (like a just-in-time compiler). Instead of static,
Python-generated code residing in the JAR file, we would use a bytecode
generator that creates the packed int classes on the fly and loads them into
the JVM using a private child classloader. This could be done for other code,
too.
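To make the idea concrete, here is a minimal sketch of generating a decoder that is hard-wired to one bit width at runtime and loading it through a private child classloader. It is only an illustration under assumptions: it swaps ASM/Javassist for the JDK's built-in javax.tools compiler so it needs no extra library (and therefore needs a JDK at runtime, not just a JRE), and the names SpecializedDecoderGenerator, generate and PackedDecoderN are hypothetical, not anything that exists in Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.List;
import java.util.function.LongBinaryOperator;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

// Hypothetical helper, not part of Lucene.
final class SpecializedDecoderGenerator {

  /** Returns a decoder (block, index) -> value with the bit width baked in as a constant. */
  static LongBinaryOperator generate(int bits) {
    if (bits < 1 || bits > 63) throw new IllegalArgumentException("bits must be 1..63");
    final String name = "PackedDecoder" + bits;
    final long mask = (1L << bits) - 1;
    // Source of the specialized class; shift and mask are literals, not fields.
    final String src =
        "public final class " + name + " implements java.util.function.LongBinaryOperator {\n"
      + "  public long applyAsLong(long block, long index) {\n"
      + "    return (block >>> (index * " + bits + ")) & " + mask + "L;\n"
      + "  }\n"
      + "}\n";

    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    if (compiler == null) throw new IllegalStateException("no system Java compiler (JDK required)");

    // Keep both the source and the compiled class file in memory.
    final ByteArrayOutputStream bytecode = new ByteArrayOutputStream();
    JavaFileObject source = new SimpleJavaFileObject(
        URI.create("string:///" + name + ".java"), JavaFileObject.Kind.SOURCE) {
      @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return src; }
    };
    final JavaFileObject output = new SimpleJavaFileObject(
        URI.create("mem:///" + name + ".class"), JavaFileObject.Kind.CLASS) {
      @Override public OutputStream openOutputStream() { return bytecode; }
    };
    StandardJavaFileManager std = compiler.getStandardFileManager(null, null, null);
    JavaFileManager manager = new ForwardingJavaFileManager<StandardJavaFileManager>(std) {
      @Override public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
          String className, JavaFileObject.Kind kind, FileObject sibling) {
        return output;
      }
    };
    boolean ok = compiler.getTask(null, manager, null, null, null, List.of(source)).call();
    if (!ok) throw new IllegalStateException("compilation of " + name + " failed");

    // Private child classloader: it alone knows the generated class.
    final byte[] bytes = bytecode.toByteArray();
    ClassLoader child = new ClassLoader(SpecializedDecoderGenerator.class.getClassLoader()) {
      @Override protected Class<?> findClass(String n) throws ClassNotFoundException {
        if (!n.equals(name)) throw new ClassNotFoundException(n);
        return defineClass(n, bytes, 0, bytes.length);
      }
    };
    try {
      return (LongBinaryOperator) child.loadClass(name).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("could not load " + name, e);
    }
  }
}
```

A caller would obtain one instance per bit width, e.g. generate(4), and reuse it. With ASM or Javassist the compile step would be replaced by emitting the bytecode directly, but the classloader part would look much the same.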
> Native (C++) implementation of "pure OR" BooleanQuery
> -----------------------------------------------------
>
> Key: LUCENE-5049
> URL: https://issues.apache.org/jira/browse/LUCENE-5049
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5049.patch
>
>
> I've been playing with a C++ implementation of BooleanQuery containing
> only OR'd (SHOULD) TermQuery clauses, collecting top N hits by score.
> The results are impressive: ~3X speedup for BQ OR over two terms, and
> also good speedups (~38-78%) for Fuzzy1/2 as well since they rewrite
> to BQ OR over N terms:
> {noformat}
> Task            QPS base   StdDev QPS comp   StdDev Pct diff
> MedTerm            69.47  (15.8%)    68.61  (13.4%)    -1.2%  ( -26% - 33%)
> HighTerm           55.25  (16.2%)    54.63  (13.9%)    -1.1%  ( -26% - 34%)
> LowTerm           333.10   (9.6%)   329.43   (8.0%)    -1.1%  ( -17% - 18%)
> IntNRQ              3.37   (2.6%)     3.36   (4.6%)    -0.2%  ( -7% - 7%)
> Prefix3            18.91   (2.0%)    19.04   (3.5%)     0.7%  ( -4% - 6%)
> Wildcard           29.40   (1.7%)    29.70   (2.8%)     1.0%  ( -3% - 5%)
> MedPhrase         132.69   (6.2%)   134.66   (7.0%)     1.5%  ( -11% - 15%)
> HighSloppyPhrase    0.82   (3.6%)     0.83   (3.5%)     1.9%  ( -5% - 9%)
> AndHighHigh        19.65   (0.6%)    20.02   (0.8%)     1.9%  ( 0% - 3%)
> HighPhrase         11.74   (6.6%)    11.96   (7.1%)     1.9%  ( -11% - 16%)
> MedSloppyPhrase    29.09   (1.2%)    29.76   (1.9%)     2.3%  ( 0% - 5%)
> LowSloppyPhrase    25.71   (1.4%)    26.98   (1.7%)     4.9%  ( 1% - 8%)
> Respell           173.78   (3.0%)   182.41   (3.7%)     5.0%  ( -1% - 12%)
> MedSpanNear        27.67   (2.5%)    29.07   (2.4%)     5.1%  ( 0% - 10%)
> HighSpanNear        2.95   (2.4%)     3.10   (2.8%)     5.4%  ( 0% - 10%)
> LowSpanNear         8.29   (3.4%)     8.82   (3.3%)     6.4%  ( 0% - 13%)
> AndHighMed         79.32   (1.6%)    84.44   (1.0%)     6.5%  ( 3% - 9%)
> LowPhrase          23.20   (2.0%)    25.14   (1.6%)     8.4%  ( 4% - 12%)
> AndHighLow        594.17   (3.4%)   660.32   (1.9%)    11.1%  ( 5% - 16%)
> Fuzzy2             88.32   (6.4%)   121.44   (1.7%)    37.5%  ( 27% - 48%)
> Fuzzy1             86.34   (6.0%)   153.49   (1.7%)    77.8%  ( 66% - 90%)
> OrHighHigh         16.29   (2.5%)    48.29   (1.3%)   196.5%  ( 188% - 205%)
> OrHighMed          28.98   (2.7%)    87.81   (0.9%)   203.0%  ( 194% - 212%)
> OrHighLow          27.38   (2.6%)    84.94   (1.1%)   210.3%  ( 201% - 219%)
> {noformat}
> This is essentially a scaled back attempt at LUCENE-1594 in that it's
> "hardwired" to "just" the "OR of TermQuery" case.
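For readers unfamiliar with the case being specialized, the following is a rough, self-contained sketch of what "OR of term clauses, collecting top N hits by score" boils down to. It is not the code from the patch and not Lucene's BooleanScorer: it assumes a simplified term-at-a-time accumulation over hypothetical in-memory postings (the OrTopN and Postings names are made up for illustration), with per-document scores already computed.

```java
import java.util.PriorityQueue;

// Hypothetical illustration, not Lucene code and not the attached patch.
final class OrTopN {

  /** Assumed postings for one term: parallel arrays of doc IDs and precomputed scores. */
  static final class Postings {
    final int[] docs;
    final float[] scores;
    Postings(int[] docs, float[] scores) {
      this.docs = docs;
      this.scores = scores;
    }
  }

  /** Returns up to n doc IDs matching any term, best summed score first. */
  static int[] topDocs(Postings[] terms, int maxDoc, int n) {
    final float[] acc = new float[maxDoc];   // summed score per document
    boolean[] hit = new boolean[maxDoc];     // true if at least one term matched
    for (Postings p : terms) {
      for (int i = 0; i < p.docs.length; i++) {
        acc[p.docs[i]] += p.scores[i];
        hit[p.docs[i]] = true;
      }
    }
    // Min-heap: the weakest of the current top n sits on top.
    // On equal scores the higher doc ID counts as weaker.
    PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) -> {
      int c = Float.compare(acc[a], acc[b]);
      return c != 0 ? c : Integer.compare(b, a);
    });
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!hit[doc]) continue;
      heap.add(doc);
      if (heap.size() > n) heap.poll();      // evict the weakest hit
    }
    // Drain weakest-first into the array back to front.
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll();
    }
    return result;
  }
}
```

The inner loop is just an array add per posting, which is the kind of tight, branch-light code that benefits from hard-wiring to a single case, whether that is done in C++ or in specialized Java.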