[jira] [Comment Edited] (LUCENE-5049) Native (C++) implementation of "pure OR" BooleanQuery

Uwe Schindler (JIRA) Mon, 10 Jun 2013 00:01:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679350#comment-13679350
 ]


Uwe Schindler edited comment on LUCENE-5049 at 6/10/13 6:59 AM:
----------------------------------------------------------------

Hi Mike,

I agree with Robert and Jack - this is like comparing apples and pies. We are 
back at the same place as 4 (!) years ago, when everybody added bulk APIs and 
you posted a highly optimized special case with all l☃☃ps-unr☃lled™ 
(LUCENE-1594). You also use the specialized MMapDirectory, which is really a 
lot faster, so a lot of the improvement comes from there. From most customers 
I have seen, the "OR" case with pure term queries is not the most common one 
(although in reality it should be, but users want "AND" - maybe because our 
default scoring is bad - other story?).

I am completely against having this anywhere in Lucene, same for 
NativeMMapDirectory (and I am not happy with NativeLinux/WindowsDirectory 
either - although they are so special that they have some reason to exist). I 
would never recommend that anybody actually use this in production; it is too 
risky. If you want to release this code, it's easy: Create a Google Code 
project and do it outside of Lucene. All interconnection points here are 
through reflection, so it can be completely separate. I definitely will not 
post your results anywhere on Twitter, because doing this would create another 
shitstorm against Lucene, Java, Hotspot, and C++ - especially because the 
results here have nothing to do with Java vs. C++ - it's just specialization, 
nothing more. As Robert said, you can do the same with pure Java (see 
LUCENE-1594).
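
As a rough illustration of why reflection-only connection points allow a fully 
separate project, here is a minimal Java sketch. It is not code from the 
patch: the class name org.example.nativesearch.NativeOrSearch, its static 
search(float[]) method and the summing fallback are all hypothetical, and the 
patch may well reflect in the other direction (from the add-on into Lucene).

```java
import java.lang.reflect.Method;

public class OptionalNativeHook {

    // Hypothetical name of an out-of-tree accelerator class; NOT a real Lucene class.
    static final String NATIVE_IMPL = "org.example.nativesearch.NativeOrSearch";

    // Stand-in for the regular pure-Java code path: here it just sums the inputs.
    static float javaFallback(float[] scores) {
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    }

    // Looks the accelerator up by name at runtime, so nothing depends on it at compile time.
    static float search(float[] scores) {
        try {
            Class<?> impl = Class.forName(NATIVE_IMPL);
            Method m = impl.getMethod("search", float[].class); // assumed static method
            return (Float) m.invoke(null, (Object) scores);
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class, method or native library is missing: stay on the Java path.
            return javaFallback(scores);
        }
    }

    public static void main(String[] args) {
        System.out.println(search(new float[] {1f, 2f, 3f}));
    }
}
```

Because the only link is a class name resolved at runtime, neither side needs 
the other on its compile classpath, which is what makes hosting the code in a 
separate project practical.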

The only possible way to bring C code back into the game is to bring CLucene 
back to life!

bq. Seriously, a second question: What about alternative JVM-based languages? I 
mean, maybe Java does have excess baggage related to its quirky semantics, but 
could the raw JVM support a lower-level implementation of BQ, without leaving 
the JVM... "bubble"? OTOH, maybe different JVM's could have different 
performance characteristics.

I don't see any change in performance here, as other JVM-based languages 
produce the same kind of bytecode as javac, just from different source code. 
Java bytecode is flexible, but not too flexible. The optimizations are done by 
Hotspot; the bytecode itself leaves not much room for optimization, that's up 
to the runtime engine.

The only thing I see is: We use ASM or Javassist to create specialized methods 
on-the-fly (like a just-in-time compiler). Instead of static, Python-generated 
code residing in the JAR file, we use a bytecode generator that creates 
the packed int classes on the fly and loads them into the JVM using a private 
child classloader. This could be done for other code, too.
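
To make "specialized methods" concrete, the following hand-written Java sketch 
puts a generic packed-int decoder next to one fixed to 4 bits per value, which 
is the kind of method a generator (static or on-the-fly) might emit. The class 
name, method names and the low-bits-first layout are assumptions for 
illustration; this is not Lucene's PackedInts API, and the ASM/Javassist 
generation and classloader steps are not shown.

```java
import java.util.Arrays;

public class PackedDecodeSketch {

    // Packs values low-bits-first into 64-bit blocks (layout chosen for this sketch only).
    static long[] encode(int[] values, int bitsPerValue) {
        long[] blocks = new long[(int) (((long) values.length * bitsPerValue + 63) >>> 6)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i] & ((1L << bitsPerValue) - 1);
            blocks[block] |= v << shift;
            if (shift + bitsPerValue > 64) {
                blocks[block + 1] |= v >>> (64 - shift); // spill into the next block
            }
        }
        return blocks;
    }

    // Generic decoder: handles 1..32 bits per value, but computes block index,
    // shift and the block-boundary check at runtime for every single value.
    static void decodeGeneric(long[] blocks, int bitsPerValue, int[] out) {
        final long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < out.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = blocks[block] >>> shift;
            if (shift + bitsPerValue > 64) {
                value |= blocks[block + 1] << (64 - shift); // value straddles two blocks
            }
            out[i] = (int) (value & mask);
        }
    }

    // Specialized decoder for exactly 4 bits per value: constant mask, constant
    // shift step, and no straddling, since 16 values fit one block exactly.
    static void decode4(long[] blocks, int[] out) {
        int o = 0;
        for (long block : blocks) {
            for (int shift = 0; shift < 64 && o < out.length; shift += 4) {
                out[o++] = (int) ((block >>> shift) & 0xF);
            }
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 15, 0, 7, 9, 3, 12, 5, 8, 2, 14, 6, 11, 4, 13, 10, 15, 0, 1, 2};
        long[] blocks = encode(values, 4);
        int[] generic = new int[values.length];
        int[] special = new int[values.length];
        decodeGeneric(blocks, 4, generic);
        decode4(blocks, special);
        System.out.println(Arrays.equals(generic, special));
    }
}
```

Both decoders return the same values; the specialized one simply has fewer 
runtime decisions left for Hotspot to optimize away.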
                
> Native (C++) implementation of "pure OR" BooleanQuery
> -----------------------------------------------------
>
>                 Key: LUCENE-5049
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5049
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5049.patch
>
>
> I've been playing with a C++ implementation of BooleanQuery containing
> only OR'd (SHOULD) TermQuery clauses, collecting top N hits by score.
> The results are impressive: ~3X speedup for BQ OR over two terms, and
> also good speedups (~38-78%) for Fuzzy1/2 as well since they rewrite
> to BQ OR over N terms:
> {noformat}
>                     Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>                  MedTerm       69.47     (15.8%)       68.61     (13.4%)   -1.2% ( -26% -   33%)
>                 HighTerm       55.25     (16.2%)       54.63     (13.9%)   -1.1% ( -26% -   34%)
>                  LowTerm      333.10      (9.6%)      329.43      (8.0%)   -1.1% ( -17% -   18%)
>                   IntNRQ        3.37      (2.6%)        3.36      (4.6%)   -0.2% (  -7% -    7%)
>                  Prefix3       18.91      (2.0%)       19.04      (3.5%)    0.7% (  -4% -    6%)
>                 Wildcard       29.40      (1.7%)       29.70      (2.8%)    1.0% (  -3% -    5%)
>                MedPhrase      132.69      (6.2%)      134.66      (7.0%)    1.5% ( -11% -   15%)
>         HighSloppyPhrase        0.82      (3.6%)        0.83      (3.5%)    1.9% (  -5% -    9%)
>              AndHighHigh       19.65      (0.6%)       20.02      (0.8%)    1.9% (   0% -    3%)
>               HighPhrase       11.74      (6.6%)       11.96      (7.1%)    1.9% ( -11% -   16%)
>          MedSloppyPhrase       29.09      (1.2%)       29.76      (1.9%)    2.3% (   0% -    5%)
>          LowSloppyPhrase       25.71      (1.4%)       26.98      (1.7%)    4.9% (   1% -    8%)
>                  Respell      173.78      (3.0%)      182.41      (3.7%)    5.0% (  -1% -   12%)
>              MedSpanNear       27.67      (2.5%)       29.07      (2.4%)    5.1% (   0% -   10%)
>             HighSpanNear        2.95      (2.4%)        3.10      (2.8%)    5.4% (   0% -   10%)
>              LowSpanNear        8.29      (3.4%)        8.82      (3.3%)    6.4% (   0% -   13%)
>               AndHighMed       79.32      (1.6%)       84.44      (1.0%)    6.5% (   3% -    9%)
>                LowPhrase       23.20      (2.0%)       25.14      (1.6%)    8.4% (   4% -   12%)
>               AndHighLow      594.17      (3.4%)      660.32      (1.9%)   11.1% (   5% -   16%)
>                   Fuzzy2       88.32      (6.4%)      121.44      (1.7%)   37.5% (  27% -   48%)
>                   Fuzzy1       86.34      (6.0%)      153.49      (1.7%)   77.8% (  66% -   90%)
>               OrHighHigh       16.29      (2.5%)       48.29      (1.3%)  196.5% ( 188% -  205%)
>                OrHighMed       28.98      (2.7%)       87.81      (0.9%)  203.0% ( 194% -  212%)
>                OrHighLow       27.38      (2.6%)       84.94      (1.1%)  210.3% ( 201% -  219%)
> {noformat}
> This is essentially a scaled back attempt at LUCENE-1594 in that it's
> "hardwired" to "just" the "OR of TermQuery" case.

