[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552184#comment-13552184 ]

Michael McCandless commented on LUCENE-4682:
--------------------------------------------

OK I ran several performance tests, comparing trunk (lots of waste),
"up to 25% waste", and "0 waste" (no array arcs).

First, I tested luceneutil on all of English Wikipedia, on an optimized index:

Base = trunk (lots of waste), comp = up to 25% waste:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev        Pct diff
                 Respell      167.37      (3.3%)      128.36      (3.2%)  -23.3% ( -28% -  -17%)
                  Fuzzy2       83.54      (8.0%)       74.17      (7.2%)  -11.2% ( -24% -    4%)
                PKLookup      359.62      (2.9%)      346.88      (4.5%)   -3.5% ( -10% -    4%)
                Wildcard       28.48      (4.3%)       28.25      (3.9%)   -0.8% (  -8% -    7%)
                  Fuzzy1       82.77      (7.1%)       82.36      (7.7%)   -0.5% ( -14% -   15%)
        HighSloppyPhrase        0.78      (4.6%)        0.78      (4.5%)   -0.4% (  -9% -    9%)
                  IntNRQ        3.38      (8.1%)        3.37      (8.1%)   -0.2% ( -15% -   17%)
         MedSloppyPhrase       28.13      (2.4%)       28.13      (2.4%)   -0.0% (  -4% -    4%)
         LowSloppyPhrase       25.10      (1.8%)       25.11      (1.8%)    0.0% (  -3% -    3%)
                 Prefix3       18.40      (4.8%)       18.41      (5.3%)    0.1% (  -9% -   10%)
                 MedTerm       65.42     (18.2%)       65.46     (17.7%)    0.1% ( -30% -   43%)
                 LowTerm      316.34      (6.9%)      317.45      (7.1%)    0.4% ( -12% -   15%)
               LowPhrase       23.30      (2.1%)       23.39      (1.8%)    0.4% (  -3% -    4%)
                HighTerm       52.11     (18.6%)       52.33     (18.5%)    0.4% ( -30% -   46%)
               OrHighMed       25.65      (6.7%)       25.76      (7.2%)    0.4% ( -12% -   15%)
             AndHighHigh       19.42      (1.8%)       19.52      (2.0%)    0.5% (  -3% -    4%)
               OrHighLow       24.19      (7.1%)       24.36      (7.4%)    0.7% ( -12% -   16%)
              OrHighHigh       14.32      (6.6%)       14.45      (7.3%)    0.9% ( -12% -   15%)
            HighSpanNear        3.30      (3.9%)        3.33      (3.1%)    0.9% (  -5% -    8%)
             LowSpanNear        8.94      (4.3%)        9.04      (3.4%)    1.1% (  -6% -    9%)
              AndHighMed       78.66      (1.6%)       79.82      (1.4%)    1.5% (  -1% -    4%)
             MedSpanNear       30.45      (3.1%)       30.91      (2.8%)    1.5% (  -4% -    7%)
               MedPhrase      130.74      (5.8%)      133.03      (5.9%)    1.8% (  -9% -   14%)
              AndHighLow      571.07      (3.5%)      582.83      (2.8%)    2.1% (  -4% -    8%)
              HighPhrase       11.62      (6.1%)       11.88      (6.4%)    2.2% (  -9% -   15%)
{noformat}

The trunk .tip file was 23928963 bytes and the comp .tip was 17125654 bytes (~28% smaller!).

Then, base=trunk, comp=0 waste (no array arcs):

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev        Pct diff
                PKLookup      359.26      (2.3%)      152.91      (5.3%)  -57.4% ( -63% -  -51%)
                 Respell      168.82      (3.4%)      128.85      (2.2%)  -23.7% ( -28% -  -18%)
                  Fuzzy2       85.59      (8.2%)       75.03      (6.6%)  -12.3% ( -25% -    2%)
                 LowTerm      331.01      (7.9%)      324.30      (7.2%)   -2.0% ( -15% -   14%)
                HighTerm       54.75     (19.2%)       54.00     (19.2%)   -1.4% ( -33% -   45%)
                 MedTerm       68.68     (18.5%)       68.04     (19.0%)   -0.9% ( -32% -   44%)
              AndHighMed       80.50      (1.4%)       79.78      (1.4%)   -0.9% (  -3% -    1%)
                Wildcard       29.13      (4.5%)       28.89      (4.0%)   -0.8% (  -8% -    8%)
               LowPhrase       23.88      (2.3%)       23.69      (1.5%)   -0.8% (  -4% -    3%)
              AndHighLow      584.30      (2.9%)      582.17      (2.9%)   -0.4% (  -6% -    5%)
                  Fuzzy1       83.84      (6.9%)       83.62      (6.7%)   -0.3% ( -12% -   14%)
             AndHighHigh       19.88      (1.8%)       19.84      (1.5%)   -0.2% (  -3% -    3%)
         LowSloppyPhrase       25.61      (1.9%)       25.56      (2.0%)   -0.2% (  -4% -    3%)
             LowSpanNear        9.20      (3.6%)        9.20      (3.9%)    0.0% (  -7% -    7%)
             MedSpanNear       31.32      (2.7%)       31.36      (2.9%)    0.1% (  -5% -    5%)
         MedSloppyPhrase       28.67      (2.3%)       28.73      (2.2%)    0.2% (  -4% -    4%)
                 Prefix3       18.83      (4.9%)       18.88      (5.4%)    0.3% (  -9% -   11%)
              OrHighHigh       14.71      (7.5%)       14.76      (8.2%)    0.3% ( -14% -   17%)
            HighSpanNear        3.39      (3.0%)        3.40      (2.9%)    0.5% (  -5% -    6%)
               OrHighMed       26.28      (7.3%)       26.47      (8.1%)    0.7% ( -13% -   17%)
                  IntNRQ        3.44      (8.6%)        3.47      (9.3%)    0.9% ( -15% -   20%)
               OrHighLow       24.69      (7.3%)       24.97      (8.3%)    1.1% ( -13% -   18%)
        HighSloppyPhrase        0.80      (4.8%)        0.81      (4.7%)    1.3% (  -7% -   11%)
              HighPhrase       11.89      (6.1%)       12.20      (7.4%)    2.6% ( -10% -   17%)
               MedPhrase      134.04      (5.5%)      138.07      (6.4%)    3.0% (  -8% -   15%)
{noformat}

The comp .tip file was 16806594 bytes (just a bit smaller than "up to 25% waste").

Curious how PKLookup was barely affected by "up to 25% waste" but
heavily affected by "0 waste", and how Respell/Fuzzy2 lost performance
going to "up to 25% waste" and then didn't lose much more going to "0
waste".  I suspect PKLookup will be sensitive to the key structure
(luceneutil uses base-36 keys)...
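
For context, array arcs are what make that difference: when a node's
arcs are written as an array, every arc record gets the same width, so
a lookup can binary-search for the next label instead of scanning the
arcs one by one.  A minimal illustrative sketch (NOT Lucene's actual
code; the one-byte-label record layout is invented for illustration):

{noformat}
// Fixed-width arc records allow direct addressing of the i-th arc, so the
// reader can binary-search by label; variable-width records force a linear
// scan.  Hypothetical layout: the label is the first byte of each record.
static int findArc(byte[] node, int numArcs, int bytesPerArc, int label) {
  int lo = 0, hi = numArcs - 1;
  while (lo <= hi) {
    int mid = (lo + hi) >>> 1;
    int midLabel = node[mid * bytesPerArc] & 0xFF;
    if (midLabel < label) lo = mid + 1;
    else if (midLabel > label) hi = mid - 1;
    else return mid * bytesPerArc;   // byte offset of the matching arc
  }
  return -1;                         // no arc with this label
}
{noformat}

PKLookup walks one node per key byte, so losing the binary search at
every node is consistent with the -57.4% hit above.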


Next I tested Kuromoji, just tokenizing the first 100K docs from jawiki:

  * trunk (lots of waste): TokenInfoFST was 1535643 bytes, tokenizing 100K docs took 163.3 sec
  * up to 25% waste: TokenInfoFST was 1309108 bytes, tokenizing 100K docs took 215.3 sec
  * 0 waste: TokenInfoFST was 1228847 bytes, tokenizing 100K docs took 218.0 sec

Looks like Kuromoji sees good gains from the > 25%-waste arcs... but to
keep this in perspective: in a "real" app you index after tokenizing,
and that indexing time will dwarf the tokenization time.  But then on
the flip side we are "only" talking about ~221 KB of savings
(1535643 - 1309108 = 226535 bytes) ...
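
For concreteness, the tokenize-only test boils down to a loop like the
sketch below (assuming the 4.x Kuromoji analyzer API; the jawiki corpus
reading and timing code are omitted, and the Version constant is just
illustrative):

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Tokenize every document, counting tokens; the TokenInfoFST lookups
// happen inside incrementToken() during segmentation.
static long tokenizeAll(Iterable<String> docs) throws IOException {
  Analyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_40);
  long tokenCount = 0;
  for (String doc : docs) {
    TokenStream ts = analyzer.tokenStream("body", new StringReader(doc));
    ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      tokenCount++;
    }
    ts.end();
    ts.close();
  }
  return tokenCount;
}
{noformat}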

Finally, I tested building a no-outputs FST for all Wikipedia English
terms (9.8 M terms):

  * trunk (lots of waste): 59267794 bytes, 566 nsec per lookup
  * up to 25% waste: 58509011 bytes, 567 nsec per lookup
  * 0 waste: 56413148 bytes, 1808 nsec per lookup

For this case it looks like you keep all of the lookup-speed benefit by
allowing only up to 25% waste (though you also keep most of the extra
bytes?).
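
In case anyone wants to reproduce that last test, a minimal sketch of
building the no-outputs FST (class usage per the 4.x FST API; the
timing harness is not shown):

{noformat}
import java.io.IOException;
import java.util.Collection;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.NoOutputs;
import org.apache.lucene.util.fst.Util;

// Membership-only FST: every term maps to the NoOutputs singleton, so
// the arcs carry no outputs at all.
static FST<Object> buildTermsFST(Collection<BytesRef> sortedTerms) throws IOException {
  NoOutputs outputs = NoOutputs.getSingleton();
  Builder<Object> builder = new Builder<Object>(FST.INPUT_TYPE.BYTE1, outputs);
  IntsRef scratch = new IntsRef();
  for (BytesRef term : sortedTerms) {     // terms must arrive in sorted order
    builder.add(Util.toIntsRef(term, scratch), outputs.getNoOutput());
  }
  return builder.finish();
}

// Per-lookup cost (the nsec numbers above) is then roughly:
//   boolean exists = Util.get(fst, term) != null;
{noformat}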

                
> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST 
> writes the arcs as an array (each arc gets N bytes), so we can e.g. 
> binary-search on lookup.
> The problem is that N is set to max(numBytesPerArc), so if you have an 
> outlier arc, e.g. one with a big output, you can waste many bytes for all 
> the other arcs that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that the 
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass 
> about which nodes wasted the most, and then in the second pass (pack) we 
> could set the threshold based on the top X% of nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
> more uniform ...
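
To make the addNode idea above concrete, a hypothetical sketch of the
threshold check (names and structure invented for illustration; this is
not the attached patch):

{noformat}
// Before padding every arc of a node to max(numBytesPerArc), estimate the
// waste; only expand to fixed-width array arcs when the waste stays under
// the threshold (e.g. maxWasteRatio = 0.25 for "up to 25% waste").
static boolean shouldExpandToArrayArcs(int[] bytesPerArc, int numArcs, double maxWasteRatio) {
  int maxBytes = 0;
  long totalBytes = 0;
  for (int i = 0; i < numArcs; i++) {
    maxBytes = Math.max(maxBytes, bytesPerArc[i]);
    totalBytes += bytesPerArc[i];
  }
  long expandedSize = (long) maxBytes * numArcs;  // every arc padded to the widest
  long wastedBytes = expandedSize - totalBytes;   // padding bytes carrying no data
  return wastedBytes <= maxWasteRatio * expandedSize;
}
{noformat}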
