[jira] [Updated] (LUCENE-3729) Allow using FST to hold terms data in DocValues.BYTES_*_SORTED

Michael McCandless (Updated) (JIRA) Sun, 29 Jan 2012 10:03:36 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-3729:
---------------------------------------

    Attachment: LUCENE-3729.patch

Prototype patch just for testing...

As a quick test for viability here... I hacked
FieldCacheImpl.DocTermsIndexImpl to build an FST to map term <-> ord,
and changed the lookup method to use the new Util.getByOutput method.
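
To make the ord -> term direction concrete, here is a minimal sketch of a reverse lookup by ord. It is a toy trie, not an FST and not Lucene code: the class OrdTrie and its methods are hypothetical names, and each node stores the absolute smallest ord in its subtree, where a real FST would typically carry incremental outputs on its arcs. The idea it illustrates is the one a by-output lookup can exploit: with terms added in sorted order, every subtree covers a contiguous ord range, so the walk can pick the last child whose smallest ord is <= the target.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for an FST (hypothetical, not Lucene code): a trie over sorted
// terms in which every node remembers the smallest ord found in its subtree.
final class OrdTrie {
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        final int minOrd;   // smallest ord of any term ending at or below this node
        boolean isFinal;    // true if some term ends exactly at this node
        Node(int minOrd) { this.minOrd = minOrd; }
    }

    private final Node root = new Node(0);
    private int size;

    /** Adds a term; terms must arrive in strictly increasing order. The ord is the insertion index. */
    void add(String term) {
        final int ord = size;
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.computeIfAbsent(term.charAt(i), c -> new Node(ord));
        }
        node.isFinal = true;
        size++;
    }

    /** term -> ord: walk the trie by characters; returns -1 if the term is absent. */
    int ordOf(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) return -1;
        }
        return node.isFinal ? node.minOrd : -1;
    }

    /** ord -> term: at each node, descend into the last child whose minOrd is <= ord. */
    String termOf(int ord) {
        if (ord < 0 || ord >= size) throw new IllegalArgumentException("ord out of range: " + ord);
        StringBuilder sb = new StringBuilder();
        Node node = root;
        while (!(node.isFinal && node.minOrd == ord)) {
            Map.Entry<Character, Node> pick = null;
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                if (e.getValue().minOrd > ord) break;  // children are in label order, so minOrd only grows
                pick = e;
            }
            sb.append(pick.getKey());
            node = pick.getValue();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OrdTrie trie = new OrdTrie();
        for (String t : new String[] {"apple", "apply", "bar", "baz", "bazaar"}) trie.add(t);
        System.out.println(trie.ordOf("baz") + " " + trie.termOf(4));
    }
}
```

Both directions cost time proportional to the term length rather than to the number of terms, which is the property that makes a single structure for term <-> ord attractive.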

Then I tested perf on 10M docs from Wikipedia:

{noformat}
                Task    QPS base  StdDev base   QPS fstfc  StdDev fstfc     Pct diff
         TermGroup1M       47.75         1.59       25.75          0.36  -48% - -43%
        TermBGroup1M       17.10         0.58       14.20          0.37  -21% - -11%
            PKLookup      158.73         6.07      155.84          3.00   -7% -   4%
       TermTitleSort       43.49         2.54       42.73          1.84  -11% -   8%
             Respell       81.13         3.24       80.67          3.83   -8% -   8%
                Term      106.13         3.59      106.03          1.28   -4% -   4%
      TermBGroup1M1P       25.31         0.44       25.37          0.54   -3% -   4%
              Fuzzy2       55.32         1.21       55.76          2.55   -5% -   7%
              Fuzzy1       74.06         1.21       74.88          2.80   -4% -   6%
        SloppyPhrase        9.82         0.61        9.95          0.42   -8% -  12%
            SpanNear        3.39         0.16        3.47          0.15   -6% -  12%
              Phrase        9.29         0.69        9.66          0.69  -10% -  20%
            Wildcard       20.15         0.66       21.23          0.46    0% -  11%
         AndHighHigh       13.43         0.55       14.24          0.70   -3% -  15%
             Prefix3       10.05         0.53       10.70          0.19    0% -  14%
          AndHighMed       56.62         3.36       60.54          4.28   -6% -  21%
           OrHighMed       25.78         0.98       27.75          1.51   -1% -  17%
          OrHighHigh       10.97         0.41       11.82          0.63   -1% -  17%
              IntNRQ        9.74         0.81       10.83          0.26    0% -  24%
{noformat}
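
For reading the "Pct diff" column: the ranges look consistent with comparing the two means after shifting each by one standard deviation in opposite directions, then truncating toward zero. That formula is an assumption inferred from the numbers above, not something stated in this message, and the class and method names below are hypothetical. A minimal sketch:

```java
// Sketch: reproduce a "Pct diff" range from mean QPS and standard deviation.
// The formula is an assumption inferred from the table, not taken from the
// benchmark's source.
final class PctDiffRange {
    /** Pessimistic bound: slowest plausible candidate vs. fastest plausible baseline. */
    static double worst(double baseQps, double baseStdDev, double candQps, double candStdDev) {
        return 100.0 * ((candQps - candStdDev) / (baseQps + baseStdDev) - 1.0);
    }

    /** Optimistic bound: fastest plausible candidate vs. slowest plausible baseline. */
    static double best(double baseQps, double baseStdDev, double candQps, double candStdDev) {
        return 100.0 * ((candQps + candStdDev) / (baseQps - baseStdDev) - 1.0);
    }

    public static void main(String[] args) {
        // TermGroup1M row: base 47.75 +/- 1.59, fstfc 25.75 +/- 0.36
        double lo = worst(47.75, 1.59, 25.75, 0.36);
        double hi = best(47.75, 1.59, 25.75, 0.36);
        // The int casts truncate toward zero, which matches the table's rounding.
        System.out.printf("TermGroup1M: %d%% - %d%%%n", (int) lo, (int) hi);
    }
}
```

Under this reading, a row whose range spans zero (most of the table) shows no change that clears one standard deviation, while the grouping rows are slower even in the optimistic case.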

Two-pass grouping took a big hit... and single-pass grouping a moderate
hit... but TermTitleSort was a minor slowdown, which is good news.

The net RAM required across all segs for the title field FST was 30.2
MB, vs 46.5 MB for the current FieldCache terms storage (PagedBytes +
PackedInts), which is ~35% less.

The FST for the group-by fields used quite a bit more RAM (~60% more)
than PagedBytes + PackedInts, because these fields are actually
randomly generated unicode strings, which leave the FST few shared
prefixes or suffixes to compress...

I didn't change term -> ord lookup to use the FST (that requires
fixing the binarySearchLookup method), but we really should do this
"for real": right now it does an unnecessary binary search, with a
repeated ord -> term lookup at each step.  I.e., perf should be better
than above... grouping is a heavier user of binarySearchLookup than
sorting, so it should help recover some of that slowdown.
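
To show where the repeated ord -> term lookups come from, here is a minimal sketch of a term -> ord binary search that only has an ord -> term accessor to work with. It is not the actual binarySearchLookup code: the class name, the IntFunction parameter, and the return convention (borrowed from java.util.Arrays.binarySearch) are assumptions for illustration.

```java
import java.util.function.IntFunction;

// Sketch (hypothetical, not Lucene code): term -> ord lookup implemented as a
// binary search over ords, where every probe costs one ord -> term lookup.
final class OrdBinarySearch {
    /**
     * Returns the ord of key, or -(insertionPoint + 1) if key is absent
     * (same convention as java.util.Arrays.binarySearch).
     */
    static int binarySearchLookup(String key, int numOrd, IntFunction<String> termOf) {
        int low = 0;
        int high = numOrd - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            String midTerm = termOf.apply(mid);  // one ord -> term lookup per probe
            int cmp = midTerm.compareTo(key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1);
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c", "d", "e", "f", "g"};
        int[] probes = {0};
        int ord = binarySearchLookup("f", terms.length, o -> { probes[0]++; return terms[o]; });
        System.out.println("ord=" + ord + " probes=" + probes[0]);
    }
}
```

Each probe is cheap when ord -> term is an array read, but when it is itself a walk through an FST the search pays that walk roughly log2(numOrd) times per lookup; a direct term -> ord traversal of the FST would pay it once.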

Also, Util.getByOutput currently doesn't optimize for array'd
arcs... so if we fix that we should get some small perf gain.

To do this "for real" I think we should do it only with DocValues,
because the FST build time is relatively costly.

                
> Allow using FST to hold terms data in DocValues.BYTES_*_SORTED
> --------------------------------------------------------------
>
>                 Key: LUCENE-3729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3729
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3729.patch
>
>



        
