[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8781:
---------------------------------
    Fix Version/s:     (was: 8.x)
                   8.2

> Explore FST direct array arc encoding 
> --------------------------------------
>
>                 Key: LUCENE-8781
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8781
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Assignee: Dawid Weiss
>            Priority: Major
>             Fix For: master (9.0), 8.2
>
>         Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't include here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~170000 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison of showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2    ||   load=4     ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvment in 
> some of the queries.
> {noformat}
>                     Task    QPS base      StdDev    QPS opto      StdDev      
>           Pct diff
>               OrHighHigh        6.93      (3.0%)        6.89      (3.1%)   
> -0.5% (  -6% -    5%)
>                OrHighMed       45.15      (3.9%)       44.92      (3.5%)   
> -0.5% (  -7% -    7%)
>                 Wildcard        8.72      (4.7%)        8.69      (4.6%)   
> -0.4% (  -9% -    9%)
>               AndHighLow      274.11      (2.6%)      273.58      (3.1%)   
> -0.2% (  -5% -    5%)
>                OrHighLow      241.41      (1.9%)      241.11      (3.5%)   
> -0.1% (  -5% -    5%)
>               AndHighMed       52.23      (4.1%)       52.41      (5.3%)    
> 0.3% (  -8% -   10%)
>                  MedTerm     1026.24      (3.1%)     1030.52      (4.3%)    
> 0.4% (  -6% -    8%)
>                 HighTerm     1111.10      (3.4%)     1116.70      (4.0%)    
> 0.5% (  -6% -    8%)
>    HighTermDayOfYearSort       14.59      (8.2%)       14.73      (9.3%)    
> 1.0% ( -15% -   20%)
>              AndHighHigh       13.45      (6.2%)       13.61      (4.4%)    
> 1.2% (  -8% -   12%)
>        HighTermMonthSort       63.09     (12.5%)       64.13     (10.9%)    
> 1.6% ( -19% -   28%)
>                  LowTerm     1338.94      (3.3%)     1383.90      (5.5%)    
> 3.4% (  -5% -   12%)
>                 PKLookup      120.45      (2.5%)      130.91      (3.5%)    
> 8.7% (   2% -   15%)
> {noformat}
> h4.FST perf tests
> I ran LookupBenchmarkTest to see the impact on the suggesters which make 
> heavy use of FSTs. Some show little or no improvement, but in some cases 
> there are substantial gains.
> !FST-2-4.png!
> !FST-6-9.png!
> !FST-size.png!
>   chart TK
> h3. API change / implementation notes
> The only change in the public FST API is that the Builder constructor now 
> takes an additional
> boolean ("useDirectArcAddressing") controlling whether or not this new 
> optimization is
> applied. However, FST's internal details are not really hidden, so in 
> practice any change to its
> encoding can have ripple effects in other classes.
> This is because the FST decoding is repeated in several places in the code 
> base, sometimes with
> subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but not 
> shared, traversal code,
> and there are a few other places this same or similar decoding algorithm 
> appears as well (eg
> blocktreeords codec). I do think it would be useful to make this code more 
> DRY, and to hide the
> traversal implementation within a single class (or at least in the package), 
> but it seemed better to
> do that separately. I also think there are possibilities for improving this 
> encoding further, maybe
> by encoding Arcs in contiguous blocks with gaps in between, but before 
> attempting anything like
> that, I think it would good to do that refactor to make the code in its 
> current form easier to
> understand and work with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to