Re: [jira] [Reopened] (LUCENE-8781) Explore FST direct array arc encoding

Michael Sokolov Tue, 11 Jun 2019 04:46:30 -0700

Oh that sounds possibly  bad. Thanks for reporting.  I'll try to take a
look soon, but am about to travel ... Giving a talk at Berlin buzzwords! So
it could be some time before I can really dig into it.


On Mon, Jun 10, 2019, 5:37 PM David Smiley (JIRA) <[email protected]> wrote:

>
>      [
> https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> David Smiley reopened LUCENE-8781:
> ----------------------------------
>
> I set out to do some benchmarking today with toggling this setting on the
> "FST50" PostingsFormat in org.apache.lucene.codecs.memory.  In fst.Builder
> I toggled the default setting of the shorter constructor, and then I ran
> the tests with {{-Dtests.postingsformat=FST50}}.  I got a couple assertion
> errors from TestTermsEnum and TestSpanMultiTermQueryWrapper.
> {noformat}
> java.lang.AssertionError
>         at
> __randomizedtesting.SeedInfo.seed([608D7E34838A15F1:84E00F70BA72F54D]:0)
>         at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:983)
>         at org.apache.lucene.util.fst.Util.readCeilArc(Util.java:992)
> {noformat}
> and
> {noformat}
> ava.lang.ArrayIndexOutOfBoundsException: Index -2147483405 out of bounds
> for length 256
>
>         at
> __randomizedtesting.SeedInfo.seed([2AA29F5F1414EE75:B06C6AD7E182571E]:0)
>         at
> org.apache.lucene.util.fst.ReverseBytesReader.readByte(ReverseBytesReader.java:31)
>         at org.apache.lucene.util.fst.FST.readLabel(FST.java:507)
>         at org.apache.lucene.util.fst.Util.readCeilArc(Util.java:973)
> {noformat}
>
> Due to the complexities of the FST internals and that some callers make
> internal assumptions, it's not clear if this problem here might also occur
> for the places where we enable this direct arc encoding.  So I consider
> this a bug.
>
> > Explore FST direct array arc encoding
> > --------------------------------------
> >
> >                 Key: LUCENE-8781
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8781
> >             Project: Lucene - Core
> >          Issue Type: Improvement
> >            Reporter: Mike Sokolov
> >            Assignee: Dawid Weiss
> >            Priority: Major
> >             Fix For: master (9.0), 8.2
> >
> >         Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
> >
> >          Time Spent: 3h 20m
> >  Remaining Estimate: 0h
> >
> > This issue is for exploring an alternate FST encoding of Arcs as
> full-sized arrays so Arcs are addressed directly by label, avoiding binary
> search that we use today for arrays of Arcs. PR:
> https://github.com/apache/lucene-solr/pull/657
> > h3. Testing
> > ant test passes. I added some unit tests that were helpful in uncovering
> bugs while
> > implementing which are more difficult to chase down when uncovered by
> the randomized testing we already do. They don't really test anything new;
> they're just more focused.
> > I'm not sure why, but ant precommit failed for me with:
> > {noformat}
> >  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls
> failed while scanning class
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest'
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException:
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> > {noformat}
> > I also got Test2BFST running (it was originally timing out due to
> excessive calls to ramBytesUsage(), which seems to have gotten slow), and
> it passed; that change isn't include here.
> > h4. Micro-benchmark
> > I timed lookups in FST via FSTEnum.seekExact in a unit test under
> various conditions.
> > h5. English words
> > A test of looking up existing words in a dictionary of ~170000 English
> words shows improvements; the numbers listed are % change in FST size, time
> to look up (FSTEnum.seekExact) words that are in the dict, and time to look
> up random strings that are not in the dict. The comparison is against the
> current codebase with the optimization disabled. A separate comparison of
> showed no significant change of the baseline (no opto applied) vs the
> current master FST impl with no code changes applied.
> > ||  load=2    ||   load=4     ||  load=16 ||
> > | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> > The "load factor" used for those measurements controls when direct array
> arc encoding is used;
> > namely when the number of outgoing arcs was > load * (max label - min
> label).
> > h5. sequential and random terms
> > The same test, with terms being a sequence of integers as strings shows
> a larger improvement, around 20% (load=4). This is presumably the best case
> for this delta, where every Arc is encoded as a direct lookup.
> > When random lowercase ASCII strings are used, a smaller improvement of
> around 4% is seen.
> > h4. luceneutil
> > Testing w/luceneutil (wikimediumall) we see improvements mostly in the
> PKLookup case. Other results seem noisy, with perhaps a small improvment in
> some of the queries.
> > {noformat}
> >                     Task    QPS base      StdDev    QPS opto
> StdDev                Pct diff
> >               OrHighHigh        6.93      (3.0%)        6.89
> (3.1%)   -0.5% (  -6% -    5%)
> >                OrHighMed       45.15      (3.9%)       44.92
> (3.5%)   -0.5% (  -7% -    7%)
> >                 Wildcard        8.72      (4.7%)        8.69
> (4.6%)   -0.4% (  -9% -    9%)
> >               AndHighLow      274.11      (2.6%)      273.58
> (3.1%)   -0.2% (  -5% -    5%)
> >                OrHighLow      241.41      (1.9%)      241.11
> (3.5%)   -0.1% (  -5% -    5%)
> >               AndHighMed       52.23      (4.1%)       52.41
> (5.3%)    0.3% (  -8% -   10%)
> >                  MedTerm     1026.24      (3.1%)     1030.52
> (4.3%)    0.4% (  -6% -    8%)
> >                 HighTerm     1111.10      (3.4%)     1116.70
> (4.0%)    0.5% (  -6% -    8%)
> >    HighTermDayOfYearSort       14.59      (8.2%)       14.73
> (9.3%)    1.0% ( -15% -   20%)
> >              AndHighHigh       13.45      (6.2%)       13.61
> (4.4%)    1.2% (  -8% -   12%)
> >        HighTermMonthSort       63.09     (12.5%)       64.13
>  (10.9%)    1.6% ( -19% -   28%)
> >                  LowTerm     1338.94      (3.3%)     1383.90
> (5.5%)    3.4% (  -5% -   12%)
> >                 PKLookup      120.45      (2.5%)      130.91
> (3.5%)    8.7% (   2% -   15%)
> > {noformat}
> > h4.FST perf tests
> > I ran LookupBenchmarkTest to see the impact on the suggesters which make
> heavy use of FSTs. Some show little or no improvement, but in some cases
> there are substantial gains.
> > !FST-2-4.png!
> > !FST-6-9.png!
> > !FST-size.png!
> >   chart TK
> > h3. API change / implementation notes
> > The only change in the public FST API is that the Builder constructor
> now takes an additional
> > boolean ("useDirectArcAddressing") controlling whether or not this new
> optimization is
> > applied. However, FST's internal details are not really hidden, so in
> practice any change to its
> > encoding can have ripple effects in other classes.
> > This is because the FST decoding is repeated in several places in the
> code base, sometimes with
> > subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but
> not shared, traversal code,
> > and there are a few other places this same or similar decoding algorithm
> appears as well (eg
> > blocktreeords codec). I do think it would be useful to make this code
> more DRY, and to hide the
> > traversal implementation within a single class (or at least in the
> package), but it seemed better to
> > do that separately. I also think there are possibilities for improving
> this encoding further, maybe
> > by encoding Arcs in contiguous blocks with gaps in between, but before
> attempting anything like
> > that, I think it would good to do that refactor to make the code in its
> current form easier to
> > understand and work with.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [jira] [Reopened] (LUCENE-8781) Explore FST direct array arc encoding

Reply via email to