Oh that sounds possibly bad. Thanks for reporting. I'll try to take a look soon, but am about to travel ... Giving a talk at Berlin buzzwords! So it could be some time before I can really dig into it.
On Mon, Jun 10, 2019, 5:37 PM David Smiley (JIRA) <[email protected]> wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > David Smiley reopened LUCENE-8781: > ---------------------------------- > > I set out to do some benchmarking today with toggling this setting on the > "FST50" PostingsFormat in org.apache.lucene.codecs.memory. In fst.Builder > I toggled the default setting of the shorter constructor, and then I ran > the tests with {{-Dtests.postingsformat=FST50}}. I got a couple assertion > errors from TestTermsEnum and TestSpanMultiTermQueryWrapper. > {noformat} > java.lang.AssertionError > at > __randomizedtesting.SeedInfo.seed([608D7E34838A15F1:84E00F70BA72F54D]:0) > at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:983) > at org.apache.lucene.util.fst.Util.readCeilArc(Util.java:992) > {noformat} > and > {noformat} > ava.lang.ArrayIndexOutOfBoundsException: Index -2147483405 out of bounds > for length 256 > > at > __randomizedtesting.SeedInfo.seed([2AA29F5F1414EE75:B06C6AD7E182571E]:0) > at > org.apache.lucene.util.fst.ReverseBytesReader.readByte(ReverseBytesReader.java:31) > at org.apache.lucene.util.fst.FST.readLabel(FST.java:507) > at org.apache.lucene.util.fst.Util.readCeilArc(Util.java:973) > {noformat} > > Due to the complexities of the FST internals and that some callers make > internal assumptions, it's not clear if this problem here might also occur > for the places where we enable this direct arc encoding. So I consider > this a bug. > > > Explore FST direct array arc encoding > > -------------------------------------- > > > > Key: LUCENE-8781 > > URL: https://issues.apache.org/jira/browse/LUCENE-8781 > > Project: Lucene - Core > > Issue Type: Improvement > > Reporter: Mike Sokolov > > Assignee: Dawid Weiss > > Priority: Major > > Fix For: master (9.0), 8.2 > > > > Attachments: FST-2-4.png, FST-6-9.png, FST-size.png > > > > Time Spent: 3h 20m > > Remaining Estimate: 0h > > > > This issue is for exploring an alternate FST encoding of Arcs as > full-sized arrays so Arcs are addressed directly by label, avoiding binary > search that we use today for arrays of Arcs. PR: > https://github.com/apache/lucene-solr/pull/657 > > h3. Testing > > ant test passes. I added some unit tests that were helpful in uncovering > bugs while > > implementing which are more difficult to chase down when uncovered by > the randomized testing we already do. They don't really test anything new; > they're just more focused. > > I'm not sure why, but ant precommit failed for me with: > > {noformat} > > ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls > failed while scanning class > 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' > (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: > info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about > referenced class 'info.ganglia.gmetric4j.gmetric.GMetric') > > {noformat} > > I also got Test2BFST running (it was originally timing out due to > excessive calls to ramBytesUsage(), which seems to have gotten slow), and > it passed; that change isn't include here. > > h4. Micro-benchmark > > I timed lookups in FST via FSTEnum.seekExact in a unit test under > various conditions. > > h5. English words > > A test of looking up existing words in a dictionary of ~170000 English > words shows improvements; the numbers listed are % change in FST size, time > to look up (FSTEnum.seekExact) words that are in the dict, and time to look > up random strings that are not in the dict. The comparison is against the > current codebase with the optimization disabled. A separate comparison of > showed no significant change of the baseline (no opto applied) vs the > current master FST impl with no code changes applied. > > || load=2 || load=4 || load=16 || > > | +4, -6, -7 | +18, -11, -8 | +22, -11.5, -7 | > > The "load factor" used for those measurements controls when direct array > arc encoding is used; > > namely when the number of outgoing arcs was > load * (max label - min > label). > > h5. sequential and random terms > > The same test, with terms being a sequence of integers as strings shows > a larger improvement, around 20% (load=4). This is presumably the best case > for this delta, where every Arc is encoded as a direct lookup. > > When random lowercase ASCII strings are used, a smaller improvement of > around 4% is seen. > > h4. luceneutil > > Testing w/luceneutil (wikimediumall) we see improvements mostly in the > PKLookup case. Other results seem noisy, with perhaps a small improvment in > some of the queries. > > {noformat} > > Task QPS base StdDev QPS opto > StdDev Pct diff > > OrHighHigh 6.93 (3.0%) 6.89 > (3.1%) -0.5% ( -6% - 5%) > > OrHighMed 45.15 (3.9%) 44.92 > (3.5%) -0.5% ( -7% - 7%) > > Wildcard 8.72 (4.7%) 8.69 > (4.6%) -0.4% ( -9% - 9%) > > AndHighLow 274.11 (2.6%) 273.58 > (3.1%) -0.2% ( -5% - 5%) > > OrHighLow 241.41 (1.9%) 241.11 > (3.5%) -0.1% ( -5% - 5%) > > AndHighMed 52.23 (4.1%) 52.41 > (5.3%) 0.3% ( -8% - 10%) > > MedTerm 1026.24 (3.1%) 1030.52 > (4.3%) 0.4% ( -6% - 8%) > > HighTerm 1111.10 (3.4%) 1116.70 > (4.0%) 0.5% ( -6% - 8%) > > HighTermDayOfYearSort 14.59 (8.2%) 14.73 > (9.3%) 1.0% ( -15% - 20%) > > AndHighHigh 13.45 (6.2%) 13.61 > (4.4%) 1.2% ( -8% - 12%) > > HighTermMonthSort 63.09 (12.5%) 64.13 > (10.9%) 1.6% ( -19% - 28%) > > LowTerm 1338.94 (3.3%) 1383.90 > (5.5%) 3.4% ( -5% - 12%) > > PKLookup 120.45 (2.5%) 130.91 > (3.5%) 8.7% ( 2% - 15%) > > {noformat} > > h4.FST perf tests > > I ran LookupBenchmarkTest to see the impact on the suggesters which make > heavy use of FSTs. Some show little or no improvement, but in some cases > there are substantial gains. > > !FST-2-4.png! > > !FST-6-9.png! > > !FST-size.png! > > chart TK > > h3. API change / implementation notes > > The only change in the public FST API is that the Builder constructor > now takes an additional > > boolean ("useDirectArcAddressing") controlling whether or not this new > optimization is > > applied. However, FST's internal details are not really hidden, so in > practice any change to its > > encoding can have ripple effects in other classes. > > This is because the FST decoding is repeated in several places in the > code base, sometimes with > > subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but > not shared, traversal code, > > and there are a few other places this same or similar decoding algorithm > appears as well (eg > > blocktreeords codec). I do think it would be useful to make this code > more DRY, and to hide the > > traversal implementation within a single class (or at least in the > package), but it seemed better to > > do that separately. I also think there are possibilities for improving > this encoding further, maybe > > by encoding Arcs in contiguous blocks with gaps in between, but before > attempting anything like > > that, I think it would good to do that refactor to make the code in its > current form easier to > > understand and work with. > > > > -- > This message was sent by Atlassian JIRA > (v7.6.3#76005) > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
