[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552184#comment-13552184 ]
Michael McCandless commented on LUCENE-4682:
--------------------------------------------

OK, I ran several performance tests comparing trunk (lots of waste), "up to 25% waste", and 0 waste (no array arcs).

First, I tested luceneutil on all of English Wikipedia, on an optimized index. Base = trunk (lots of waste), comp = up to 25% waste:

{noformat}
            Task  QPS base  StdDev  QPS comp  StdDev              Pct diff
         Respell    167.37   (3.3%)   128.36   (3.2%)  -23.3% ( -28% - -17%)
          Fuzzy2     83.54   (8.0%)    74.17   (7.2%)  -11.2% ( -24% -   4%)
        PKLookup    359.62   (2.9%)   346.88   (4.5%)   -3.5% ( -10% -   4%)
        Wildcard     28.48   (4.3%)    28.25   (3.9%)   -0.8% (  -8% -   7%)
          Fuzzy1     82.77   (7.1%)    82.36   (7.7%)   -0.5% ( -14% -  15%)
HighSloppyPhrase      0.78   (4.6%)     0.78   (4.5%)   -0.4% (  -9% -   9%)
          IntNRQ      3.38   (8.1%)     3.37   (8.1%)   -0.2% ( -15% -  17%)
 MedSloppyPhrase     28.13   (2.4%)    28.13   (2.4%)   -0.0% (  -4% -   4%)
 LowSloppyPhrase     25.10   (1.8%)    25.11   (1.8%)    0.0% (  -3% -   3%)
         Prefix3     18.40   (4.8%)    18.41   (5.3%)    0.1% (  -9% -  10%)
         MedTerm     65.42  (18.2%)    65.46  (17.7%)    0.1% ( -30% -  43%)
         LowTerm    316.34   (6.9%)   317.45   (7.1%)    0.4% ( -12% -  15%)
       LowPhrase     23.30   (2.1%)    23.39   (1.8%)    0.4% (  -3% -   4%)
        HighTerm     52.11  (18.6%)    52.33  (18.5%)    0.4% ( -30% -  46%)
       OrHighMed     25.65   (6.7%)    25.76   (7.2%)    0.4% ( -12% -  15%)
     AndHighHigh     19.42   (1.8%)    19.52   (2.0%)    0.5% (  -3% -   4%)
       OrHighLow     24.19   (7.1%)    24.36   (7.4%)    0.7% ( -12% -  16%)
      OrHighHigh     14.32   (6.6%)    14.45   (7.3%)    0.9% ( -12% -  15%)
    HighSpanNear      3.30   (3.9%)     3.33   (3.1%)    0.9% (  -5% -   8%)
     LowSpanNear      8.94   (4.3%)     9.04   (3.4%)    1.1% (  -6% -   9%)
      AndHighMed     78.66   (1.6%)    79.82   (1.4%)    1.5% (  -1% -   4%)
     MedSpanNear     30.45   (3.1%)    30.91   (2.8%)    1.5% (  -4% -   7%)
       MedPhrase    130.74   (5.8%)   133.03   (5.9%)    1.8% (  -9% -  14%)
      AndHighLow    571.07   (3.5%)   582.83   (2.8%)    2.1% (  -4% -   8%)
      HighPhrase     11.62   (6.1%)    11.88   (6.4%)    2.2% (  -9% -  15%)
{noformat}

The trunk .tip file was 23928963 bytes and comp was 17125654 bytes (~28% smaller!).
Then, base = trunk, comp = 0 waste (no array arcs):

{noformat}
            Task  QPS base  StdDev  QPS comp  StdDev              Pct diff
        PKLookup    359.26   (2.3%)   152.91   (5.3%)  -57.4% ( -63% - -51%)
         Respell    168.82   (3.4%)   128.85   (2.2%)  -23.7% ( -28% - -18%)
          Fuzzy2     85.59   (8.2%)    75.03   (6.6%)  -12.3% ( -25% -   2%)
         LowTerm    331.01   (7.9%)   324.30   (7.2%)   -2.0% ( -15% -  14%)
        HighTerm     54.75  (19.2%)    54.00  (19.2%)   -1.4% ( -33% -  45%)
         MedTerm     68.68  (18.5%)    68.04  (19.0%)   -0.9% ( -32% -  44%)
      AndHighMed     80.50   (1.4%)    79.78   (1.4%)   -0.9% (  -3% -   1%)
        Wildcard     29.13   (4.5%)    28.89   (4.0%)   -0.8% (  -8% -   8%)
       LowPhrase     23.88   (2.3%)    23.69   (1.5%)   -0.8% (  -4% -   3%)
      AndHighLow    584.30   (2.9%)   582.17   (2.9%)   -0.4% (  -6% -   5%)
          Fuzzy1     83.84   (6.9%)    83.62   (6.7%)   -0.3% ( -12% -  14%)
     AndHighHigh     19.88   (1.8%)    19.84   (1.5%)   -0.2% (  -3% -   3%)
 LowSloppyPhrase     25.61   (1.9%)    25.56   (2.0%)   -0.2% (  -4% -   3%)
     LowSpanNear      9.20   (3.6%)     9.20   (3.9%)    0.0% (  -7% -   7%)
     MedSpanNear     31.32   (2.7%)    31.36   (2.9%)    0.1% (  -5% -   5%)
 MedSloppyPhrase     28.67   (2.3%)    28.73   (2.2%)    0.2% (  -4% -   4%)
         Prefix3     18.83   (4.9%)    18.88   (5.4%)    0.3% (  -9% -  11%)
      OrHighHigh     14.71   (7.5%)    14.76   (8.2%)    0.3% ( -14% -  17%)
    HighSpanNear      3.39   (3.0%)     3.40   (2.9%)    0.5% (  -5% -   6%)
       OrHighMed     26.28   (7.3%)    26.47   (8.1%)    0.7% ( -13% -  17%)
          IntNRQ      3.44   (8.6%)     3.47   (9.3%)    0.9% ( -15% -  20%)
       OrHighLow     24.69   (7.3%)    24.97   (8.3%)    1.1% ( -13% -  18%)
HighSloppyPhrase      0.80   (4.8%)     0.81   (4.7%)    1.3% (  -7% -  11%)
      HighPhrase     11.89   (6.1%)    12.20   (7.4%)    2.6% ( -10% -  17%)
       MedPhrase    134.04   (5.5%)   138.07   (6.4%)    3.0% (  -8% -  15%)
{noformat}

The comp .tip file was 16806594 bytes (just a bit smaller than "up to 25% waste").

Curious how PKLookup was barely affected by "up to 25% waste" but heavily affected by "0 waste", and how Respell/Fuzzy2 lost performance going to "up to 25% waste" and then didn't lose much more going to "0 waste". I suspect PKLookup will be sensitive to the key structure (luceneutil uses base-36 keys)...

Next, I tested Kuromoji, just tokenizing the first 100K docs from jawiki.
* trunk (lots of waste): TokenInfoFST was 1535643 bytes; tokenizing 100K docs took 163.3 sec
* up to 25% waste: TokenInfoFST was 1309108 bytes; tokenizing 100K docs took 215.3 sec
* 0 waste: TokenInfoFST was 1228847 bytes; tokenizing 100K docs took 218.0 sec

Looks like Kuromoji sees good gains from the >25% waste arcs... but to keep this in perspective, in a "real" app you index after tokenizing, and that indexing time will dwarf the tokenization time. But then, on the flip side, we are "only" talking about ~221 KB...

Finally, I tested building a no-outputs FST for all English Wikipedia terms (9.8 M terms):

* trunk (lots of waste): 59267794 bytes, 566 nsec per lookup
* up to 25% waste: 58509011 bytes, 567 nsec per lookup
* 0 waste: 56413148 bytes, 1808 nsec per lookup

For this case it looks like you get all the lookup-speed benefit by allowing only 25% waste (though most of the byte increase also?).

> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>          Key: LUCENE-4682
>          URL: https://issues.apache.org/jira/browse/LUCENE-4682
>      Project: Lucene - Core
>   Issue Type: Improvement
>   Components: core/FSTs
>     Reporter: Michael McCandless
>     Priority: Minor
>  Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
> When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. binary-search on lookup.
>
> The problem is that N is set to max(numBytesPerArc), so if you have an outlier arc, e.g. one with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes.
>
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs a total size of 1535612 = ~18% wasted.
>
> It would be nice to reduce this.
>
> One thing we could do without packing: in addNode, if we detect that the number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (pack) we could set the threshold based on the top X% of nodes that waste...
>
> Another idea is maybe to deref large outputs, so that numBytesPerArc is more uniform...
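The thresholded-expansion idea from the description can be sketched roughly as follows. This is a minimal illustration, not Lucene's actual addNode code: `shouldExpand` and `ACCEPTABLE_WASTE_RATIO` are hypothetical names, and the per-arc byte sizes would in reality come from serializing each arc.

```java
public class ArcWasteSketch {
  // Hypothetical threshold: allow up to 25% of the expanded node's
  // bytes to be padding (the "up to 25% waste" variant tested above).
  static final double ACCEPTABLE_WASTE_RATIO = 0.25;

  /**
   * Given the serialized size of each outgoing arc of one node, decide
   * whether to expand the node into fixed-width array arcs.  Array arcs
   * allocate max(numBytesPerArc) bytes for every arc, so the waste is
   * the sum of (max - size) over all arcs.
   */
  static boolean shouldExpand(int[] numBytesPerArc) {
    int max = 0;
    for (int n : numBytesPerArc) {
      max = Math.max(max, n);
    }
    long waste = 0;
    for (int n : numBytesPerArc) {
      waste += max - n;
    }
    long totalArcBytes = (long) max * numBytesPerArc.length;
    // Skip the expansion when an outlier arc would cause too much padding.
    return waste <= ACCEPTABLE_WASTE_RATIO * totalArcBytes;
  }

  public static void main(String[] args) {
    // Mostly 3-byte arcs plus one 20-byte outlier (e.g. a big output):
    // waste = 4 * (20 - 3) = 68 of 100 bytes, so don't expand.
    System.out.println(shouldExpand(new int[] {3, 3, 3, 3, 20}));
    // Uniform arcs waste nothing, so expanding is fine.
    System.out.println(shouldExpand(new int[] {3, 3, 3, 3, 3}));
  }
}
```

With this kind of check, a node dominated by one outlier arc falls back to the variable-width encoding (losing binary search for that node), which is the trade-off the benchmarks above are measuring.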