[ https://issues.apache.org/jira/browse/LUCENE-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637983#comment-13637983 ]
Han Jiang commented on LUCENE-2962:
-----------------------------------

A full summary of skip frequency in wikimedium.10M.nostopwords.tasks, and part of crazyRandomMinShouldMatch.tasks. The latter one is really crazy :)

The 'skip_len' is counted as (newDocUpto-docUpto) in Lucene41PostingsReader.*Enum.advance(target): 0 when the skip doesn't move, otherwise the number of docs skipped. I changed the code in luceneutil so that every query line is taken into account:

#query_category        #num_query  #num_called  #max_skip_len  #tot_skip_len  #avg_skip_len  #std_dev_skip_len
and_high_high:                500     18021935          14633      110997027       6.158996          25.283280
and_high_med:                 500      9145928          22730      233710779      25.553534          61.885853
and_high_low:                 500      1385125         215533     1073755035     775.204429        1606.345686
high_phrase:                   42       253569           3284        5113544      20.166282          56.904256
high_sloppy_phrase:            42      2441007           3284       11993572       4.913371          23.253660
high_span_near:                42      2362258           3284       11846707       5.014993          23.604965
low_phrase:                   500      6936508          21180      247018573      35.611373         103.751734
low_sloppy_phrase:            500     18170618          21180      298025713      16.401518          66.808480
low_span_near:                500     18100056          21180      296733920      16.394089          66.895263
med_phrase:                   500      4513849          26367      144556764      32.025166          83.814376
med_sloppy_phrase:            500     17683175          26367      197756027      11.183287          45.764898
med_span_near:                500     17503372          26367      196409780      11.221254          45.958612
10terms_0high_2msm:            22        10875          32768        2502731     230.136184        1319.894640
10terms_0high_3msm:            17        17127          15743         440149      25.699130         209.870841
10terms_0high_4msm:            27        27144          24192        2156919      79.462091         640.948445
10terms_0high_5msm:            21        19564          26479        1829846      93.531282         773.820054
10terms_0high_6msm:            27        17555          31232        1615071      92.000627         745.978516
10terms_0high_7msm:            27        16618          18688        1208893      72.745998         505.996915
10terms_0high_8msm:            25        10722          17024         817872      76.279799         451.833907
10terms_0high_9msm:            16         5371          11008         411776      76.666543         353.098379
10terms_0high_10msm:           21        10403          32768        7325395     704.161780        2504.260576
10terms_5high_2msm:            24       650096           2163        1832245       2.818422          18.513591
10terms_5high_5msm:            24      1123877         276224      128339073     114.193166         936.887693
10terms_5high_10msm:           24        14211        1663232      322730000   22709.872634      115150.031194

This led me to test whether a multi-level skip structure is really necessary for simpler queries like AndQuery and PhraseQuery. I set skipMultiplier=8000000 so that Lucene41SkipWriter won't create a skip list with more than one level; this run is marked 'comp'. A clean trunk (skipMultiplier=8) is used as 'base':

Task              QPS base   StdDev  QPS comp   StdDev  Pct diff
LowPhrase            34.86   (2.7%)     32.23   (1.6%)   -7.5% (-11% -  -3%)
LowTerm             335.88   (8.8%)    326.09   (7.8%)   -2.9% (-17% -  14%)
HighSpanNear          7.05   (2.2%)      6.97   (0.7%)   -1.1% ( -3% -   1%)
AndHighMed           52.22   (1.3%)     51.72   (1.0%)   -1.0% ( -3% -   1%)
MedSpanNear           4.30   (2.1%)      4.26   (0.7%)   -0.8% ( -3% -   2%)
LowSpanNear          42.46   (1.7%)     42.28   (0.6%)   -0.4% ( -2% -   1%)
Fuzzy2               59.56   (5.4%)     59.31   (4.7%)   -0.4% ( -9% -  10%)
LowSloppyPhrase      10.33   (2.6%)     10.30   (2.5%)   -0.3% ( -5% -   4%)
AndHighHigh          18.37   (0.6%)     18.33   (0.3%)   -0.2% ( -1% -   0%)
Fuzzy1               53.70   (5.3%)     53.59   (5.2%)   -0.2% (-10% -  10%)
HighPhrase            2.56   (6.5%)      2.56   (5.6%)   -0.2% (-11% -  12%)
HighTerm             57.36  (15.2%)     57.34  (15.0%)   -0.0% (-26% -  35%)
MedTerm              90.08  (13.9%)     90.30  (13.6%)    0.2% (-23% -  32%)
IntNRQ                2.82  (13.3%)      2.83  (11.7%)    0.3% (-21% -  29%)
MedPhrase            15.18   (8.6%)     15.23   (8.4%)    0.3% (-15% -  19%)
MedSloppyPhrase       2.17   (4.2%)      2.18   (3.7%)    0.6% ( -6% -   8%)
OrHighMed            20.30  (14.5%)     20.47  (14.4%)    0.8% (-24% -  34%)
Wildcard             21.53   (5.6%)     21.71   (4.9%)    0.8% ( -9% -  12%)
OrHighLow            17.26  (15.0%)     17.43  (15.0%)    1.0% (-25% -  36%)
HighSloppyPhrase      8.31   (4.2%)      8.39   (4.5%)    1.0% ( -7% -  10%)
Prefix3              22.70   (5.8%)     22.93   (5.1%)    1.0% ( -9% -  12%)
OrHighHigh           15.51  (14.7%)     15.69  (14.8%)    1.2% (-24% -  35%)
Respell              41.39   (3.4%)     42.01   (3.3%)    1.5% ( -5% -   8%)
AndHighLow          459.48   (2.3%)    468.22   (2.1%)    1.9% ( -2% -   6%)
PKLookup            251.05   (4.4%)    259.80   (2.8%)    3.5% ( -3% -  11%)

> Skip data should be inlined into the postings lists
> ---------------------------------------------------
>
>                 Key: LUCENE-2962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2013
>         Attachments: proposal.txt
>
>
> Today, we store all skip data as a separate blob at the end of a given term's
> postings (if that term occurs in enough docs to warrant skip data).
> But this adds overhead during decoding -- we have to seek to a different
> place for the initial load, we have to init separate readers, we have to seek
> again while using the lower levels of the skip data, etc. Also, we have to
> fully decode all skip information even if we are not going to use it (eg if I
> only want docIDs, I still must decode position offset and lastPayloadLength).
> If instead we interleaved skip data into the postings file, we could keep it
> local, and "private" to each file that needs skipping. This should make it
> less costly to init and then use the skip data, which'd be a good perf gain
> for eg PhraseQuery, AndQuery.
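
For reference, the per-category columns in the skip-frequency summary (#num_called, #max_skip_len, #tot_skip_len, #avg_skip_len, #std_dev_skip_len) could be accumulated roughly as below. This is only a minimal sketch, not the actual luceneutil change: the class SkipLenStats and its methods are hypothetical names, the caller is assumed to pass docUpto as seen before and after each advance(target), and the standard deviation is assumed to be the population form, since the comment does not say which one was used.

```java
/** Hypothetical accumulator for skip lengths; one instance per query category. */
final class SkipLenStats {
    private long calls;   // number of advance(target) calls observed
    private long max;     // largest single skip length
    private long total;   // sum of all skip lengths
    private double sumSq; // sum of squared skip lengths, for the variance

    /** Records one advance(): docUpto before skipping and newDocUpto after it. */
    void record(long docUpto, long newDocUpto) {
        long skipLen = newDocUpto - docUpto; // 0 when the skipper did not move
        calls++;
        total += skipLen;
        sumSq += (double) skipLen * skipLen;
        if (skipLen > max) {
            max = skipLen;
        }
    }

    long calls() { return calls; }
    long max()   { return max; }
    long total() { return total; }

    double avg() {
        return calls == 0 ? 0.0 : (double) total / calls;
    }

    /** Population standard deviation (an assumption, see the note above). */
    double stdDev() {
        if (calls == 0) {
            return 0.0;
        }
        double mean = avg();
        double variance = sumSq / calls - mean * mean;
        return Math.sqrt(Math.max(0.0, variance)); // guard against tiny negative rounding
    }

    public static void main(String[] args) {
        SkipLenStats stats = new SkipLenStats();
        stats.record(0, 0);     // skipper did not move
        stats.record(0, 128);   // skipped 128 docs
        stats.record(128, 384); // skipped 256 docs
        System.out.println(stats.calls() + " " + stats.max() + " " + stats.total()
                + " " + stats.avg() + " " + stats.stdDev());
    }
}
```

As a sanity check against the table, #avg_skip_len is simply #tot_skip_len divided by #num_called (for and_high_high, 110997027 / 18021935 is about 6.159).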