[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-4198:
---
Attachment: TestSimpleTextPostingsFormat.asf.nightly.master.1466.consoleText.excerpt.txt
            TestSimpleTextPostingsFormat.sarowe.jenkins.nightly.master.681.consoleText.excerpt.txt

> Allow codecs to index term impacts
> --
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: LUCENE-4198-BMW.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198_flush.patch, TestSimpleTextPostingsFormat.asf.nightly.master.1466.consoleText.excerpt.txt, TestSimpleTextPostingsFormat.sarowe.jenkins.nightly.master.681.consoleText.excerpt.txt
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite' your index; he passes the IndexReader and Similarity to it.
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another problem is that it needs access to the term and collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment in a branch with these changes and see if we can make it work well.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4198: - Attachment: LUCENE-4198.patch
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4198: - Attachment: (was: LUCENE-4198.patch)
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4198: - Attachment: LUCENE-4198.patch
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198-BMW.patch

To give some insight into future work on scorers, here is an untested patch (the only tests for now are that luceneutil gives the same hits back) that implements some ideas from the BMW paper.

The new {{BlockMaxConjunctionScorer}} skips blocks whose sum of max scores is less than the max competitive score, and also skips hits when the score of the max scoring clause is less than the minimum required score minus max scores of other clauses.

{{WANDScorer}} uses the block max scores to get an upper bound of the score of the current candidate, which already helps {{OrHighLow}}. It could also skip over blocks when the sum of the max scores is not competitive, but the impl needs a bit more work than for conjunctions.

{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
LowTerm                     2365.07   (2.8%)    2313.92   (2.5%)     -2.2%  (  -7% -    3%)
OrHighMed                     73.78   (2.9%)      72.70   (2.5%)     -1.5%  (  -6% -    4%)
HighTermDayOfYearSort         88.44  (11.4%)      87.15  (13.0%)     -1.5%  ( -23% -   25%)
HighTerm                     650.28   (5.8%)     646.81   (5.7%)     -0.5%  ( -11% -   11%)
Respell                      228.08   (2.5%)     227.84   (2.4%)     -0.1%  (  -4% -    4%)
MedTerm                     1189.63   (4.2%)    1189.27   (4.6%)     -0.0%  (  -8% -    9%)
MedSpanNear                   12.21   (5.0%)      12.24   (5.5%)      0.2%  (  -9% -   11%)
HighSpanNear                   7.26   (5.5%)       7.28   (5.8%)      0.2%  ( -10% -   12%)
Wildcard                     108.43   (7.0%)     108.95   (6.8%)      0.5%  ( -12% -   15%)
Prefix3                      128.80   (8.1%)     129.46   (7.8%)      0.5%  ( -14% -   17%)
HighTermMonthSort            172.27   (8.0%)     173.28   (8.0%)      0.6%  ( -14% -   18%)
Fuzzy2                       104.86   (5.7%)     105.79   (6.5%)      0.9%  ( -10% -   13%)
LowSloppyPhrase               14.80   (5.6%)      14.93   (6.1%)      0.9%  ( -10% -   13%)
LowSpanNear                   95.06   (3.4%)      96.07   (4.2%)      1.1%  (  -6% -    8%)
HighSloppyPhrase               3.96   (8.6%)       4.02   (9.7%)      1.6%  ( -15% -   21%)
IntNRQ                        29.80   (7.0%)      30.50   (6.9%)      2.4%  ( -10% -   17%)
Fuzzy1                       281.25   (4.8%)     288.77   (9.5%)      2.7%  ( -11% -   17%)
MedSloppyPhrase               53.95   (8.0%)      55.43   (9.0%)      2.7%  ( -13% -   21%)
OrHighHigh                    23.86   (4.1%)      24.70   (2.7%)      3.5%  (  -3% -   10%)
MedPhrase                     42.45   (2.2%)      44.10   (3.2%)      3.9%  (  -1% -    9%)
LowPhrase                     19.57   (2.7%)      20.47   (3.6%)      4.6%  (  -1% -   11%)
HighPhrase                    15.76   (4.1%)      16.91   (5.3%)      7.3%  (  -1% -   17%)
OrHighLow                    209.91   (2.3%)     261.10   (3.5%)     24.4%  (  18% -   30%)
AndHighHigh                   27.22   (2.1%)      47.66   (5.1%)     75.1%  (  66% -   84%)
AndHighLow                   514.84   (3.5%)     920.46   (6.0%)     78.8%  (  66% -   91%)
AndHighMed                    56.15   (2.0%)     107.60   (5.4%)     91.6%  (  82% -  101%)
{noformat}
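The two skipping rules described for {{BlockMaxConjunctionScorer}} in the message above can be written as plain predicates. This is only a minimal sketch and not code from the attached patch: the class {{BlockMaxSkipSketch}} and its method names are hypothetical, and the block rule is written against the minimum competitive score, which is an assumption about what the comment intends.

```java
/** Hypothetical helper illustrating the two skipping rules; not part of the patch. */
final class BlockMaxSkipSketch {

    private BlockMaxSkipSketch() {}

    /** Sums per-clause score upper bounds in double precision to limit rounding error. */
    static double sum(float[] maxScores) {
        double total = 0;
        for (float s : maxScores) {
            total += s;
        }
        return total;
    }

    /**
     * Block rule: if even the sum of every clause's block max score is below the
     * minimum competitive score, no document in the block can be competitive.
     */
    static boolean canSkipBlock(float[] clauseBlockMaxScores, float minCompetitiveScore) {
        return sum(clauseBlockMaxScores) < minCompetitiveScore;
    }

    /**
     * Hit rule: once the clause with the highest max score has been scored, the hit
     * can be dropped if that score plus the max scores of the other clauses still
     * falls short of the minimum competitive score.
     */
    static boolean canSkipHit(float leadScore, float[] otherClauseMaxScores,
                              float minCompetitiveScore) {
        return leadScore < minCompetitiveScore - sum(otherClauseMaxScores);
    }
}
```

For instance, with block max scores of 1.0 and 0.5 and a minimum competitive score of 2.0, the block rule allows skipping the whole block without decoding any postings.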
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

I have taken another approach. The issue with {{setMinCompetitiveScore}} is that it usually cannot be efficiently leveraged to speed up e.g. conjunctions. So I went with implementing ideas from the block-max WAND (BMW) paper (http://engineering.nyu.edu/~suel/papers/bmw.pdf): the patch introduces a new {{ImpactsEnum}} which extends {{PostingsEnum}} and introduces two APIs instead of {{setMinCompetitiveScore}}:
- {{int advanceShallow(int target)}} to get scoring information for documents that start at {{target}}. The benefit compared to {{advance}} is that it only advances the skip list reader, which is much cheaper: no decoding is happening.
- {{float getMaxScore(int upTo)}} which gives information about scores for doc ids between the last target to {{advanceShallow}} and {{upTo}}, both included.

Currently only TermScorer leverages this, but the benefit is that we could add these APIs to Scorer as well in a follow-up issue so that WANDScorer and ConjunctionScorer could leverage them. I built a prototype already to make sure that there is an actual speedup for some queries, but I'm leaving it to a follow-up issue as indexing impacts is already challenging on its own.

One thing that it made me change though is that the new patch also stores all impacts on the first level, which is written every 128 documents. This seemed important for conjunctions, since the maximum score on a given block is not always reached, unlike term queries, which match all documents in the block. This makes it more important to have good bounds on the score with conjunctions than it is with term queries. The disk overhead is still acceptable to me: the wikimedium10 index is only 1.4% larger overall, and postings alone (the .doc file) are only 3.1% larger.
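To make the contract of the two methods concrete, here is a toy in-memory model. It is a sketch under stated assumptions rather than the patch's implementation: {{ToyImpactsPostings}}, its fields and {{collectCompetitive}} are hypothetical names, per-document scores are precomputed, and the value returned by {{advanceShallow}} (the last doc ID of the selected block) is an assumption, since the message does not say what the method returns.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy in-memory postings list with per-block score upper bounds.
 * Hypothetical illustration only; it does not use or mirror Lucene classes.
 */
final class ToyImpactsPostings {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;       // sorted, distinct doc IDs
    private final float[] scores;   // non-negative score of each posting
    private final int blockSize;
    private final float[] blockMax; // maximum score within each block
    private int shallowBlock = 0;   // block selected by the last advanceShallow call

    ToyImpactsPostings(int[] docs, float[] scores, int blockSize) {
        this.docs = docs;
        this.scores = scores;
        this.blockSize = blockSize;
        this.blockMax = new float[(docs.length + blockSize - 1) / blockSize];
        for (int i = 0; i < docs.length; i++) {
            int b = i / blockSize;
            blockMax[b] = Math.max(blockMax[b], scores[i]);
        }
    }

    private int firstDoc(int block) {
        return docs[block * blockSize];
    }

    private int lastDoc(int block) {
        return docs[Math.min(docs.length, (block + 1) * blockSize) - 1];
    }

    /**
     * Moves only the block pointer; no postings are decoded. Assumed return value:
     * the last doc ID of the block that may contain target, or NO_MORE_DOCS.
     */
    int advanceShallow(int target) {
        while (shallowBlock < blockMax.length && lastDoc(shallowBlock) < target) {
            shallowBlock++;
        }
        return shallowBlock < blockMax.length ? lastDoc(shallowBlock) : NO_MORE_DOCS;
    }

    /** Upper bound on scores of docs between the last shallow target and upTo, inclusive. */
    float getMaxScore(int upTo) {
        float max = 0f;
        for (int b = shallowBlock; b < blockMax.length && firstDoc(b) <= upTo; b++) {
            max = Math.max(max, blockMax[b]);
        }
        return max;
    }

    /** Collects docs scoring at least minScore, skipping non-competitive blocks wholesale. */
    List<Integer> collectCompetitive(float minScore) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        while (i < docs.length) {
            int blockEnd = advanceShallow(docs[i]);
            int nextBlockStart = (i / blockSize + 1) * blockSize;
            if (getMaxScore(blockEnd) < minScore) {
                i = nextBlockStart; // the whole block cannot produce a hit
                continue;
            }
            for (; i < Math.min(nextBlockStart, docs.length); i++) {
                if (scores[i] >= minScore) {
                    hits.add(docs[i]);
                }
            }
        }
        return hits;
    }
}
```

In this model a term scorer only touches the postings of blocks whose upper bound is competitive, which is the kind of skipping the message attributes to TermScorer.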
Here are the benchmark results:

{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndHighLow                  1128.91   (3.5%)     875.48   (2.3%)    -22.4%  ( -27% -  -17%)
AndHighMed                   409.67   (2.0%)     331.98   (1.7%)    -19.0%  ( -22% -  -15%)
OrHighMed                    264.99   (3.5%)     229.15   (3.0%)    -13.5%  ( -19% -   -7%)
OrHighLow                    111.47   (4.5%)      98.00   (3.2%)    -12.1%  ( -18% -   -4%)
OrHighHigh                    34.88   (4.2%)      31.69   (4.0%)     -9.1%  ( -16% -   -1%)
OrNotHighLow                1373.74   (5.2%)    1291.72   (4.1%)     -6.0%  ( -14% -    3%)
LowPhrase                     78.14   (1.6%)      75.28   (1.2%)     -3.7%  (  -6% -    0%)
MedPhrase                     47.49   (1.6%)      45.92   (1.2%)     -3.3%  (  -5% -    0%)
LowSloppyPhrase              208.43   (2.8%)     202.37   (2.7%)     -2.9%  (  -8% -    2%)
Fuzzy1                       300.99   (7.7%)     292.78   (8.0%)     -2.7%  ( -17% -   13%)
LowSpanNear                   62.73   (1.4%)      61.09   (1.3%)     -2.6%  (  -5% -    0%)
Fuzzy2                       188.37   (7.9%)     184.16   (6.7%)     -2.2%  ( -15% -   13%)
MedSpanNear                   57.41   (1.8%)      56.17   (1.5%)     -2.2%  (  -5% -    1%)
MedSloppyPhrase               23.21   (2.3%)      22.73   (2.3%)     -2.1%  (  -6% -    2%)
HighPhrase                    48.75   (3.2%)      47.80   (3.6%)     -1.9%  (  -8% -    4%)
HighSpanNear                  40.04   (2.9%)      39.35   (2.7%)     -1.7%  (  -7% -    4%)
HighTermMonthSort            228.21   (8.4%)     224.66   (7.9%)     -1.6%  ( -16% -   16%)
HighSloppyPhrase              25.96   (2.8%)      25.61   (3.0%)     -1.4%  (  -6% -    4%)
Respell                      284.85   (3.7%)     282.42   (4.0%)     -0.9%  (  -8% -    7%)
IntNRQ                        18.87   (5.3%)      18.86   (6.8%)     -0.1%  ( -11% -   12%)
Wildcard                      85.50   (5.0%)      86.79   (4.0%)      1.5%  (  -7% -   11%)
Prefix3                      137.41   (6.5%)     141.61   (4.9%)      3.1%  (  -7% -   15%)
HighTermDayOfYearSort        116.58   (6.3%)     121.38   (7.2%)      4.1%  (  -8% -   18%)
AndHighHigh                   37.64   (1.5%)     118.12   (6.7%)    213.8%  ( 202% -  225%)
LowTerm                      909.13   (2.2%)    3379.38  (11.2%)    271.7%  ( 252% -  291%)
OrNotHighMed                 196.21   (1.7%)    1509.92  (28.9%)    669.6%  ( 627% -  712%)
MedTerm                      305.82   (1.7%)    2897.01  (42.5%)    847.3%  ( 789% -  907%)
HighTerm                     108.94   (1.7%)    1191.54  (61.3%)    993.8%  ( 915% - 1075%)
OrHighNotMed                  81.83   (0.5%)    1082.94  (63.2%)   1223.5%  (1153% - 12
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

New patch. This time it has tests, does basic testing in CheckIndex and does not clone too much. Results are very good on queries that score on a single term, almost too good. I'm currently thinking about how we could change the API to have something that is easier to propagate with boolean queries, even if it means term queries can't be as fast.

{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndHighLow                  2050.37   (4.2%)    1745.54   (2.0%)    -14.9%  ( -20% -   -9%)
OrHighLow                    922.62   (3.7%)     844.54   (2.4%)     -8.5%  ( -14% -   -2%)
AndHighMed                   277.85   (1.8%)     258.11   (2.6%)     -7.1%  ( -11% -   -2%)
OrNotHighLow                1105.41   (3.6%)    1044.69   (2.0%)     -5.5%  ( -10% -    0%)
AndHighHigh                  128.97   (1.1%)     121.89   (2.7%)     -5.5%  (  -9% -   -1%)
Fuzzy2                       166.62   (6.2%)     158.38   (6.3%)     -4.9%  ( -16% -    8%)
OrHighMed                    177.56   (2.3%)     170.05   (1.9%)     -4.2%  (  -8% -    0%)
Fuzzy1                       199.16   (4.4%)     193.05   (5.5%)     -3.1%  ( -12% -    7%)
MedSloppyPhrase               53.92   (2.2%)      52.40   (2.3%)     -2.8%  (  -7% -    1%)
LowPhrase                    201.13   (1.7%)     195.87   (1.0%)     -2.6%  (  -5% -    0%)
LowSpanNear                  363.85   (3.0%)     355.07   (2.5%)     -2.4%  (  -7% -    3%)
HighPhrase                    62.68   (1.6%)      61.32   (1.2%)     -2.2%  (  -4% -    0%)
HighTermMonthSort            218.42   (9.8%)     214.35   (8.3%)     -1.9%  ( -18% -   18%)
MedSpanNear                   46.65   (1.4%)      45.89   (1.5%)     -1.6%  (  -4% -    1%)
MedPhrase                    178.02   (1.5%)     175.24   (1.2%)     -1.6%  (  -4% -    1%)
HighSpanNear                  10.21   (3.4%)      10.11   (3.4%)     -1.0%  (  -7% -    6%)
HighSloppyPhrase              32.32   (7.3%)      32.01   (7.1%)     -1.0%  ( -14% -   14%)
LowSloppyPhrase               18.01   (2.7%)      17.85   (2.7%)     -0.9%  (  -6% -    4%)
Respell                      320.99   (2.1%)     321.02   (2.4%)      0.0%  (  -4% -    4%)
IntNRQ                        29.29  (11.6%)      29.42  (12.5%)      0.4%  ( -21% -   27%)
Wildcard                     189.97   (4.6%)     191.87   (3.9%)      1.0%  (  -7% -    9%)
Prefix3                      166.43   (6.2%)     169.95   (5.4%)      2.1%  (  -8% -   14%)
OrHighHigh                    48.00   (3.7%)      49.09   (3.9%)      2.3%  (  -5% -   10%)
HighTermDayOfYearSort        146.88   (7.4%)     150.76   (8.0%)      2.6%  ( -11% -   19%)
LowTerm                      830.79   (2.6%)    2246.40   (9.9%)    170.4%  ( 153% -  187%)
OrNotHighMed                 180.11   (1.5%)    1454.55  (15.7%)    707.6%  ( 680% -  735%)
MedTerm                      216.16   (1.7%)    3834.73  (37.0%)   1674.0%  (1608% - 1742%)
HighTerm                     109.49   (2.0%)    1944.44  (45.3%)   1675.9%  (1597% - 1757%)
OrHighNotMed                  57.55   (1.1%)    1292.66  (57.7%)   2146.2%  (2064% - 2229%)
OrHighNotLow                  84.00   (1.1%)    1996.82  (75.4%)   2277.2%  (2176% - 2379%)
OrNotHighHigh                 58.22   (1.3%)    1479.53  (53.5%)   2441.4%  (2356% - 2528%)
OrHighNotHigh                 66.91   (1.2%)    2042.54  (55.1%)   2952.6%  (2862% - 3045%)
{noformat}
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

OK, new iteration. I integrated LUCENE-8116, started to fix corner cases and I've been looking into ways to make the API nicer. The current take is to add {{PostingsEnum.setMinCompetitiveScore}}, which defaults to a no-op, and {{TermsEnum.topPostings(SimScorer)}}, which returns a postings enum that should be able to skip low-scoring documents and delegates to {{TermsEnum.postings(null, PostingsEnum.FREQS)}} by default.

I still need to work on tests and stop creating a new IndexInput slice for every term at index time. I suppose I could implement getMergeInstance on {{Lucene70NormsProducer}} to reuse the same slice across invocations of getNorms on the same field. I'll keep working on this over the next few days.
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

I have been working on a prototype that adds skip data so that postings could know the best potential score for each block of documents. It would be nice to not make it Similarity-dependent so that Similarities that use the same norm encoding could still be switched at search time like today. So the current approach is to store the maximum freq per block when norms are disabled, or all competitive (freq,norm) pairs when norms are enabled. This leverages the work that has been done on similarities in order to make sure that scores do not decrease when freq increases and do not increase when the norm increases. This means that (freq,norm) is always more competitive than (freq-1,norm) or (freq,norm+1), so we don't need to store all (freq,norm) pairs, only competitive ones. At search time, the sim scorer is passed to the postings producer so that it can compute the maximum score of a block by computing the score for all competitive {{(freq,norm)}} pairs.

Note that the attached patch is a rough prototype, it is hacky and not everything compiles. I just did the bare minimum so that some basic tests and luceneutil can run. There is very little testing.

Some notes about the approach:
- This patch adds the assumption that (unsigned) greater norms produce equal or lower scores. I liked this better than adding a new API on Similarity so that it could tell us how to compare norms.
- Skip lists do not store the competitive (freq,norm) pairs on level 0 since it could take more storage than the postings block, only on level 1 and greater.
- I had to add norms producers to the postings consumers so that they could know about norms.
- Having to pass the sim scorer to the postings producer is a bit ugly but I couldn't figure out a way to make it nicer.
- The similarity API doesn't make it easy to integrate: it currently gives a {{score(docID, freq)}} API while we'd rather need a {{score(freq,norm)}} API, especially because this optimization only works if freq and norm are the only per-document parameters that can influence the score.

Here is what it gives on luceneutil when disabling total hit counts on both master and the patch:

{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndHighHigh                  127.39   (1.4%)     100.94   (2.4%)    -20.8%  ( -24% -  -17%)
AndHighMed                   240.66   (2.0%)     212.11   (1.3%)    -11.9%  ( -14% -   -8%)
OrHighMed                     76.60   (3.6%)      69.37   (2.3%)     -9.4%  ( -14% -   -3%)
OrHighHigh                    27.37   (3.9%)      24.78   (2.4%)     -9.4%  ( -15% -   -3%)
Fuzzy1                       328.61   (6.5%)     316.04   (5.4%)     -3.8%  ( -14% -    8%)
Wildcard                      56.88   (7.6%)      55.64  (10.0%)     -2.2%  ( -18% -   16%)
Fuzzy2                       144.68   (3.5%)     142.07   (5.8%)     -1.8%  ( -10% -    7%)
Prefix3                      372.69   (6.1%)     366.43   (7.7%)     -1.7%  ( -14% -   12%)
HighTermDayOfYearSort        132.88   (6.6%)     131.18   (7.7%)     -1.3%  ( -14% -   13%)
LowSpanNear                   53.14   (1.8%)      52.48   (1.9%)     -1.2%  (  -4% -    2%)
HighTermMonthSort            109.37   (7.8%)     108.12   (7.1%)     -1.1%  ( -14% -   14%)
LowSloppyPhrase               54.79   (1.2%)      54.20   (1.1%)     -1.1%  (  -3% -    1%)
Respell                      293.10   (2.9%)     290.77   (5.7%)     -0.8%  (  -9% -    8%)
HighSloppyPhrase              35.60   (1.6%)      35.33   (1.6%)     -0.8%  (  -3% -    2%)
OrNotHighLow                1686.91   (3.4%)    1675.46   (1.8%)     -0.7%  (  -5% -    4%)
HighPhrase                    24.98   (1.9%)      24.82   (1.7%)     -0.6%  (  -4% -    3%)
MedSpanNear                  228.02   (3.4%)     226.69   (3.6%)     -0.6%  (  -7% -    6%)
MedSloppyPhrase               46.13   (1.4%)      45.87   (1.3%)     -0.6%  (  -3% -    2%)
MedPhrase                    642.58   (3.7%)     639.51   (3.1%)     -0.5%  (  -6% -    6%)
LowPhrase                     82.99   (2.1%)      82.63   (1.6%)     -0.4%  (  -3% -    3%)
HighSpanNear                  34.77   (2.8%)      34.66   (3.1%)     -0.3%  (  -5% -    5%)
IntNRQ                        32.59  (15.2%)      32.61  (14.9%)      0.1%  ( -26% -   35%)
AndHighLow                  1719.37   (3.8%)    1915.66   (2.8%)     11.4%  (   4% -   18%)
OrHighLow                   1290.65   (3.1%)    1808.66   (3.7%)     40.1%  (  32% -   48%)
LowTerm                      873.82   (3.1%)    1527.34   (7.2%)     74.8%  (  62% -   87%)
OrNotHighMed                 285.74   (2.5%)
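
The idea of storing only competitive (freq,norm) pairs, described in the message above, can be sketched as follows. This is a hedged illustration and not the patch's code: {{CompetitiveImpacts}}, {{Impact}} and {{FreqNormScorer}} are hypothetical names, and the scorer is assumed to satisfy the stated monotonicity (scores never decrease with freq and never increase with the unsigned norm).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of keeping only competitive (freq, norm) pairs; not code from the patch. */
final class CompetitiveImpacts {

    private CompetitiveImpacts() {}

    /** One (freq, norm) pair; the norm is compared as an unsigned long. */
    static final class Impact {
        final int freq;
        final long norm;

        Impact(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /** Hypothetical scoring hook that takes (freq, norm) rather than (docID, freq). */
    interface FreqNormScorer {
        float score(int freq, long norm);
    }

    /**
     * Drops every pair that is dominated by another pair with a freq that is at
     * least as high and an unsigned norm that is at least as low.
     */
    static List<Impact> competitive(List<Impact> all) {
        List<Impact> sorted = new ArrayList<>(all);
        // Highest freq first; for equal freqs, lowest unsigned norm first.
        sorted.sort((a, b) -> a.freq != b.freq
            ? Integer.compare(b.freq, a.freq)
            : Long.compareUnsigned(a.norm, b.norm));
        List<Impact> kept = new ArrayList<>();
        for (Impact impact : sorted) {
            // Every kept pair has freq >= impact.freq, and the last kept pair has the
            // lowest norm so far, so impact survives only with a strictly lower norm.
            if (kept.isEmpty()
                || Long.compareUnsigned(impact.norm, kept.get(kept.size() - 1).norm) < 0) {
                kept.add(impact);
            }
        }
        return kept;
    }

    /** Upper bound of the score over a block, computed from its stored pairs. */
    static float maxScore(List<Impact> impacts, FreqNormScorer scorer) {
        float max = 0f;
        for (Impact impact : impacts) {
            max = Math.max(max, scorer.score(impact.freq, impact.norm));
        }
        return max;
    }
}
```

Because the pairs are independent of any particular scoring function, the same stored data can in principle serve any scorer that respects the monotonicity assumption, which matches the stated goal of not making the skip data Similarity-dependent.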
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4198:

Attachment: LUCENE-4198_flush.patch

Here's a patch fixing how we compute stats in FreqProxTermsWriter, but the codec API is unchanged. Next I'll look at merge, which is trickier, and then see about changing the codec API.