[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-02-01 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-4198:
---
Attachment: 
TestSimpleTextPostingsFormat.asf.nightly.master.1466.consoleText.excerpt.txt

TestSimpleTextPostingsFormat.sarowe.jenkins.nightly.master.681.consoleText.excerpt.txt

> Allow codecs to index term impacts
> --
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: LUCENE-4198-BMW.patch, LUCENE-4198.patch, 
> LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, 
> LUCENE-4198_flush.patch, 
> TestSimpleTextPostingsFormat.asf.nightly.master.1466.consoleText.excerpt.txt, 
> TestSimpleTextPostingsFormat.sarowe.jenkins.nightly.master.681.consoleText.excerpt.txt
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index; he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking of 
> experimenting in a branch with these changes to see if we can make it work 
> well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch




[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: (was: LUCENE-4198.patch)




[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch




[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198-BMW.patch

To give some insight into future work on scorers, here is an untested patch 
(the only tests for now are that luceneutil gives the same hits back) that 
implements some ideas from the BMW paper.

The new {{BlockMaxConjunctionScorer}} skips blocks whose sum of max scores is 
less than the max competitive score, and also skips hits when the score of the 
max scoring clause is less than the minimum required score minus max scores of 
other clauses.

{{WANDScorer}} uses the block max scores to get an upper bound of the score of 
the current candidate, which already helps {{OrHighLow}}. It could also skip 
over blocks when the sum of the max scores is not competitive, but the impl 
needs a bit more work than for conjunctions.
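
The two skipping conditions described above for conjunctions can be sketched as follows. This is a minimal illustration and not the code from LUCENE-4198-BMW.patch: the class name {{BlockMaxSkipSketch}}, both method names and all parameters are hypothetical, and the sketch assumes that both thresholds in the description refer to the same minimum competitive score.

```java
/** Illustrative only; not the code from the attached patch. All names are hypothetical. */
final class BlockMaxSkipSketch {

    /**
     * A whole block can be skipped when even the sum of the per-clause
     * block maximum scores cannot reach the minimum competitive score.
     */
    static boolean canSkipBlock(float[] blockMaxScores, float minCompetitiveScore) {
        double sum = 0;
        for (float s : blockMaxScores) {
            sum += s;
        }
        return sum < minCompetitiveScore;
    }

    /**
     * A single hit can be skipped when the actual score of the max-scoring
     * clause is below the minimum competitive score minus the maximum
     * scores of all other clauses.
     */
    static boolean canSkipHit(float maxClauseScore, float[] otherMaxScores,
                              float minCompetitiveScore) {
        double others = 0;
        for (float s : otherMaxScores) {
            others += s;
        }
        return maxClauseScore < minCompetitiveScore - others;
    }

    public static void main(String[] args) {
        float[] blockMax = {1.2f, 0.5f};
        System.out.println(canSkipBlock(blockMax, 2.0f));             // true: 1.7 < 2.0
        System.out.println(canSkipBlock(blockMax, 1.5f));             // false: 1.7 >= 1.5
        System.out.println(canSkipHit(0.4f, new float[] {0.5f}, 1f)); // true: 0.4 < 0.5
        System.out.println(canSkipHit(0.6f, new float[] {0.5f}, 1f)); // false: 0.6 >= 0.5
    }
}
```

The sums are accumulated in {{double}} so that rounding in the bound cannot make a competitive block look non-competitive as easily; a real scorer would need to pick its rounding direction deliberately.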

{noformat}
                 Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
              LowTerm       2365.07   (2.8%)    2313.92   (2.5%)     -2.2% (  -7% -    3%)
            OrHighMed         73.78   (2.9%)      72.70   (2.5%)     -1.5% (  -6% -    4%)
HighTermDayOfYearSort         88.44  (11.4%)      87.15  (13.0%)     -1.5% ( -23% -   25%)
             HighTerm        650.28   (5.8%)     646.81   (5.7%)     -0.5% ( -11% -   11%)
              Respell        228.08   (2.5%)     227.84   (2.4%)     -0.1% (  -4% -    4%)
              MedTerm       1189.63   (4.2%)    1189.27   (4.6%)     -0.0% (  -8% -    9%)
          MedSpanNear         12.21   (5.0%)      12.24   (5.5%)      0.2% (  -9% -   11%)
         HighSpanNear          7.26   (5.5%)       7.28   (5.8%)      0.2% ( -10% -   12%)
             Wildcard        108.43   (7.0%)     108.95   (6.8%)      0.5% ( -12% -   15%)
              Prefix3        128.80   (8.1%)     129.46   (7.8%)      0.5% ( -14% -   17%)
    HighTermMonthSort        172.27   (8.0%)     173.28   (8.0%)      0.6% ( -14% -   18%)
               Fuzzy2        104.86   (5.7%)     105.79   (6.5%)      0.9% ( -10% -   13%)
      LowSloppyPhrase         14.80   (5.6%)      14.93   (6.1%)      0.9% ( -10% -   13%)
          LowSpanNear         95.06   (3.4%)      96.07   (4.2%)      1.1% (  -6% -    8%)
     HighSloppyPhrase          3.96   (8.6%)       4.02   (9.7%)      1.6% ( -15% -   21%)
               IntNRQ         29.80   (7.0%)      30.50   (6.9%)      2.4% ( -10% -   17%)
               Fuzzy1        281.25   (4.8%)     288.77   (9.5%)      2.7% ( -11% -   17%)
      MedSloppyPhrase         53.95   (8.0%)      55.43   (9.0%)      2.7% ( -13% -   21%)
           OrHighHigh         23.86   (4.1%)      24.70   (2.7%)      3.5% (  -3% -   10%)
            MedPhrase         42.45   (2.2%)      44.10   (3.2%)      3.9% (  -1% -    9%)
            LowPhrase         19.57   (2.7%)      20.47   (3.6%)      4.6% (  -1% -   11%)
           HighPhrase         15.76   (4.1%)      16.91   (5.3%)      7.3% (  -1% -   17%)
            OrHighLow        209.91   (2.3%)     261.10   (3.5%)     24.4% (  18% -   30%)
          AndHighHigh         27.22   (2.1%)      47.66   (5.1%)     75.1% (  66% -   84%)
           AndHighLow        514.84   (3.5%)     920.46   (6.0%)     78.8% (  66% -   91%)
           AndHighMed         56.15   (2.0%)     107.60   (5.4%)     91.6% (  82% -  101%)
{noformat}




[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

I have taken another approach. The issue with {{setMinCompetitiveScore}} is 
that it usually cannot be leveraged efficiently to speed up e.g. conjunctions. 
So I went with implementing ideas from the block-max WAND (BMW) paper 
(http://engineering.nyu.edu/~suel/papers/bmw.pdf): the patch introduces a new 
{{ImpactsEnum}} which extends {{PostingsEnum}} and introduces two APIs instead 
of {{setMinCompetitiveScore}}:
 - {{int advanceShallow(int target)}} to get scoring information for documents 
that start at {{target}}. The benefit compared to {{advance}} is that it only 
advances the skip list reader, which is much cheaper: no decoding happens.
 - {{float getMaxScore(int upTo)}}, which gives information about scores for doc 
IDs between the last target passed to {{advanceShallow}} and {{upTo}}, both 
included.
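
To make the contract of these two methods concrete, here is a toy model of them over in-memory skip data. It is only a sketch under stated assumptions, not the {{ImpactsEnum}} from the patch: the class {{ToyImpactsEnum}}, its fields and its constructor are hypothetical, and the meaning of the {{int}} returned by {{advanceShallow}} (taken here to be the last doc ID of the block that contains {{target}}) is an assumption, since the comment above does not say what it represents.

```java
/** Toy model of the two calls; NOT the ImpactsEnum from the patch. */
final class ToyImpactsEnum {
    private final int[] blockLastDoc;    // skip data: last doc ID of each block
    private final float[] blockMaxScore; // skip data: maximum score within each block
    private int block = 0;               // block selected by the last advanceShallow call

    ToyImpactsEnum(int[] blockLastDoc, float[] blockMaxScore) {
        this.blockLastDoc = blockLastDoc;
        this.blockMaxScore = blockMaxScore;
    }

    /**
     * Moves only the skip pointer to the block containing target; no postings
     * are decoded. Returns the last doc ID of that block (assumed semantics).
     * Targets beyond the last block are clamped to the last block in this toy.
     */
    int advanceShallow(int target) {
        while (block < blockLastDoc.length - 1 && blockLastDoc[block] < target) {
            block++;
        }
        return blockLastDoc[block];
    }

    /** Upper bound of the score for doc IDs from the last shallow target up to upTo, inclusive. */
    float getMaxScore(int upTo) {
        float max = 0f; // assumes non-negative scores
        for (int b = block; b < blockLastDoc.length; b++) {
            max = Math.max(max, blockMaxScore[b]);
            if (blockLastDoc[b] >= upTo) {
                break; // this block already covers upTo
            }
        }
        return max;
    }

    public static void main(String[] args) {
        ToyImpactsEnum e = new ToyImpactsEnum(
            new int[] {127, 255, 383}, new float[] {1.0f, 3.5f, 2.0f});
        int blockEnd = e.advanceShallow(130);
        float bound = e.getMaxScore(blockEnd);
        // A scorer could skip straight to blockEnd + 1 if bound is not competitive.
        System.out.println(blockEnd + " " + bound);
    }
}
```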

Currently only TermScorer leverages this, but the benefit is that we could add 
these APIs to Scorer as well in a follow-up issue so that WANDScorer and 
ConjunctionScorer could leverage them. I built a prototype already to make sure 
that there is an actual speedup for some queries, but I'm leaving it to a 
follow-up issue as indexing impacts is already challenging on its own. One 
thing that it made me change though is that the new patch also stores all 
impacts on the first level, which is written every 128 documents. This seemed 
important for conjunctions, since the maximum score on a given block is not 
always reached, unlike term queries, which match all documents in the block. 
This makes good score bounds more important for conjunctions than for term 
queries. The disk overhead is still acceptable to me: the wikimedium10 index is 
only 1.4% larger overall, and postings alone (the .doc file) are only 3.1% 
larger.

Here are the benchmark results:

{noformat}
                 Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
           AndHighLow       1128.91   (3.5%)     875.48   (2.3%)    -22.4% ( -27% -  -17%)
           AndHighMed        409.67   (2.0%)     331.98   (1.7%)    -19.0% ( -22% -  -15%)
            OrHighMed        264.99   (3.5%)     229.15   (3.0%)    -13.5% ( -19% -   -7%)
            OrHighLow        111.47   (4.5%)      98.00   (3.2%)    -12.1% ( -18% -   -4%)
           OrHighHigh         34.88   (4.2%)      31.69   (4.0%)     -9.1% ( -16% -   -1%)
         OrNotHighLow       1373.74   (5.2%)    1291.72   (4.1%)     -6.0% ( -14% -    3%)
            LowPhrase         78.14   (1.6%)      75.28   (1.2%)     -3.7% (  -6% -    0%)
            MedPhrase         47.49   (1.6%)      45.92   (1.2%)     -3.3% (  -5% -    0%)
      LowSloppyPhrase        208.43   (2.8%)     202.37   (2.7%)     -2.9% (  -8% -    2%)
               Fuzzy1        300.99   (7.7%)     292.78   (8.0%)     -2.7% ( -17% -   13%)
          LowSpanNear         62.73   (1.4%)      61.09   (1.3%)     -2.6% (  -5% -    0%)
               Fuzzy2        188.37   (7.9%)     184.16   (6.7%)     -2.2% ( -15% -   13%)
          MedSpanNear         57.41   (1.8%)      56.17   (1.5%)     -2.2% (  -5% -    1%)
      MedSloppyPhrase         23.21   (2.3%)      22.73   (2.3%)     -2.1% (  -6% -    2%)
           HighPhrase         48.75   (3.2%)      47.80   (3.6%)     -1.9% (  -8% -    4%)
         HighSpanNear         40.04   (2.9%)      39.35   (2.7%)     -1.7% (  -7% -    4%)
    HighTermMonthSort        228.21   (8.4%)     224.66   (7.9%)     -1.6% ( -16% -   16%)
     HighSloppyPhrase         25.96   (2.8%)      25.61   (3.0%)     -1.4% (  -6% -    4%)
              Respell        284.85   (3.7%)     282.42   (4.0%)     -0.9% (  -8% -    7%)
               IntNRQ         18.87   (5.3%)      18.86   (6.8%)     -0.1% ( -11% -   12%)
             Wildcard         85.50   (5.0%)      86.79   (4.0%)      1.5% (  -7% -   11%)
              Prefix3        137.41   (6.5%)     141.61   (4.9%)      3.1% (  -7% -   15%)
HighTermDayOfYearSort        116.58   (6.3%)     121.38   (7.2%)      4.1% (  -8% -   18%)
          AndHighHigh         37.64   (1.5%)     118.12   (6.7%)    213.8% ( 202% -  225%)
              LowTerm        909.13   (2.2%)    3379.38  (11.2%)    271.7% ( 252% -  291%)
         OrNotHighMed        196.21   (1.7%)    1509.92  (28.9%)    669.6% ( 627% -  712%)
              MedTerm        305.82   (1.7%)    2897.01  (42.5%)    847.3% ( 789% -  907%)
             HighTerm        108.94   (1.7%)    1191.54  (61.3%)    993.8% ( 915% - 1075%)
         OrHighNotMed         81.83   (0.5%)    1082.94  (63.2%)   1223.5% (1153% - 

[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-05 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

New patch. This time it has tests, does basic testing in CheckIndex and does 
not clone too much.

Results are very good on queries that score on a single term, almost too good. 
I'm currently thinking about how we could change the API to have something that 
is easier to propagate with boolean queries, even if it means term queries 
can't be as fast.

{noformat}
                 Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
           AndHighLow       2050.37   (4.2%)    1745.54   (2.0%)    -14.9% ( -20% -   -9%)
            OrHighLow        922.62   (3.7%)     844.54   (2.4%)     -8.5% ( -14% -   -2%)
           AndHighMed        277.85   (1.8%)     258.11   (2.6%)     -7.1% ( -11% -   -2%)
         OrNotHighLow       1105.41   (3.6%)    1044.69   (2.0%)     -5.5% ( -10% -    0%)
          AndHighHigh        128.97   (1.1%)     121.89   (2.7%)     -5.5% (  -9% -   -1%)
               Fuzzy2        166.62   (6.2%)     158.38   (6.3%)     -4.9% ( -16% -    8%)
            OrHighMed        177.56   (2.3%)     170.05   (1.9%)     -4.2% (  -8% -    0%)
               Fuzzy1        199.16   (4.4%)     193.05   (5.5%)     -3.1% ( -12% -    7%)
      MedSloppyPhrase         53.92   (2.2%)      52.40   (2.3%)     -2.8% (  -7% -    1%)
            LowPhrase        201.13   (1.7%)     195.87   (1.0%)     -2.6% (  -5% -    0%)
          LowSpanNear        363.85   (3.0%)     355.07   (2.5%)     -2.4% (  -7% -    3%)
           HighPhrase         62.68   (1.6%)      61.32   (1.2%)     -2.2% (  -4% -    0%)
    HighTermMonthSort        218.42   (9.8%)     214.35   (8.3%)     -1.9% ( -18% -   18%)
          MedSpanNear         46.65   (1.4%)      45.89   (1.5%)     -1.6% (  -4% -    1%)
            MedPhrase        178.02   (1.5%)     175.24   (1.2%)     -1.6% (  -4% -    1%)
         HighSpanNear         10.21   (3.4%)      10.11   (3.4%)     -1.0% (  -7% -    6%)
     HighSloppyPhrase         32.32   (7.3%)      32.01   (7.1%)     -1.0% ( -14% -   14%)
      LowSloppyPhrase         18.01   (2.7%)      17.85   (2.7%)     -0.9% (  -6% -    4%)
              Respell        320.99   (2.1%)     321.02   (2.4%)      0.0% (  -4% -    4%)
               IntNRQ         29.29  (11.6%)      29.42  (12.5%)      0.4% ( -21% -   27%)
             Wildcard        189.97   (4.6%)     191.87   (3.9%)      1.0% (  -7% -    9%)
              Prefix3        166.43   (6.2%)     169.95   (5.4%)      2.1% (  -8% -   14%)
           OrHighHigh         48.00   (3.7%)      49.09   (3.9%)      2.3% (  -5% -   10%)
HighTermDayOfYearSort        146.88   (7.4%)     150.76   (8.0%)      2.6% ( -11% -   19%)
              LowTerm        830.79   (2.6%)    2246.40   (9.9%)    170.4% ( 153% -  187%)
         OrNotHighMed        180.11   (1.5%)    1454.55  (15.7%)    707.6% ( 680% -  735%)
              MedTerm        216.16   (1.7%)    3834.73  (37.0%)   1674.0% (1608% - 1742%)
             HighTerm        109.49   (2.0%)    1944.44  (45.3%)   1675.9% (1597% - 1757%)
         OrHighNotMed         57.55   (1.1%)    1292.66  (57.7%)   2146.2% (2064% - 2229%)
         OrHighNotLow         84.00   (1.1%)    1996.82  (75.4%)   2277.2% (2176% - 2379%)
        OrNotHighHigh         58.22   (1.3%)    1479.53  (53.5%)   2441.4% (2356% - 2528%)
        OrHighNotHigh         66.91   (1.2%)    2042.54  (55.1%)   2952.6% (2862% - 3045%)
{noformat}


[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-04 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

OK, new iteration. I integrated LUCENE-8116, started to fix corner cases, and 
have been looking into ways to make the API nicer. The current take is to add 
{{PostingsEnum.setMinCompetitiveScore}}, which defaults to a no-op, and 
{{TermsEnum.topPostings(SimScorer)}}, which returns a postings enum that should 
be able to skip low-scoring documents and delegates to 
{{TermsEnum.postings(null, PostingsEnum.FREQS)}} by default.
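
The "safe default" pattern described above can be sketched with stand-in classes. These are toy types, not Lucene's {{PostingsEnum}}, {{TermsEnum}} or {{SimScorer}}: the names {{ToyPostings}}, {{ToyTerms}} and {{ToySimScorer}}, the flag value and the {{score(freq, norm)}} signature are all hypothetical. The point is only that an implementation that overrides nothing stays correct, because the hint is ignored and {{topPostings}} falls back to plain postings.

```java
/** Toy stand-in for a postings enum; NOT Lucene's PostingsEnum. */
abstract class ToyPostings {
    static final int FREQS = 1; // toy flag value, chosen arbitrarily for this sketch

    abstract int nextDoc();

    /** Default is a no-op: ignoring the hint is always correct, just slower. */
    void setMinCompetitiveScore(float minScore) {}
}

/** Toy stand-in for a similarity scorer; the signature is an assumption. */
interface ToySimScorer {
    float score(int freq, long norm);
}

/** Toy stand-in for a terms enum; NOT Lucene's TermsEnum. */
abstract class ToyTerms {
    abstract ToyPostings postings(ToyPostings reuse, int flags);

    /**
     * Default delegates to plain postings with frequencies. A codec that
     * indexes impacts could override this to return postings that skip
     * documents whose score cannot be competitive.
     */
    ToyPostings topPostings(ToySimScorer scorer) {
        return postings(null, ToyPostings.FREQS);
    }
}
```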

I still need to work on tests and stop creating a new IndexInput slice for 
every term at index time. I suppose I could implement getMergeInstance on 
{{Lucene70NormsProducer}} to reuse the same slice across invocations of 
getNorms on the same field.

I'll keep working on this over the next few days.




[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2018-01-03 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4198:
-
Attachment: LUCENE-4198.patch

I have been working on a prototype that adds skip data so that postings could 
know the best potential score for each block of documents. It would be nice to 
not make it Similarity-dependent so that Similarities that use the same norm 
encoding could still be switched at search time like today. So the current 
approach is to store the maximum freq per block when norms are disabled, or all 
competitive (freq,norm) pairs when norms are enabled. This leverages the work 
that has been done on similarities in order to make sure that scores do not 
decrease when freq increases and do not increase when the norm increases. This 
means that (freq,norm) is always at least as competitive as (freq-1,norm) or 
(freq,norm+1), so we don't need to store all (freq,norm) pairs, only 
competitive ones. At search time, the sim scorer is passed to the postings 
producer so that it can compute the maximum score of a block by computing the 
score for all competitive {{(freq,norm)}} pairs.
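
The idea of keeping only competitive {{(freq,norm)}} pairs can be sketched as follows. This is a minimal illustration, not the patch's implementation: the class {{CompetitivePairs}}, its nested types and the {{score(freq, norm)}} callback are hypothetical names, and the sketch assumes, as stated above, that scores never decrease with freq and never increase with the norm compared as an unsigned value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the data structure used by the patch. */
final class CompetitivePairs {

    static final class Pair {
        final int freq;
        final long norm; // compared as an unsigned value
        Pair(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /** Stand-in for a similarity scorer that depends only on freq and norm. */
    interface Scorer {
        float score(int freq, long norm);
    }

    private final List<Pair> pairs = new ArrayList<>();

    /** True if a scores at least as high as b under the monotonicity assumption. */
    static boolean dominates(Pair a, Pair b) {
        return a.freq >= b.freq && Long.compareUnsigned(a.norm, b.norm) <= 0;
    }

    /** Adds a pair unless it is dominated; drops pairs that the new one dominates. */
    void add(int freq, long norm) {
        Pair candidate = new Pair(freq, norm);
        for (Pair p : pairs) {
            if (dominates(p, candidate)) {
                return; // an existing pair already bounds this one
            }
        }
        pairs.removeIf(p -> dominates(candidate, p));
        pairs.add(candidate);
    }

    int size() {
        return pairs.size();
    }

    /** Maximum score over the competitive pairs; assumes non-negative scores. */
    float maxScore(Scorer scorer) {
        float max = 0f;
        for (Pair p : pairs) {
            max = Math.max(max, scorer.score(p.freq, p.norm));
        }
        return max;
    }
}
```

Because the stored pairs do not depend on any particular scoring formula, the same block data can be scored with a different {{Scorer}} at search time, which is the property the paragraph above wants to preserve.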

Note that the attached patch is a rough prototype: it is hacky and not 
everything compiles. I just did the bare minimum so that some basic tests and 
luceneutil can run. There is very little testing. Some notes about the approach:
 - This patch adds the assumption that (unsigned) greater norms produce equal 
or lower scores. I liked this better than adding a new API on Similarity so 
that it could tell us how to compare norms.
 - Skip lists do not store the competitive (freq,norm) pairs on level 0, since 
that could take more storage than the postings block itself; they are stored 
only on level 1 and greater.
 - I had to add norms producers to the postings consumers so that they could 
know about norms.
 - Having to pass the sim scorer to the postings producer is a bit ugly but I 
couldn't figure out a way to make it nicer.
 - The similarity API doesn't make it easy to integrate: it currently gives a 
{{score(docID, freq)}} API while we'd rather need a {{score(freq,norm)}} API, 
especially because this optimization only works if freq and norm are the only 
per-document parameters that can influence the score.

Here is what it gives on luceneutil when disabling total hit counts on both 
master and the patch:

{noformat}
                 Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
          AndHighHigh        127.39   (1.4%)     100.94   (2.4%)    -20.8% ( -24% -  -17%)
           AndHighMed        240.66   (2.0%)     212.11   (1.3%)    -11.9% ( -14% -   -8%)
            OrHighMed         76.60   (3.6%)      69.37   (2.3%)     -9.4% ( -14% -   -3%)
           OrHighHigh         27.37   (3.9%)      24.78   (2.4%)     -9.4% ( -15% -   -3%)
               Fuzzy1        328.61   (6.5%)     316.04   (5.4%)     -3.8% ( -14% -    8%)
             Wildcard         56.88   (7.6%)      55.64  (10.0%)     -2.2% ( -18% -   16%)
               Fuzzy2        144.68   (3.5%)     142.07   (5.8%)     -1.8% ( -10% -    7%)
              Prefix3        372.69   (6.1%)     366.43   (7.7%)     -1.7% ( -14% -   12%)
HighTermDayOfYearSort        132.88   (6.6%)     131.18   (7.7%)     -1.3% ( -14% -   13%)
          LowSpanNear         53.14   (1.8%)      52.48   (1.9%)     -1.2% (  -4% -    2%)
    HighTermMonthSort        109.37   (7.8%)     108.12   (7.1%)     -1.1% ( -14% -   14%)
      LowSloppyPhrase         54.79   (1.2%)      54.20   (1.1%)     -1.1% (  -3% -    1%)
              Respell        293.10   (2.9%)     290.77   (5.7%)     -0.8% (  -9% -    8%)
     HighSloppyPhrase         35.60   (1.6%)      35.33   (1.6%)     -0.8% (  -3% -    2%)
         OrNotHighLow       1686.91   (3.4%)    1675.46   (1.8%)     -0.7% (  -5% -    4%)
           HighPhrase         24.98   (1.9%)      24.82   (1.7%)     -0.6% (  -4% -    3%)
          MedSpanNear        228.02   (3.4%)     226.69   (3.6%)     -0.6% (  -7% -    6%)
      MedSloppyPhrase         46.13   (1.4%)      45.87   (1.3%)     -0.6% (  -3% -    2%)
            MedPhrase        642.58   (3.7%)     639.51   (3.1%)     -0.5% (  -6% -    6%)
            LowPhrase         82.99   (2.1%)      82.63   (1.6%)     -0.4% (  -3% -    3%)
         HighSpanNear         34.77   (2.8%)      34.66   (3.1%)     -0.3% (  -5% -    5%)
               IntNRQ         32.59  (15.2%)      32.61  (14.9%)      0.1% ( -26% -   35%)
           AndHighLow       1719.37   (3.8%)    1915.66   (2.8%)     11.4% (   4% -   18%)
            OrHighLow       1290.65   (3.1%)    1808.66   (3.7%)     40.1% (  32% -   48%)
              LowTerm        873.82   (3.1%)    1527.34   (7.2%)     74.8% (  62% -   87%)
         OrNotHighMed        285.74   (2.5%)   

[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

2012-07-05 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-4198:


Attachment: LUCENE-4198_flush.patch

Here's a patch fixing how we compute stats in FreqProxTermsWriter, but the 
codec API is unchanged.

Next I'll look at merge, which is trickier, and then see about changing the 
codec API.

 Allow codecs to index term impacts
 --

 Key: LUCENE-4198
 URL: https://issues.apache.org/jira/browse/LUCENE-4198
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: core/index
Reporter: Robert Muir
 Attachments: LUCENE-4198_flush.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


