[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324101#comment-16324101 ]
Adrien Grand commented on LUCENE-4198:
--------------------------------------
I tested wikibigall as well, which has the benefit of not having artificially
truncated lengths like wikimedium:
{noformat}
Task                  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndHighLow                 1440.24   (3.0%)     794.43   (2.9%)   -44.8% ( -49% - -40%)
AndHighMed                  121.80   (1.4%)      94.75   (1.5%)   -22.2% ( -24% - -19%)
AndHighHigh                  56.62   (1.2%)      45.26   (1.4%)   -20.1% ( -22% - -17%)
OrHighMed                    93.16   (3.3%)      78.18   (3.1%)   -16.1% ( -21% - -9%)
OrHighLow                   827.62   (2.6%)     748.49   (3.5%)    -9.6% ( -15% - -3%)
OrHighHigh                   35.14   (4.4%)      32.25   (4.6%)    -8.2% ( -16% - 0%)
Fuzzy1                      265.67   (4.7%)     246.12   (5.0%)    -7.4% ( -16% - 2%)
LowPhrase                   166.32   (1.3%)     157.61   (1.6%)    -5.2% ( -8% - -2%)
Fuzzy2                      184.41   (4.3%)     176.40   (3.5%)    -4.3% ( -11% - 3%)
LowSpanNear                 749.77   (2.1%)     726.14   (2.2%)    -3.2% ( -7% - 1%)
MedPhrase                    23.77   (2.0%)      23.14   (1.9%)    -2.6% ( -6% - 1%)
HighPhrase                   18.73   (3.0%)      18.24   (3.0%)    -2.6% ( -8% - 3%)
MedSpanNear                 113.11   (2.3%)     110.17   (2.0%)    -2.6% ( -6% - 1%)
MedSloppyPhrase              10.28   (6.5%)      10.07   (6.9%)    -2.0% ( -14% - 12%)
LowSloppyPhrase              12.68   (6.6%)      12.43   (7.1%)    -2.0% ( -14% - 12%)
HighSloppyPhrase              9.47   (7.0%)       9.29   (7.5%)    -1.9% ( -15% - 13%)
IntNRQ                       27.89   (7.0%)      27.58   (8.7%)    -1.1% ( -15% - 15%)
HighSpanNear                  9.05   (4.9%)       8.98   (4.7%)    -0.8% ( -9% - 9%)
Respell                     273.80   (2.3%)     273.79   (2.2%)    -0.0% ( -4% - 4%)
HighTermMonthSort            68.77   (7.1%)      69.60   (7.8%)     1.2% ( -12% - 17%)
Wildcard                     92.81   (5.8%)      94.67   (6.2%)     2.0% ( -9% - 14%)
HighTermDayOfYearSort        61.99  (10.3%)      64.18  (10.9%)     3.5% ( -16% - 27%)
Prefix3                      41.42   (8.3%)      42.96   (8.2%)     3.7% ( -11% - 22%)
LowTerm                     694.99   (2.5%)    3126.69  (17.7%)   349.9% ( 321% - 379%)
HighTerm                     58.04   (2.7%)     490.60  (58.6%)   745.3% ( 666% - 828%)
MedTerm                     120.80   (2.6%)    1053.44  (55.1%)   772.1% ( 695% - 852%)
{noformat}
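For anyone reading the table, the "Pct diff" column appears to be the relative change in mean QPS from baseline to patch. The sketch below is only an illustration under that assumption (the class and method names are made up, and it does not try to reproduce the range in parentheses); it recomputes the value for two rows of the table.

```java
// Sketch only: assumes "Pct diff" is the relative change in mean QPS.
// PctDiffSketch and pctDiff are hypothetical names, not part of any benchmark tool.
final class PctDiffSketch {
    /** Relative change of the patch QPS over the baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double patchQps) {
        return 100.0 * (patchQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // QPS values copied from the AndHighLow and LowTerm rows of the table.
        System.out.printf("AndHighLow: %.1f%%%n", pctDiff(1440.24, 794.43));
        System.out.printf("LowTerm:    %.1f%%%n", pctDiff(694.99, 3126.69));
    }
}
```

Rounded to one decimal, these two calls should match the -44.8% and 349.9% reported above.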
The {{.doc}} file is 5.2% larger, and the index is 1.5% larger overall.
> Allow codecs to index term impacts
> ----------------------------------
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch,
> LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his
> implementation currently stores a max for the entire term, the problem is the
> same).
> We can imagine other similar algorithms too: I think the codec API should be
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a
> tool to 'rewrite' your index; he passes the IndexReader and Similarity to it.
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the
> Similarity. Another problem is that it needs access to the term and
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment
> in a branch with these changes and see if we can make it work well.
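To make the description's point concrete, here is a minimal sketch of the "max for the entire term" idea and of why the scoring function has to be known at write time. Everything in it is hypothetical: the class, the method names, and the BM25-like formula merely stand in for a Similarity and are not Lucene APIs or the approach taken in the patch.

```java
import java.util.function.DoubleBinaryOperator;

// Sketch only: none of these names are Lucene APIs.
final class TermImpactSketch {
    /**
     * Largest score any posting of one term can produce. Computing it while
     * writing postings requires the scoring function (and any collection
     * statistics it depends on) to be available up front.
     */
    static double maxScore(int[] freqs, int[] docLengths, DoubleBinaryOperator score) {
        double max = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            max = Math.max(max, score.applyAsDouble(freqs[i], docLengths[i]));
        }
        return max;
    }

    /** For a single-term query, the term can be skipped if even its best score is below the threshold. */
    static boolean canSkip(double termMaxScore, double minCompetitiveScore) {
        return termMaxScore < minCompetitiveScore;
    }

    /** Hypothetical BM25-like term-frequency saturation, standing in for a Similarity. */
    static DoubleBinaryOperator bm25Like(double k1, double b, double avgDocLength) {
        return (freq, len) -> freq / (freq + k1 * (1.0 - b + b * len / avgDocLength));
    }

    public static void main(String[] args) {
        // Three postings of one term: (freq, docLength) pairs, average length assumed to be 100.
        double max = maxScore(new int[] {1, 3, 2}, new int[] {100, 100, 50}, bm25Like(1.2, 0.75, 100.0));
        System.out.println("max score for term: " + max);
        System.out.println("skippable at threshold 0.8: " + canSkip(max, 0.8));
    }
}
```

Note that the average document length passed to the scoring function is a collection statistic, which is exactly the kind of value the description says is currently only available after the fact.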
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)