[
https://issues.apache.org/jira/browse/LUCENE-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7339:
---------------------------------
Attachment: LUCENE-7339.patch
Here is a patch: it looks at the iterators to intersect and applies all of them
that are an instance of BitSetIterator in a random-access fashion. There is one
exception: if the iterator that has the minimum cost in the list is a
BitSetIterator, then this iterator will still be consumed using the regular
nextDoc/advance DISI API. This makes the optimization safe since we still use
the iterator with the least cost in order to "lead" the iteration.
I also changed the query cache to use a FixedBitSet rather than a
RoaringDocIdSet when the density is greater than 1% (which was the cut-off used
by RandomAccessFilterStrategy). This should help both because FixedBitSet is
faster on dense sets (1% is -2 on the x axis at
http://people.apache.org/~jpountz/doc_id_sets6.html) and also because it will
enable this random-access optimization more often.
I had to hack a bit luceneutil in order to generate conjunctions with an
iterator over a BitSet, which I did by creating a conjunction over a term query
and a numeric range. Here is the patch I applied to luceneutil:
{noformat}
diff --git a/src/main/perf/TaskParser.java b/src/main/perf/TaskParser.java
index 8397b3a..2365159 100644
--- a/src/main/perf/TaskParser.java
+++ b/src/main/perf/TaskParser.java
@@ -244,6 +244,29 @@ class TaskParser {
query = IntPoint.newRangeQuery(nrqFieldName, start, end);
sort = null;
group = null;
+ } else if (text.startsWith("filtered_nrq//")) {
+ // field start end
+ final int spot3 = text.indexOf(' ');
+ if (spot3 == -1) {
+ throw new RuntimeException("failed to parse query=" + text);
+ }
+ final int spot4 = text.indexOf(' ', spot3+1);
+ if (spot4 == -1) {
+ throw new RuntimeException("failed to parse query=" + text);
+ }
+ final int spot5 = text.indexOf(' ', spot4+1);
+ if (spot5 == -1) {
+ throw new RuntimeException("failed to parse query=" + text);
+ }
+ final String nrqFieldName = text.substring("filtered_nrq//".length(),
spot3);
+ final int start = Integer.parseInt(text.substring(1+spot3, spot4));
+ final int end = Integer.parseInt(text.substring(1+spot4, spot5));
+ query = new BooleanQuery.Builder()
+ .add(new TermQuery(new Term("body", text.substring(1+spot5))),
Occur.MUST)
+ .add(IntPoint.newRangeQuery(nrqFieldName, start, end),
Occur.FILTER)
+ .build();
+ sort = null;
+ group = null;
} else if (text.startsWith("datetimesort//")) {
throw new IllegalArgumentException("use lastmodndvsort instead");
} else if (text.startsWith("titlesort//")) {
diff --git a/tasks/wikimedium.10M.nostopwords.tasks
b/tasks/wikimedium.10M.nostopwords.tasks
index 342070c..983361f 100644
--- a/tasks/wikimedium.10M.nostopwords.tasks
+++ b/tasks/wikimedium.10M.nostopwords.tasks
@@ -13361,3 +13361,20 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
OrHighNotLow: do -necessities # freq=511178 freq=1195
OrNotHighLow: -had halfback # freq=1246743 freq=1205
OrHighNotLow: had -halfback # freq=1246743 freq=1205
+FilteredIntNRQ: filtered_nrq//timesecnum 6207 55832 ref
+FilteredIntNRQ: filtered_nrq//timesecnum 53 85622 http
+FilteredIntNRQ: filtered_nrq//timesecnum 2669 66142 from
+FilteredIntNRQ: filtered_nrq//timesecnum 9936 85687 name
+FilteredIntNRQ: filtered_nrq//timesecnum 23189 61377 title
+FilteredIntNRQ: filtered_nrq//timesecnum 7624 69351 date
+FilteredIntNRQ: filtered_nrq//timesecnum 15733 85583 which
+FilteredIntNRQ: filtered_nrq//timesecnum 8791 69420 also
+FilteredIntNRQ: filtered_nrq//timesecnum 6125 46693 first
+FilteredIntNRQ: filtered_nrq//timesecnum 8006 80120 his
+FilteredIntNRQ: filtered_nrq//timesecnum 11514 45063 cite
+FilteredIntNRQ: filtered_nrq//timesecnum 6342 72089 he
+FilteredIntNRQ: filtered_nrq//timesecnum 10670 66864 new
+FilteredIntNRQ: filtered_nrq//timesecnum 2896 83864 1
+FilteredIntNRQ: filtered_nrq//timesecnum 8947 64612 s
+FilteredIntNRQ: filtered_nrq//timesecnum 2808 75217 2
+FilteredIntNRQ: filtered_nrq//timesecnum 388 84762 one
{noformat}
I got a ~6% speedup when testing it on wikimedium10m:
{noformat}
TaskQPS baseline StdDev QPS patch StdDev
Pct diff
LowTerm 686.83 (7.6%) 677.34 (8.5%)
-1.4% ( -16% - 15%)
HighSloppyPhrase 19.29 (4.3%) 19.03 (4.5%)
-1.3% ( -9% - 7%)
Prefix3 159.38 (5.0%) 157.87 (5.0%)
-0.9% ( -10% - 9%)
OrNotHighMed 191.55 (2.2%) 189.74 (2.3%)
-0.9% ( -5% - 3%)
Wildcard 99.07 (5.9%) 98.18 (5.9%)
-0.9% ( -11% - 11%)
MedSpanNear 38.50 (3.1%) 38.15 (3.2%)
-0.9% ( -7% - 5%)
MedPhrase 88.65 (2.6%) 87.88 (1.8%)
-0.9% ( -5% - 3%)
LowSloppyPhrase 59.35 (2.8%) 58.95 (2.9%)
-0.7% ( -6% - 5%)
OrHighNotHigh 49.32 (4.0%) 49.00 (4.2%)
-0.7% ( -8% - 7%)
OrHighNotMed 86.14 (4.8%) 85.60 (5.4%)
-0.6% ( -10% - 10%)
OrHighNotLow 50.89 (6.1%) 50.58 (5.7%)
-0.6% ( -11% - 11%)
MedSloppyPhrase 76.49 (2.0%) 76.05 (1.9%)
-0.6% ( -4% - 3%)
AndHighHigh 119.69 (1.2%) 119.07 (2.1%)
-0.5% ( -3% - 2%)
OrNotHighLow 1060.23 (3.2%) 1055.14 (3.9%)
-0.5% ( -7% - 6%)
LowSpanNear 229.73 (2.9%) 228.70 (3.0%)
-0.4% ( -6% - 5%)
LowPhrase 151.17 (2.2%) 150.49 (3.0%)
-0.4% ( -5% - 4%)
HighPhrase 43.96 (1.7%) 43.79 (2.3%)
-0.4% ( -4% - 3%)
HighSpanNear 2.23 (7.1%) 2.22 (7.4%)
-0.4% ( -13% - 15%)
IntNRQ 12.11 (8.9%) 12.07 (8.9%)
-0.3% ( -16% - 19%)
OrNotHighHigh 67.08 (3.9%) 66.92 (3.9%)
-0.2% ( -7% - 7%)
MedTerm 157.87 (5.6%) 157.54 (5.1%)
-0.2% ( -10% - 11%)
HighTerm 76.31 (5.8%) 76.16 (5.5%)
-0.2% ( -10% - 11%)
Respell 74.16 (2.3%) 74.01 (3.2%)
-0.2% ( -5% - 5%)
AndHighMed 144.89 (1.8%) 144.62 (1.5%)
-0.2% ( -3% - 3%)
OrHighHigh 31.12 (6.2%) 31.18 (5.7%)
0.2% ( -11% - 12%)
OrHighMed 38.73 (6.0%) 38.84 (5.4%)
0.3% ( -10% - 12%)
OrHighLow 102.78 (4.0%) 103.09 (4.0%)
0.3% ( -7% - 8%)
Fuzzy1 75.62 (16.4%) 76.64 (12.7%)
1.4% ( -23% - 36%)
AndHighLow 785.48 (4.2%) 796.18 (3.5%)
1.4% ( -6% - 9%)
Fuzzy2 45.30 (16.5%) 47.03 (19.4%)
3.8% ( -27% - 47%)
FilteredIntNRQ 10.57 (4.0%) 11.20 (4.4%)
5.9% ( -2% - 14%)
{noformat}
> Bring back RandomAccessFilterStrategy
> -------------------------------------
>
> Key: LUCENE-7339
> URL: https://issues.apache.org/jira/browse/LUCENE-7339
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7339.patch
>
>
> FiteredQuery had 3 ways of running conjunctions: leap-frog, query first and
> random-access filter. We still use leap-frog for conjunctions and we now have
> a better "query-first" strategy through two-phase iteration. However, we
> don't have any equivalent for the random-access filter strategy.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]