[
https://issues.apache.org/jira/browse/LUCENE-7618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated LUCENE-7618:
-----------------------------
Attachment: LUCENE-7618.patch
*TL;DR:* based on some half-assed benchmarking using junit timings, the results
did *NOT* show any improvement -- for some seeds (on my laptop) the new code
was actually slower at times, leading me to suspect I may have a bug somewhere
in either the test or the code, but I haven't had time to dig into it.
----
*Background...*
Just before I went on vacation, I had a conversation with someone who asked a
question that can be summarized as:
bq. If I already know the smallest possible value ({{minF}}) in the field
{{foo}}, which is faster: {{foo:\[\* TO X\]}} or {{foo:\[minF TO X\]}} ?
_(where X will be diff in every query)_
My answer, based on my recollection of how simple term-indexed range queries
used to work, was that any difference should be negligible -- in the {{\*}} case
lucene can skip the call to {{seekTo(minF)}}, but otherwise the execution
would be identical after that. For DocValues, I speculated that using {{\*}}
should theoretically be faster, because as the code iterates over the DocValues,
it would only have to compare each doc's value to {{X}}, and not to {{minF}}.
When I dug into the code, I found that wasn't the case: while query rewriting
has a special case optimization for {{foo:\[\* TO \*\]}}, there are no special
case optimizations for a range query open on one end -- instead the
TwoPhaseIterator created for SortedNumericDocValues assigns
Long.MIN_VALUE/Long.MAX_VALUE as needed anytime min/max are null, and does 2
comparisons per doc value. Likewise the codepath for SortedSetDocValues
assigns minOrd ({{0}}) and maxOrd ({{values.getValueCount() - 1}}) when the
user-specified min/max are null, and again does 2 comparisons per value.
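To make the existing behavior concrete, here's a standalone sketch (made-up names, *not* the actual Lucene code): null bounds get widened to sentinel values up front, so every doc value still pays for two comparisons even when one end of the range is open.

```java
// Standalone illustration of the two-comparisons-per-value pattern described
// above. The class and method names here are hypothetical; the real logic
// lives in the TwoPhaseIterator built by DocValuesRangeQuery.
public class TwoComparisonSketch {

    // A null min or max is widened to Long.MIN_VALUE / Long.MAX_VALUE, so the
    // per-value check is always "v >= lo && v <= hi" -- two comparisons.
    static boolean matches(long[] docValues, Long min, Long max) {
        final long lo = (min == null) ? Long.MIN_VALUE : min;
        final long hi = (max == null) ? Long.MAX_VALUE : max;
        for (long v : docValues) {
            if (v >= lo && v <= hi) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] values = {5L, 42L};
        // open-ended lower bound, analogous to foo:[* TO 10]
        System.out.println(matches(values, null, 10L));   // true
        // closed range that excludes both values
        System.out.println(matches(values, 100L, 200L));  // false
    }
}
```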
*Theory...*
Adding special case handling for open-ended range queries, returning
TwoPhaseIterators that only do one comparison in these special cases,
seem(ed|s) like an easy win? In particular for the case of SortedSetDocValues:
even if the original query has non-null min & max values, after calling
{{lookupTerm(...)}} on each we might find that one of them is completely out of
range _for individual segments_, and we could optimize away one comparison on a
per-segment level. (An existing optimization already handled the case where
the min/max were both completely below/above the range of values in the segment.)
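The shape of the idea is something like the following standalone sketch (hypothetical, *not* the attached patch): pick the predicate up front based on which ends are open, so the per-value hot loop only ever does the comparisons it actually needs.

```java
import java.util.function.LongPredicate;

// Hypothetical sketch of the proposed special casing: an open-ended range
// gets a one-comparison predicate instead of the general two-comparison one.
public class OpenRangeSketch {

    static LongPredicate rangePredicate(Long min, Long max) {
        if (min == null && max == null) {
            return v -> true;               // foo:[* TO *] matches everything
        } else if (min == null) {
            final long hi = max;
            return v -> v <= hi;            // foo:[* TO X] -> one comparison
        } else if (max == null) {
            final long lo = min;
            return v -> v >= lo;            // foo:[X TO *] -> one comparison
        } else {
            final long lo = min, hi = max;
            return v -> v >= lo && v <= hi; // general case: two comparisons
        }
    }

    public static void main(String[] args) {
        LongPredicate openLow = rangePredicate(null, 10L);
        System.out.println(openLow.test(5L));   // true
        System.out.println(openLow.test(11L));  // false
    }
}
```

The SortedSet per-segment variant would be the same pattern, just selecting the predicate after {{lookupTerm(...)}} resolves the bounds to ords for that segment.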
*Attempt...*
Step #1 was writing a {{TestDocValuesRangeQueryOptimizations}} (see patch)
against the existing code that built up an index with identical values in 3
different fields (SortedNumeric, SortedSet, and Points) and then did a lot of
randomized, open-ended range queries against that index, one field per test
method.
Once I had the basics of the test written, I then ran the test over and over
with the same seed and noted down the junit test times for each.
I then added what seemed like the most straightforward optimizations to
{{DocValuesRangeQuery}}, and re-ran the test a few times with the same seed.
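For anyone who wants to poke at the raw comparison cost outside of junit, a trivial standalone harness like the one below (purely illustrative, not part of the patch) times the one-comparison vs two-comparison predicates over the same data -- with all the usual caveats that naive {{nanoTime}} loops like this are noisy and easily distorted by the JIT.

```java
import java.util.Random;
import java.util.function.LongPredicate;

// Naive micro-timing sketch (illustrative only): counts matches so the JIT
// can't eliminate the loop, then reports elapsed time for each predicate.
public class NaiveTimingSketch {

    static long count(long[] values, LongPredicate p) {
        long hits = 0;
        for (long v : values) {
            if (p.test(v)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] values = new Random(42).longs(1_000_000).toArray();

        LongPredicate twoCmp = v -> v >= Long.MIN_VALUE && v <= 0L; // old path
        LongPredicate oneCmp = v -> v <= 0L;                        // proposed path

        for (LongPredicate p : new LongPredicate[]{twoCmp, oneCmp}) {
            long start = System.nanoTime();
            long hits = count(values, p);
            long elapsed = System.nanoTime() - start;
            System.out.println(hits + " hits in " + elapsed + " ns");
        }
    }
}
```

Both predicates must produce identical hit counts; only the timing should (in theory) differ.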
*Results...*
The results were not good. Looking back, I didn't even bother to save the tmp
file where I had been pasting the junit times -- that's how not good they were.
Even accounting for potential "noise" in the results from other stuff running
on my laptop, there was no inkling of any speedup in the new code (as compared
to the test against the points field, which didn't involve any modified code
paths) and IIRC the results were occasionally slower.
I set all this aside to come back to later to try and figure out if I'd made
any obvious mistakes, but skimming the code today, and writing up my
recollections, nothing jumps out at me. My best guess is that HotSpot is
already optimizing away some of these comparisons, and that the new code just
complicates things for HotSpot to the point that it's sometimes slower. I
don't plan on taking this any further, but I wanted to open the issue to track
the idea in case anyone else sees value in pursuing it further.
> Hypothetical perf improvements in DocValuesRangeQuery: reducing comparisons
> for some queries/segments
> -----------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7618
> URL: https://issues.apache.org/jira/browse/LUCENE-7618
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Hoss Man
> Attachments: LUCENE-7618.patch
>
>
> In reviewing the DocValuesRangeQuery code, it occurred to me that there
> _might_ be some potential performance optimizations possible in a few cases
> relating to queries that involve explicitly specified open ranges (ie: min or
> max are null) or, in the case of SortedSet, range queries that are
> *effectively* open ended on particular segments, because the min/max are
> below/above the minOrd/maxOrd for the segment.
> Since these seemed like semi-common situations (open ended range queries are
> fairly common in my experience; I'm not sure about the secondary SortedSet
> "ord" case, but it seemed potentially promising, particularly for fields like
> incrementing ids or timestamps, where values are added sequentially and are
> likely to be clustered together) I did a bit of experimenting and wanted to
> post my findings in jira -- patch & details to follow in comments.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)