Thanks Robert/Adrien for the quick thoughts!

Robert- Just to clarify a little bit (slightly more caffeinated now): For
the cases where an indexed rectangle fits entirely within the query
rectangle, these would be "confirmed" matches up-front, relying on the
efficiency of the BKD-tree data structure (as you describe). The second
phase confirmation I had in mind were the cases where the query
rectangle overlaps,
or sits inside the indexed rectangle, and the individual points/docs within
the indexed rectangle need to be checked against the boundaries of the
query rectangle. That step seemed like a good candidate for a second phase,
effectively avoiding that check if it doesn't end up being necessary. But,
if that is relatively rare compared to the case where an indexed shape fits
entirely within the query shape, then maybe there isn't value.

Adrien- Oh cool! Was not aware of this actually, but after reading the blog
and looking at the code a little bit, this is in the neighborhood of what I
was thinking. I suppose what I had in mind is a little bit of a hybrid of
the two approaches currently implemented. I wonder if, in the case where
the range query is used to "verify" matches (to borrow from your
terminology), there's a benefit to still using the BKD-tree to quickly
eliminate docs that don't match before using the doc values to confirm
matches for "approximate" matches. Said differently, what if there was a
range query implementation that collected "definite" matches as well as
"potential" matches using the BKD-tree up-front (so definite matches are
docs where the indexed shape is entirely contained in the query shape and
potential matches are cases where the indexed shape overlaps the query
shape), then in a second phase used the doc values to confirm/reject
"potential" matches? Just a thought.

Unwinding a bit, these responses are exactly what I was looking for. As a
starting point, I wanted to understand if this had been explored, and it
looks like it certainly has—so thanks for all the pointers!

Cheers,
-Greg

On Tue, Jun 29, 2021 at 7:40 AM Adrien Grand <[email protected]> wrote:

> Hi Greg,
>
> Have you looked at IndexOrDocValuesQuery? It dynamically chooses between
> computing the range up-front using the BKD tree and running the range query
> using doc values depending on the estimated cost of the range query
> (computed by counting the number of leaf nodes of the BKD tree that have
> matches, which is cheap to compute) vs. the cost of the overall query.
> Hopefully the javadocs are not too bad, and I wrote a small blog
> <https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch>
> about this query a few years ago.
>
>
> On Tue, Jun 29, 2021 at 3:30 PM Greg Miller <[email protected]> wrote:
>
>> Hi folks-
>>
>> I've been spending a little time getting familiar with the BKD-tree-based
>> range query support currently implemented in Lucene, and wonder if there's
>> ever been a discussion around supporting two-phase iteration in this space.
>> If I'm understanding the current implementation properly (specifically
>> looking at PointRangeQuery), it appears that all matches are determined
>> up-front by 1) identifying segments of the tree that contain candidate
>> matches (i.e., containing part of the query range), and then 2) confirming
>> whether-or-not the contained points actually fall within the range. I'm
>> also a little low on coffee this morning so it's entirely possible I'm
>> misunderstanding the current implementation (please correct me if so).
>>
>> With this approach, it seems like we could potentially be doing quite a
>> bit of wasted effort in some situations. I have no thoughts on how to
>> actually implement this yet, but I wonder if we could support two-phase
>> iteration by 1) returning all docs with points contained in candidate
>> BKD-tree segments as an approximation, and then 2) only checking the points
>> against the query range when confirming matches in the second phase? I
>> think the idea would extend to LatLonPointDistanceQuery as well (and maybe
>> others?).
>>
>> I did a Jira search for a related issue but came up empty. Anyone know if
>> this idea has been discussed previously, or if there's some inherent flaw
>> with the approach that would make it a non-starter? I don't really have any
>> cycles to work on this at the moment, but can at least open a Jira issue to
>> track if it seems like a reasonable thing to explore.
>>
>> Cheers,
>> -Greg
>>
>
>
> --
> Adrien
>

Reply via email to