Hey Mike,

I believe that we always handled the case when all documents in a leaf have
the same value efficiently. The case that got improved more recently is the
case when there are many duplicates within a leaf but still more than one
value. We added run-length encoding for this case (LUCENE-8868
<https://issues.apache.org/jira/browse/LUCENE-8868>) and added a new
optional API that allows checking the value only once to make queries more
efficient (LUCENE-8885 <https://issues.apache.org/jira/browse/LUCENE-8885>).
Kudos to Ignacio for this work.

To my knowledge, we don't have more deduplication logic. When an inner
block has a single value, the IntersectVisitor likely
returns CELL_INSIDE_QUERY and Lucene will only collect doc IDs for all leaf
blocks that are under this leaf block without doing any additional
comparison. This should be quite good already.

The fact that we have a visitor API makes it hard to actually treat these
doc IDs as a postings list even though they are in order. This is one
reason why I've been wondering if we should move to a different API (
LUCENE-9619 <https://issues.apache.org/jira/browse/LUCENE-9619>) where we
could explore the tree programmatically, which would hopefully allow us to
treat these doc IDs like a postings list. It wouldn't support skipping, but
at least we could avoid materializing a BitSet containing all doc IDs that
match this value. Not supporting skipping feels bad, but maybe it's not
actually that bad actually given that if you are using
IndexOrDocValuesQuery, the BKD tree will only be consulted if it leads
iteration, ie. we'd consume most of the matching doc IDs anyway. We could
record the min/max doc ID for each leaf block, but this feels like a major
change to the API for a very narrow use-case?

Also related, Ignacio has been asking if our BKD trees should encode doc
IDs more like postings: LUCENE-9368
<https://issues.apache.org/jira/browse/LUCENE-9368>.

On Fri, Jul 16, 2021 at 2:42 PM Michael McCandless <
[email protected]> wrote:

> Hi Team!
>
> I know we have made recent strides so that if the same dimensional point
> value (or N-dimensional point value) is indexed many, many times in one
> segment, we try to somehow optimize for that case.  I think this happens
> only at the leaf-block level, i.e. if all points in the block are
> identical, we write a special header byte and the one value.
>
> This is a massive reduction in index size, and a good speedup at search
> time, since we only need to check if the value passes the query's shape
> once, and can then collect all docids in that block.
>
> Do we have any more deduping logic if many leaf blocks also share a single
> value?  E.g. an inner node in the KD tree could note that all leaf blocks
> under it have the same value?  And then somehow when intersecting we might
> treat those leaf blocks (whose docids will be in postings order, right?) as
> a strange postings list, maybe disjunctively inserted into something like a
> disjunctive clause in a BooleanQuery?  I think I saw an issue talking about
> something like this idea, but cannot find it now.
>
> Context: we (Amazon Product Search team) are working out how we could use
> Lucene's awesome geo features to help customers search/shop better :)  And
> the simplest approach, index a lat/lon point on every "offer", would result
> in many many duplicate lat/lons.
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


-- 
Adrien

Reply via email to