Hey Mike, I believe that we always handled the case when all documents in a leaf have the same value efficiently. The case that got improved more recently is the case when there are many duplicates within a leaf but still more than one value. We added run-length encoding for this case (LUCENE-8868 <https://issues.apache.org/jira/browse/LUCENE-8868>) and added a new optional API that allows checking the value only once to make queries more efficient (LUCENE-8885 <https://issues.apache.org/jira/browse/LUCENE-8885>). Kudos to Ignacio for this work.
To my knowledge, we don't have more deduplication logic. When an inner block has a single value, the IntersectVisitor likely returns CELL_INSIDE_QUERY and Lucene will only collect doc IDs for all leaf blocks that are under this leaf block without doing any additional comparison. This should be quite good already. The fact that we have a visitor API makes it hard to actually treat these doc IDs as a postings list even though they are in order. This is one reason why I've been wondering if we should move to a different API ( LUCENE-9619 <https://issues.apache.org/jira/browse/LUCENE-9619>) where we could explore the tree programmatically, which would hopefully allow us to treat these doc IDs like a postings list. It wouldn't support skipping, but at least we could avoid materializing a BitSet containing all doc IDs that match this value. Not supporting skipping feels bad, but maybe it's not actually that bad actually given that if you are using IndexOrDocValuesQuery, the BKD tree will only be consulted if it leads iteration, ie. we'd consume most of the matching doc IDs anyway. We could record the min/max doc ID for each leaf block, but this feels like a major change to the API for a very narrow use-case? Also related, Ignacio has been asking if our BKD trees should encode doc IDs more like postings: LUCENE-9368 <https://issues.apache.org/jira/browse/LUCENE-9368>. On Fri, Jul 16, 2021 at 2:42 PM Michael McCandless < [email protected]> wrote: > Hi Team! > > I know we have made recent strides so that if the same dimensional point > value (or N-dimensional point value) is indexed many, many times in one > segment, we try to somehow optimize for that case. I think this happens > only at the leaf-block level, i.e. if all points in the block are > identical, we write a special header byte and the one value. > > This is a massive reduction in index size, and a good speedup at search > time, since we only need to check if the value passes the query's shape > once, and can then collect all docids in that block. > > Do we have any more deduping logic if many leaf blocks also share a single > value? E.g. an inner node in the KD tree could note that all leaf blocks > under it have the same value? And then somehow when intersecting we might > treat those leaf blocks (whose docids will be in postings order, right?) as > a strange postings list, maybe disjunctively inserted into something like a > disjunctive clause in a BooleanQuery? I think I saw an issue talking about > something like this idea, but cannot find it now. > > Context: we (Amazon Product Search team) are working out how we could use > Lucene's awesome geo features to help customers search/shop better :) And > the simplest approach, index a lat/lon point on every "offer", would result > in many many duplicate lat/lons. > > Thanks, > > Mike McCandless > > http://blog.mikemccandless.com > -- Adrien
