Deduplication/inversion for dimensional points

Michael McCandless Fri, 16 Jul 2021 05:42:16 -0700

Hi Team!

I know we have made recent strides so that if the same point value
(1-dimensional or N-dimensional) is indexed many, many times in one
segment, we optimize for that case.  I think this happens only at the
leaf-block level, i.e. if all points in a block are identical, we write
a special header byte followed by that one value.


This is a massive reduction in index size, and a good speedup at search
time, since we only need to check once whether the value matches the
query's shape, and can then collect all docids in that block.
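
To make the leaf-level case concrete, here is a minimal Python sketch of the
idea. It is not Lucene's on-disk format or API: the names LeafBlock,
build_leaf and collect are hypothetical, and an in-memory list stands in for
the header byte plus encoded value.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical in-memory stand-in for an encoded point: (lat, lon).
Point = Tuple[float, float]


@dataclass
class LeafBlock:
    doc_ids: List[int]
    # One entry if every point in the block is identical,
    # otherwise one entry per doc ID (parallel to doc_ids).
    values: List[Point]
    all_same: bool


def build_leaf(points: List[Tuple[int, Point]]) -> LeafBlock:
    """Build a leaf block, storing the value once if all points are equal."""
    doc_ids = [doc for doc, _ in points]
    values = [value for _, value in points]
    all_same = bool(values) and all(v == values[0] for v in values)
    if all_same:
        values = values[:1]  # keep just the single shared value
    return LeafBlock(doc_ids, values, all_same)


def collect(block: LeafBlock,
            matches: Callable[[Point], bool]) -> Tuple[List[int], int]:
    """Return (matching doc IDs, number of times the query shape was tested)."""
    if block.all_same:
        # One test decides the fate of every doc ID in the block.
        hit = matches(block.values[0])
        return (list(block.doc_ids) if hit else []), 1
    hits = [doc for doc, value in zip(block.doc_ids, block.values)
            if matches(value)]
    return hits, len(block.values)
```

With three offers at the same lat/lon, collect tests the query shape once and
returns all three docids; a block with mixed values is tested once per point.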

Do we have any more deduping logic if many leaf blocks also share a single
value?  E.g. an inner node in the KD tree could note that all leaf blocks
under it have the same value?  And then somehow when intersecting we might
treat those leaf blocks (whose docids will be in postings order, right?) as
a strange postings list, maybe inserted as something like a disjunctive
clause in a BooleanQuery?  I think I saw an issue talking about something
like this idea, but cannot find it now.
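
As a rough illustration of the "strange postings list" idea, the sketch
below assumes each leaf block under such an inner node exposes its docids in
ascending order (the open question above) and merges them into one ordered
stream, the way a disjunction over several postings lists would. The name
union_sorted is hypothetical; this is not an existing Lucene API.

```python
import heapq
from typing import Iterable, Iterator, List


def union_sorted(leaf_doc_ids: Iterable[List[int]]) -> Iterator[int]:
    """Merge per-leaf sorted doc ID lists into one ascending stream.

    Assumes every input list is already sorted ascending. Consecutive
    duplicates are dropped, in case one doc appears in several leaves.
    """
    last = None
    for doc in heapq.merge(*leaf_doc_ids):
        if doc != last:
            yield doc
            last = doc
```

For example, list(union_sorted([[1, 5, 9], [2, 5, 11], [3]])) yields the
docids in order with the repeated 5 emitted once.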

Context: we (Amazon Product Search team) are working out how we could use
Lucene's awesome geo features to help customers search/shop better :)  And
the simplest approach, indexing a lat/lon point on every "offer", would
result in many, many duplicate lat/lons.

Thanks,

Mike McCandless

http://blog.mikemccandless.com
