Hi Team! I know we have made recent strides so that if the same point value (1D or N-dimensional) is indexed many, many times in one segment, we try to optimize for that case. I think this happens only at the leaf-block level: if all points in a block are identical, we write a special header byte followed by the one value.
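To make that leaf-block case concrete, here is a rough sketch of what I have in mind. It is a toy model and not Lucene's actual BKD on-disk format: the class name, the header byte values, and the 1D int encoding are all made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy model of a leaf block; NOT Lucene's real BKD format. */
final class LeafBlockSketch {
  // Hypothetical header bytes, invented for this sketch.
  static final byte RAW = 0;
  static final byte ALL_EQUAL = 1;

  /** Encodes one 1D int value per doc in the block. */
  static byte[] encode(int[] values) {
    boolean allEqual = true;
    for (int v : values) {
      if (v != values[0]) {
        allEqual = false;
        break;
      }
    }
    if (values.length > 0 && allEqual) {
      // Special header byte, then the single shared value.
      return ByteBuffer.allocate(1 + Integer.BYTES)
          .put(ALL_EQUAL)
          .putInt(values[0])
          .array();
    }
    // Otherwise: header byte, then every value in full.
    ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES * values.length).put(RAW);
    for (int v : values) {
      buf.putInt(v);
    }
    return buf.array();
  }

  /** Collects matching docids; docIds[i] belongs to the i-th value in the block. */
  static List<Integer> collect(byte[] block, int[] docIds, IntPredicate matches) {
    ByteBuffer buf = ByteBuffer.wrap(block);
    List<Integer> hits = new ArrayList<>();
    if (buf.get() == ALL_EQUAL) {
      // One check of the shared value decides the whole block.
      if (matches.test(buf.getInt())) {
        for (int d : docIds) {
          hits.add(d);
        }
      }
      return hits;
    }
    // Mixed block: test each value individually.
    for (int d : docIds) {
      if (matches.test(buf.getInt())) {
        hits.add(d);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int[] docIds = {3, 8, 15, 42};
    byte[] block = encode(new int[] {7, 7, 7, 7});
    System.out.println(block.length + " bytes, hits=" + collect(block, docIds, v -> v >= 5));
  }
}
```

In this toy encoding, a block of four identical values takes 5 bytes instead of 17, and matching it costs a single predicate check instead of four.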
That leaf-block optimization is a massive reduction in index size, and a good speedup at search time, since we only need to check once whether the value passes the query's shape, and can then collect all docids in that block.

Do we have any more deduping logic for the case where many leaf blocks also share a single value? E.g. an inner node in the KD tree could note that all leaf blocks under it have the same value. When intersecting, we might then treat those leaf blocks (whose docids will be in postings order, right?) as a strange postings list, maybe inserted as a clause in something like a disjunctive BooleanQuery?

I think I saw an issue discussing something like this idea, but cannot find it now.

Context: we (Amazon Product Search team) are working out how we could use Lucene's awesome geo features to help customers search/shop better :) And the simplest approach, indexing a lat/lon point on every "offer", would result in many, many duplicate lat/lons.

Thanks,

Mike McCandless

http://blog.mikemccandless.com