[
https://issues.apache.org/jira/browse/LUCENE-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7401:
---------------------------------
Attachment: LUCENE-7401.patch
bq. That's a good example! In that case, with our current splitting, running a
range filter for "small population" will be costly. Though, without other
filters (by lat/lon) it will likely be costly anyway since town population is
probably Zipf's law like? I.e., most areas will still have many more small
population towns than big ones.
Right, a filter for "small population" is costly, but I don't think we can do
much about it since it is due to the fact it would match lots of documents. The
problem that would like to address here is the opposite: filtering for "large
population", for instance: "Find all cities in America that have more than 10K
inhabitants". This would be a selective filter, so hopefully something that we
can execute efficiently. However with the current way the splitting works, the
population dimension might never be indexed (because BKD would always decide to
index the lat or lon instead, which have larger ranges) and such a query, which
is very selective on the population dimension, would likely have to visit all
documents that match "America" to find matches, which is disappointing.
Here is a patch that implements what I had in mind when opening this ticket. It
ensures that every dimension gets split no less than 2x less than the dimension
that had the most splits. So back to our 3 dimensions example with lat, lon and
population, let's assume we have 1M docs, it would mean we need to split 10
times. In such a scenario, we would likely split 4 times on lat, 4 times on lon
and 2 times on the population dimension.
I think it is a better trade-off since it has a better worst case for selective
queries. For instance the fact that today the geo dimensions would get 10
splits means that a selective geo query would have to visit 1/1024th of the
index, but a selective query on population would have to perform a linear scan.
With this patch a selective geo query would have to visit 1/256th of the index
(8 splits), which is slower, however a selective query on the population
dimension would only have to visit 1/4th of the index (2 splits).
> BKDWriter should ensure all dimensions are indexed
> --------------------------------------------------
>
> Key: LUCENE-7401
> URL: https://issues.apache.org/jira/browse/LUCENE-7401
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7401.patch
>
>
> The current heuristic is to use the dimension that has the largest span, so
> if dimensions have a different distribution of values, we could theoretically
> (but maybe in practice too?) end up with one dimension that is not indexed at
> all and queries that are mostly selective on this dimension would need to
> scan lots of blocks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]