[ 
https://issues.apache.org/jira/browse/LUCENE-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7401:
---------------------------------
    Attachment: LUCENE-7401.patch

bq. That's a good example! In that case, with our current splitting, running a 
range filter for "small population" will be costly. Though, without other 
filters (by lat/lon) it will likely be costly anyway since town population is 
probably Zipf's law like? I.e., most areas will still have many more small 
population towns than big ones.

Right, a filter for "small population" is costly, but I don't think we can do 
much about it since it is due to the fact it would match lots of documents. The 
problem that would like to address here is the opposite: filtering for "large 
population", for instance: "Find all cities in America that have more than 10K 
inhabitants". This would be a selective filter, so hopefully something that we 
can execute efficiently. However with the current way the splitting works, the 
population dimension might never be indexed (because BKD would always decide to 
index the lat or lon instead, which have larger ranges) and such a query, which 
is very selective on the population dimension, would likely have to visit all 
documents that match "America" to find matches, which is disappointing.

Here is a patch that implements what I had in mind when opening this ticket. It 
ensures that every dimension gets split no less than 2x less than the dimension 
that had the most splits. So back to our 3 dimensions example with lat, lon and 
population, let's assume we have 1M docs, it would mean we need to split 10 
times. In such a scenario, we would likely split 4 times on lat, 4 times on lon 
and 2 times on the population dimension.

I think it is a better trade-off since it has a better worst case for selective 
queries. For instance the fact that today the geo dimensions would get 10 
splits means that a selective geo query would have to visit 1/1024th of the 
index, but a selective query on population would have to perform a linear scan. 
With this patch a selective geo query would have to visit 1/256th of the index 
(8 splits), which is slower, however a selective query on the population 
dimension would only have to visit 1/4th of the index (2 splits).

> BKDWriter should ensure all dimensions are indexed
> --------------------------------------------------
>
>                 Key: LUCENE-7401
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7401
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7401.patch
>
>
> The current heuristic is to use the dimension that has the largest span, so 
> if dimensions have a different distribution of values, we could theoretically 
> (but maybe in practice too?) end up with one dimension that is not indexed at 
> all and queries that are mostly selective on this dimension would need to 
> scan lots of blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to