[GitHub] clintropolis commented on issue #3878: Filters on high cardinality dimensions should sometimes use dim index bitset + full scan instead of unioning bitsets of dim values

GitBox Tue, 19 Feb 2019 23:37:43 -0800

clintropolis commented on issue #3878: Filters on high cardinality dimensions 
should sometimes use dim index bitset + full scan instead of unioning bitsets 
of dim values
URL: 
https://github.com/apache/incubator-druid/issues/3878#issuecomment-465459575
 
 
   I've also been lightly experimenting with this, as an offshoot of #6633, to 
try to get enough of a feel for things to put together a proposal for a more 
generic solution than in that PR. But I have been rather side tracked with some 
other things and haven't got back to it yet, so I can at least share my 
thoughts so far in case they are useful to you or someone else wants to run 
with it before I can pick it back up.
   
   Besides selectivity, my limited experiments lead me to think that the type 
of filter can also drive whether or not a filter should be done as a pre or a 
post filter, since not all filter value matching is equal. Particularly, things 
like bloom filters which need to compute a hash per value and test each hash 
against the bloom filter, the match itself is very expensive for high 
cardinality dimensions before it even gets to combining the bitmaps. #6633 
illustrates that it can be useful in that case to always push to a post filter 
if the cardinality is high, even if the filter is highly selective.
   
   The hard part of this then would be figuring out how to encode this somehow 
so that the threshold can be a function of how expensive the filter is, 
assuming such classification is possible or useful in practice and that other 
filters exhibit a similar patterns to what I was observing with bloom filters 
which was my main focus so far. By side effect, introducing the idea that 
filters have some sort of cost value additionally opens up another optimization 
for evaluating 'and' filters, by giving a sensible mechanism to control the 
order in which filters are evaluated so that cheaper filters can be evaluated 
first and non-matches potentially shake out faster.
   
   It looks like some effort has gone into selectivity estimation for use with 
`search` queries in the past, but my gut tells me many of these estimates look 
fairly expensive, so i'm unsure of their utility for all query types. I have 
not verified this however, nor am I super familiar with these estimators, so 
it's possible I'm wrong. I was hoping experimentation would shake out whether 
or not knowing filter selectivity directly is even necessary. It might be well 
enough that we know that certain types of filters (selectors and combinations 
of selectors) should rarely if ever not be pre-filters, while others should 
generally always skip bitmaps if the cardinality is high enough, perhaps with 
the threshold adjusted based on how expensive the filter is or not. However, I 
suspect it's not actually that simple either, but hard to say without further 
experiments.
   
   I don't really know what this looks like yet implementation wise, too many 
unknowns at this point. All testing I have done so far has been very manual. 
For my next steps, I was planning on setting up a test harness that would let 
me manually control whether or not filters should use bitmap indexes similar to 
#6633, but maybe as an option to all filters, and then benchmark with 
parameters to run under a variety of conditions to find if there is indeed a 
'break even' point and how it varies between filter types, overall dimension 
cardinality, and filter selectivity. But like I said, I'm not sure when I'll 
get to this, so if this interests you, I say have at it and I'll be more than 
happy to support with further discussion or review instead in order to see this 
get done :+1:


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] clintropolis commented on issue #3878: Filters on high cardinality dimensions should sometimes use dim index bitset + full scan instead of unioning bitsets of dim values

Reply via email to