Re: Hash Joins vs. Bloom Filters / take 2

Tomas Vondra Thu, 22 Feb 2018 13:15:29 -0800


On 02/22/2018 09:52 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
> <[email protected]> wrote:
>> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>>> That's kinda slow to do per-item. I tried to "count" distinct items by
>>> checking the BF before adding (don't add redundantly), but that's less
>>> precise than a HLL in my experience.
>>
>> But you don't need to do that for each item you try to add - you know
>> that with M bits and K hash functions you can't reach 50% before at
>> least M/(2*K) additions. And you don't really need to track the number
>> of bits exactly - if it gets 55% full, it's not going to explode.
>>
>> So just wait until M/(2*K) additions, see how many bits remain until the
>> threshold - rinse and repeat. And use some reasonable minimum distance
>> (say, 5% of the capacity), not to do the check too often.
> 
> Ok, that works too.
> 
> Point is, SBFs help to keep the BF size as small as possible, keep it
> below work_mem, and to avoid caring about the total number of distinct
> items.
> 
> You may want the planner to try and estimate that to figure out 
> whether it's worth trying BFs or not, but once you decide to try,
> SBFs remove any dependency on the accuracy of that estimate.
>


OK, thanks for reminding me about SBF and for the discussion.

At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Hash Joins vs. Bloom Filters / take 2

Reply via email to