On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>>> I suspect that it could make sense to use a Bloom filter to
>>>> summarize the entire inner side of the join all at once, even
>>>> when there are multiple batches. I also suspect that this is
>>>> particularly beneficial with parallel hash joins, where
>>>> IPC/synchronization overhead can be a big problem.
>>> But that's what the patch does, currently - the filter is built
>>> during the initial pass through the data, and then used for all
>>> batches.
>> I misunderstood. I would probably do something like double or triple
>> the original rows estimate instead, though. The estimate must be at
>> least slightly inaccurate when we get to this point, but I don't
>> think that that's a good enough reason to give up on the estimate
> That's a problem only for the multi-batch case, though.
> With a single batch we can walk the hash table and count non-empty
> buckets, to get a good ndistinct estimate cheaply. And then size the
> filter considering both memory requirements (so that it fits into the
> CPU cache) and the false positive rate. There are other things we may need
> to consider (memory usage vs. work_mem), but that's a separate issue.
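
To make the sizing math concrete: the usual textbook formulas are
m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions, for
n distinct values and target false positive rate p. A quick sketch (purely
illustrative, not code from the patch; bloom_sizing() is a made-up helper,
with ndistinct coming e.g. from the non-empty bucket count):

#include <math.h>
#include <stdint.h>

typedef struct BloomSizing
{
    uint64_t    nbits;      /* filter size in bits */
    int         nhashes;    /* number of hash functions */
} BloomSizing;

static BloomSizing
bloom_sizing(uint64_t ndistinct, double fpr)
{
    BloomSizing s;
    double      ln2 = log(2.0);

    /* m = -n * ln(p) / (ln 2)^2 */
    s.nbits = (uint64_t) ceil(-((double) ndistinct) * log(fpr) / (ln2 * ln2));

    /* k = (m / n) * ln 2, at least 1 */
    s.nhashes = (int) round(((double) s.nbits / (double) ndistinct) * ln2);
    if (s.nhashes < 1)
        s.nhashes = 1;

    return s;
}

For example, ndistinct = 1M at a 1% false positive rate comes out to
roughly 1.2 MB and 7 hash functions, which gives a feel for when the
filter stops fitting into CPU caches.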
> With multiple batches I think we could use the "size the bloom filter
> for a fraction of work_mem" approach which the current patch uses when
> switching to multiple batches halfway through. That pretty much
> entirely ignores the estimate and essentially replaces it with a
> "fictional" estimate.
> I think that's a better approach than using some arbitrary multiple of
> the estimate. When we have to start batching halfway through, the
> estimate is proven to be rather bogus anyway, but we may treat it as a
> lower boundary for the bloom filter size.
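
FWIW, that rule could be sketched as follows (again just an illustration,
reusing the hypothetical bloom_sizing() helper from the sketch above; the
fraction is arbitrary):

/*
 * Multi-batch case: size the filter for a fixed fraction of work_mem
 * (the "fictional" estimate), but never below what the planner's
 * ndistinct estimate would ask for.
 */
#define BLOOM_WORK_MEM_FRACTION 0.125

static uint64_t
bloom_nbits_multibatch(uint64_t est_ndistinct, double fpr,
                       uint64_t work_mem_bytes)
{
    uint64_t    from_estimate = bloom_sizing(est_ndistinct, fpr).nbits;
    uint64_t    from_work_mem;

    from_work_mem = (uint64_t) (work_mem_bytes * 8 * BLOOM_WORK_MEM_FRACTION);

    return (from_estimate > from_work_mem) ? from_estimate : from_work_mem;
}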
>> As I said, X should not be a portion of work_mem, because that has
>> only a weak relationship to what really matters.
> I agree a fixed fraction of work_mem may not be the right thing, but the
> goal was to make the bloom filter part of the Hash memory budget, i.e.
> bloom filter + hash table <= work_mem
> (which I think we agree should be the case), without increasing the
> number of batches too much. For example, if you size the filter ignoring
> this, and it ends up being 90% of work_mem, you may need to do the hash
> join in 128 batches instead of just 16. Or something like that.
> Maybe that would still be a win, though. Firstly, the higher number of
> batches may not have a huge impact - in one case we need to serialize
> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
> accurate filter allows us to discard much more data from the outer
> relation ...
Let me reiterate: you can avoid both issues with scalable bloom filters.
An HLL can be used to estimate set size; the paper makes no mention of
it, probably because it assumes only distinct items are added to the set.
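
A rough sketch of the growth rule, to show what I mean (following Almeida
et al., "Scalable Bloom Filters"; only the slice bookkeeping, with hashing
and the bit arrays elided, reusing the BloomSizing helper from the earlier
sketch, and the constants are just examples):

/*
 * A scalable Bloom filter is a chain of plain filter slices.  New keys go
 * into the newest slice; lookups probe every slice.  When the newest
 * slice reaches its capacity, another slice is added with SBF_GROWTH
 * times the capacity and an SBF_TIGHTENING times smaller per-slice error,
 * so the compound false positive rate stays bounded by roughly
 * fpr0 / (1 - SBF_TIGHTENING) no matter how far off the initial
 * ndistinct estimate was.
 */
#define SBF_GROWTH      2.0
#define SBF_TIGHTENING  0.5

typedef struct BloomSlice
{
    BloomSizing sizing;         /* nbits / nhashes, as computed above */
    uint64_t    capacity;       /* insertions allowed in this slice */
    uint64_t    ninserted;
    struct BloomSlice *next;
} BloomSlice;

/* Sizing of the i-th slice (0-based), given initial capacity and error. */
static BloomSizing
sbf_slice_sizing(int slice_no, uint64_t capacity0, double fpr0)
{
    uint64_t    capacity = (uint64_t) (capacity0 * pow(SBF_GROWTH, slice_no));
    double      fpr = fpr0 * pow(SBF_TIGHTENING, slice_no);

    return bloom_sizing(capacity, fpr);
}

Alternatively, an HLL maintained while the hash table is built would give
a reasonable ndistinct figure for sizing a single, static filter.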