Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

Tomas Vondra Sun, 10 Jan 2016 13:12:28 -0800

Hi,

On 01/10/2016 05:11 AM, Peter Geoghegan wrote:

On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
<[email protected]> wrote:

Which means the "dim.r" column has 100 different values (0-99) with uniform
distribution. So e.g. "WHERE r < 15" matches 15%.


I think that the use of a uniform distribution to demonstrate this
patch is a bad idea, unless you want to have a conversation about the
worst case.

Look at the use cases for bloom filters in general. They're almost
all some variation on the same theme: checking a bloom filter
inexpensively, to avoid an expensive cache miss (of some fashion).


Right.

The Google Chrome web browser uses a bloom filter for malicious URLs.

FWIW I don't think Chrome is using bloom filter for this purposeanymore. Chromium certainly does not (it's using PrefixSet instead).

It usually avoids consulting Google's servers about whether or not
any given URL that is visited is malicious by using the bloom filter.
This is effective in large part because the top 100 websites ranked
by popularity have a huge falloff in popularity as you go down the
list. It looks like a Zipfian distribution, and so I imagine they get
pretty far with a very small bloom filter, even though in theory the
chances of any given valid URL resulting in a cache miss is very
high.

I'm not familiar with how Chrome used bloom filters, but I'd expect themto be very careful about false positives (and also false negatives, asthe bloom filter can't contain all malicious URLs).

My assumptions is that they've been able to make that work because theydo have detailed stats about how frequently people visit those URLs, anduse that when building the bloom filter. But we don't have suchinformation in hashjoin.


Generally, uniform distributions are rare in a non-canonical list of
things, like a hash join outer relation's attribute.

Well, I'm not claiming testing uniform distribution is enough, butsurely it's one of the cases we should handle just fine.

The problem with non-uniform cases is that it really depends on theouter side of the join.

For example let's say the hash table contains 1000 values, and the bloomfilter has 1% false positive rate. But let's assume that the outer sidehas a value that triggers the false positive rate, and that it'sactually 99% of the outer table. Suddenly, you have 99% false positiverate, rendering the bloom filter pointless.

I don't think this is fixable while creating the bloom filter. All wecan do is watch the bloom filter lookups and disable the bloom filteronce the false positive rate reaches some threshold.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

Reply via email to