Re: hashjoins vs. Bloom filters (yet again)

Oleg Bartunov Wed, 03 Jun 2026 02:21:16 -0700

On Sat, May 30, 2026, 01:56 Tomas Vondra <[email protected]> wrote:

> Hi,
>
> A random discussion at pgconf.dev made me revisit one of my ancient
> patches, attempting to use Bloom filters to hash joins. I did work on
> that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
> me briefly revisit that, before I get to the new patch.
>
>
> old patches
> -----------
>
> Those old patches tried to do a fairly small thing during a hash join,
> and that's building a Bloom filter on the inner relation (the one that
> gets hashed), and then use that filter before probing the hash table.
>
> The benefits come from Bloom filters being (fairly) cheap, and a
> negative answer (hash is not in the filter) may allows us to skip a much
> more expensive operation.
>
> The old threads patches focused especially at two hash join cases:
>
> (a) A very selective join, i.e. a significant fraction of outer tuples
> does not have a match in the hash table.
>
> (b) A selective hash join forced to do batching because the hash table
> is too large, and thus forced to spill outer tuples to temporary files.
>
> For (a), the benefit comes from Bloom filters being much cheaper to
> probe than a hash table. The exact cost depends on the implementation,
> sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
> the filter discards 90% of tuples, it can be a big win.
>
> For (b), the filter (for all the batches at once) allows us to discard
> some of the outer tuples without writing them to temporary files. Which
> is way more expensive than probing a hash table.
>
> The patches got stuck mostly because deciding if it makes sense to
> build/use the Bloom filter is somewhat hard. For cases where 100% of the
> tuples have a match it's pointless - it's just pure cost, no benefit.
> The regressions are relatively small, though (<10%).
>
> For (b) it's much less sensitive to this kind of issues, of course. The
> cost of writing outer tuples to temporary files is much higher than
> building/probing a Bloom filter.
>
> Clearly, a filter that discards 99% of tuples is great. And a filter
> that keeps 99% of tuples is not great. But where exactly are the
> thresholds is not quite clear.
>
> There's also a related question of sizing the filter. Bloom filters are
> usually sized by specifying the number of distinct values and the
> desired false positive rate. And we could try doing that - pick a
> standard false positive rate (e.g. the built-in bloom_filter aims for
> 1-2%), estimate the ndistinct, and get the size of the Bloom filter.
>
> However, chances are the filter is too big. We can't get work_mem, the
> join is already using that for the hash table etc. We can maybe use a
> fraction of it, and that may not be enough to fit the "perfect" filter.
> We could bail out and not use any Bloom filter at all, but that seems a
> bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?
>
> Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
> using a "worse" Bloom filter with 10% false positives would be a win?
> It'd still discard ~89% of tuples.
>
> Yet another angle leading to this kind of questions is inaccurate
> ndistinct estimates (and we all know those estimates can be quite
> unreliable). Let's say we size the filter for 1M distinct values (and it
> just about fits into the memory budget), but then during execution we
> find there are 2M distinct values. Well, now we may have ~10% false
> positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.
>
> At some point the filter stops being worth it, and we should either not
> build it, or we should stop probing it. But when is that?
>
> I think we'd need some sort of cost model to make judgments about this.
>
> Anyway, this was just me summarizing the old threads, and what I think
> got them stuck. Most of these questions are still open, although I think
> we may be able to solve them better than we could ~10 years ago. We have
> extended stats, we know about FK constraints during planning, ...
>
>
> new patch
> ---------
>
> Now let's talk about the new experimental/PoC patch that came from the
> pgconf.dev discussions. It doesn't really solve the issues I just went
> through, it's more of an attempt to take it one step further.
>
> One of the things mentioned in the 2018 thread was the possibility to
> push the filter much deeper, instead of using it just in the hash join
> node itself. It was merely discussed, but there was no code written, or
> anything like that. But it's the thing I decided to take a stab at after
> getting back from Vancouver.
>
> Consider a starjoin query
>
>   SELECT + FROM f JOIN d1 (f.id1 = d1.id)
>                   JOIN d2 (f.id2 = d2.id)
>                   JOIN d2 (f.id3 = d3.id)
>    WHERE d1.x = 1
>      AND d2.y = 2
>      AND d3.z = 3;
>
> which will be planned using a left-deep plan like this one:
>
>         HJ
>       /    \
>     D3     HJ
>          /    \
>         D2    HJ
>             /    \
>            D1     F
>
> With hashes on "D" tables, and a scan on "F". With the "old" patches,
> each HJ node would use a Bloom filter internally. But there's an
> interesting opportunity to "push down" the filters to the scan on "F",
> and evaluate them right there, a bit as if the scan had a local qual.
>
> The attached patch implements a PoC of this, and it's pretty effective.
>
> Of course, it depends on the selectivity of the joins (and thus how many
> tuples get discarded by the filters). But because it moves all the
> "cheap" filter probes *before* probing any of the hash tables, it has a
> multiplication effect for the benefits.
>
> Yes, it still has most of the open issues discussed earlier, and those
> will need to be addressed. But this "multiplication" may also make it
> somewhat less sensitive to the regressions.
>
> In the example above, if each of the 3 joins has 20% selectivity (i.e.
> 20% tuples go through), then the total selectivity is ~1%. So the "F"
> scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
> and it does not eliminate any tuples? That still means the overall
> selectivity is only ~4%.
>
> Of course, this only works for larger joins, and maybe the joins are
> correlated in some weird way, etc. Also, what does 4% selectivity mean
> for the overall query duration?
>
> Attached is a PDF with results from a simple benchmark using joins like
> the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
> couple GUCs to eliminate variations in the plan. The dimension joins are
> independent and match a variable fraction of the fact (1% - 100%).
>
> The columns are for three branches - master, and "patched" with the
> push-down disabled and enabled, for joins with 1-3 dimensions.
>
> The last two column groups are comparing the "patched" results to
> master. With "off" there's no difference (other than random noise), just
> as expected. But with the push-down enabled, there are fairly
> significant speedups (up to ~3x). Of course, this is just a benchmark,
> practical queries may do other stuff, making the gains smaller. OTOH, it
> may also be much better, if there are expensive nodes in between.
>
>
> The PoC patch is not very big or complex. 280KB seems like a lot, but
> like 99% of that is changes in test output, because the patch adds some
> info about the Bloom filters to EXPLAIN. The actual .c changes are only
> ~1000 lines, and a half of that is comments.
>
> The most interesting stuff happens in create_hashjoin_plan(), where we
> attempt to push-down the filter to a scan in the outer subtree. If that
> succeeds, then ExecInitHashJoin initializes the filter so that the scan
> can find it, and Hash builds the filter along with the hash table. And
> then the scan nodes probe the pushed-down filter in ExecScanExtended().
>
> There's bunch of boilerplate so that setrefs does the right thing with
> expressions, etc. But it's a couple lines here and there. I'm actually
> surprised how little code this is.
>
> There's one detail I haven't mentioned yet - there's a simple adaptive
> behavior, to deal with filters that are not selective enough. Per some
> initial tests there's little benefit when the filter keeps >75% tuples,
> and for >90% there were measurable regressions (~50%). This was very
> consistent for different data types, etc.
>
> So the patch tracks number of matching tuples per 1000 probes, when it
> exceeds 90% it switches to sampling. Only 1% of tuples gets probed in
> the filter, and if the fraction drops <80%, all the tuples get probed
> again. This is very simple, needs more thought. But for the purpose of
> the testing it worked quite well. There still is a small regression
> (~3%), which I assume is due to building the filter.
>
>
> Aside from the issues with deciding if to use a filter at all, sizing
> it, etc. - which are still valid (even with the adaptive thing), and
> need to be solved, there's one more annoying issue specific to this new
> push-down stuff.
>
>
> Earlier, I mentioned the push-down happens in create_hashjoin_plan().
> Which means it happens *after* planning and costing. There are reasons
> for that, but it has some unfortunate & annoying consequences.
>
> Ideally, we'd know about the filters when constructing the scan nodes,
> so we'd have a chance to estimate how many tuples will be eliminated by
> probing the filters (which is about the same thing as estimating the
> join sizes). But we can't do that, because our planner works bottom-up.
> When constructing the scan nodes we know which tables we'll join with,
> but we have no idea which of the join algorithms we'll pick.
>
> We'll consider all three join types, and the scan node has no say which
> of those will win. But the Bloom filter push-down is specific to hash
> joins. So what should the scan node do? Either it can assume it's under
> hash join (and set rows/cost as if there's a Bloom filter), or it can
> set costs in a join-agnostic way (like now).
>
> The only "correct" way I can think of dealing with this in the bottom-up
> world is having two sets of paths - one set for a hash join, one set for
> other joins. But that's not just for scans. We'd need that for all
> paths, and for different combinations of joins. For the query with 3
> joins, we'd end up with 2^3 combinations. That seems not great.
>
>
> So I tend to see this as an opportunistic optimization. We do the
> planning assuming there's no Bloom filter push-down, and then after the
> fact we see if there's an opportunity after all. Which means we may not
> pick a plan with hash joins, not realizing it might be made faster.
>
> But in my mind that's somewhat acceptable / defensible.
>
> The bigger issue for me is that it may make the EXPLAIN ANALYZE output
> way harder to understand. The estimated "rows" are calculated before the
> filter push-down happens, while the actual "rows" are with the filter
> probing, of course. But it seems pretty easy to get confused by this,
> and think it's just an incorrect estimate.
>
>
> summary
> -------
>
> I like the idea of pushing filters down to the scan nodes (or perhaps
> even to some other intermediate nodes). But maybe it's too incompatible
> with our bottom-up planning, and the issues with costing and/or EXPLAIN
> output may be impossible to solve. I wonder what others think.
>
>
> Now that I revisited the older threads, I think it probably makes sense
> with using Bloom filters in the hash join, at least in the two cases
> mentioned in the first section. It doesn't have the issues with
> bottom-up planning/costing, because it happens in the hash join. And the
> issues with that (deciding what fractions are OK, sizing the filter,
> ...) apply to both that simpler case, and to the push-down.
>


Bloom filters have two rather different roles here.

For a local Hash Join optimization, Bloom does not require any particular
physical ordering of the heap. It can be useful simply when the join is
selective enough, or when batching/spilling makes failed probes expensive:
the Bloom filter rejects many outer tuples before a full hash-table probe
or before writing them to temporary batches.

But once we talk about pushing a runtime filter down to the scan/storage
layer, the physical preconditions become crucial. To get more than a cheap
per-row check, the scan must have something coarse-grained to skip:
partitions, row groups, chunks, block ranges, dictionaries, min/max
metadata, BRIN-like summaries, etc. Without that, the filter is still
correct, but the benefit is mostly CPU/probe reduction rather than avoiding
data production.

So for me the most interesting part of this thread is not Bloom itself, but
the architectural idea: pushing runtime knowledge down to the scan node,
against the normal direction of data flow. The build side of a join
produces compact knowledge about admissible keys, and lower layers may use
it before rows are materialized and sent upward.

I saw this in my own experiments with zone/chunk-oriented storage for
Postgres: static predicates could prune zones nicely, but joins were the
hard case because the useful filtering knowledge was produced above the
scan. A runtime semi-join filter pushed from the Hash Join build side into
the scan could turn join-derived knowledge into scan-level pruning.

For example:

  SELECT sum(e.cost)
  FROM events e
  JOIN accounts a ON e.account_id = a.id
  WHERE a.region = 'NP'; -- Nepal

The events scan does not know which account_id values are EU accounts. That
knowledge is produced above it, on the build side of the join. A runtime
semi-join filter pushed from the Hash Join build side down into the events
scan could let the scan reject impossible account_id values before
producing tuples.

For a plain heap scan this may mostly save hash probes. But with
zone/chunk-oriented storage, where chunks have dictionaries, min/max
metadata, Bloom summaries, or tenant ranges, the same runtime filter can
skip whole chunks. That is the part I find most interesting: turning
join-derived knowledge into scan-level pruning, against the normal
direction of data flow.

Bloom is just one carrier for that knowledge. The real feature is a
pluggable runtime-filter mechanism that heap, CustomScan, FDW,
columnar/table AMs, partitioned storage, or chunk/cold storage can consume
at the level they understand.

This may be a topic for a separate thread, because it quickly becomes less
about Hash Join Bloom filters and more about runtime knowledge pushdown
into storage.


> regards
>
>
> [1]
> https://www.postgresql.org/message-id/5670946E.8070705%402ndquadrant.com
>
> [2]
>
> https://www.postgresql.org/message-id/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com
>
> --
> Tomas Vondra
>

Re: hashjoins vs. Bloom filters (yet again)

Reply via email to