On Sat, May 30, 2026, 01:56 Tomas Vondra <[email protected]> wrote: > Hi, > > A random discussion at pgconf.dev made me revisit one of my ancient > patches, attempting to use Bloom filters to hash joins. I did work on > that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let > me briefly revisit that, before I get to the new patch. > > > old patches > ----------- > > Those old patches tried to do a fairly small thing during a hash join, > and that's building a Bloom filter on the inner relation (the one that > gets hashed), and then use that filter before probing the hash table. > > The benefits come from Bloom filters being (fairly) cheap, and a > negative answer (hash is not in the filter) may allows us to skip a much > more expensive operation. > > The old threads patches focused especially at two hash join cases: > > (a) A very selective join, i.e. a significant fraction of outer tuples > does not have a match in the hash table. > > (b) A selective hash join forced to do batching because the hash table > is too large, and thus forced to spill outer tuples to temporary files. > > For (a), the benefit comes from Bloom filters being much cheaper to > probe than a hash table. The exact cost depends on the implementation, > sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if > the filter discards 90% of tuples, it can be a big win. > > For (b), the filter (for all the batches at once) allows us to discard > some of the outer tuples without writing them to temporary files. Which > is way more expensive than probing a hash table. > > The patches got stuck mostly because deciding if it makes sense to > build/use the Bloom filter is somewhat hard. For cases where 100% of the > tuples have a match it's pointless - it's just pure cost, no benefit. > The regressions are relatively small, though (<10%). > > For (b) it's much less sensitive to this kind of issues, of course. The > cost of writing outer tuples to temporary files is much higher than > building/probing a Bloom filter. > > Clearly, a filter that discards 99% of tuples is great. And a filter > that keeps 99% of tuples is not great. But where exactly are the > thresholds is not quite clear. > > There's also a related question of sizing the filter. Bloom filters are > usually sized by specifying the number of distinct values and the > desired false positive rate. And we could try doing that - pick a > standard false positive rate (e.g. the built-in bloom_filter aims for > 1-2%), estimate the ndistinct, and get the size of the Bloom filter. > > However, chances are the filter is too big. We can't get work_mem, the > join is already using that for the hash table etc. We can maybe use a > fraction of it, and that may not be enough to fit the "perfect" filter. > We could bail out and not use any Bloom filter at all, but that seems a > bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK? > > Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then > using a "worse" Bloom filter with 10% false positives would be a win? > It'd still discard ~89% of tuples. > > Yet another angle leading to this kind of questions is inaccurate > ndistinct estimates (and we all know those estimates can be quite > unreliable). Let's say we size the filter for 1M distinct values (and it > just about fits into the memory budget), but then during execution we > find there are 2M distinct values. Well, now we may have ~10% false > positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%. > > At some point the filter stops being worth it, and we should either not > build it, or we should stop probing it. But when is that? > > I think we'd need some sort of cost model to make judgments about this. > > Anyway, this was just me summarizing the old threads, and what I think > got them stuck. Most of these questions are still open, although I think > we may be able to solve them better than we could ~10 years ago. We have > extended stats, we know about FK constraints during planning, ... > > > new patch > --------- > > Now let's talk about the new experimental/PoC patch that came from the > pgconf.dev discussions. It doesn't really solve the issues I just went > through, it's more of an attempt to take it one step further. > > One of the things mentioned in the 2018 thread was the possibility to > push the filter much deeper, instead of using it just in the hash join > node itself. It was merely discussed, but there was no code written, or > anything like that. But it's the thing I decided to take a stab at after > getting back from Vancouver. > > Consider a starjoin query > > SELECT + FROM f JOIN d1 (f.id1 = d1.id) > JOIN d2 (f.id2 = d2.id) > JOIN d2 (f.id3 = d3.id) > WHERE d1.x = 1 > AND d2.y = 2 > AND d3.z = 3; > > which will be planned using a left-deep plan like this one: > > HJ > / \ > D3 HJ > / \ > D2 HJ > / \ > D1 F > > With hashes on "D" tables, and a scan on "F". With the "old" patches, > each HJ node would use a Bloom filter internally. But there's an > interesting opportunity to "push down" the filters to the scan on "F", > and evaluate them right there, a bit as if the scan had a local qual. > > The attached patch implements a PoC of this, and it's pretty effective. > > Of course, it depends on the selectivity of the joins (and thus how many > tuples get discarded by the filters). But because it moves all the > "cheap" filter probes *before* probing any of the hash tables, it has a > multiplication effect for the benefits. > > Yes, it still has most of the open issues discussed earlier, and those > will need to be addressed. But this "multiplication" may also make it > somewhat less sensitive to the regressions. > > In the example above, if each of the 3 joins has 20% selectivity (i.e. > 20% tuples go through), then the total selectivity is ~1%. So the "F" > scan produces only 1/100 of tuples. Maybe we got one of the joins wrong, > and it does not eliminate any tuples? That still means the overall > selectivity is only ~4%. > > Of course, this only works for larger joins, and maybe the joins are > correlated in some weird way, etc. Also, what does 4% selectivity mean > for the overall query duration? > > Attached is a PDF with results from a simple benchmark using joins like > the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a > couple GUCs to eliminate variations in the plan. The dimension joins are > independent and match a variable fraction of the fact (1% - 100%). > > The columns are for three branches - master, and "patched" with the > push-down disabled and enabled, for joins with 1-3 dimensions. > > The last two column groups are comparing the "patched" results to > master. With "off" there's no difference (other than random noise), just > as expected. But with the push-down enabled, there are fairly > significant speedups (up to ~3x). Of course, this is just a benchmark, > practical queries may do other stuff, making the gains smaller. OTOH, it > may also be much better, if there are expensive nodes in between. > > > The PoC patch is not very big or complex. 280KB seems like a lot, but > like 99% of that is changes in test output, because the patch adds some > info about the Bloom filters to EXPLAIN. The actual .c changes are only > ~1000 lines, and a half of that is comments. > > The most interesting stuff happens in create_hashjoin_plan(), where we > attempt to push-down the filter to a scan in the outer subtree. If that > succeeds, then ExecInitHashJoin initializes the filter so that the scan > can find it, and Hash builds the filter along with the hash table. And > then the scan nodes probe the pushed-down filter in ExecScanExtended(). > > There's bunch of boilerplate so that setrefs does the right thing with > expressions, etc. But it's a couple lines here and there. I'm actually > surprised how little code this is. > > There's one detail I haven't mentioned yet - there's a simple adaptive > behavior, to deal with filters that are not selective enough. Per some > initial tests there's little benefit when the filter keeps >75% tuples, > and for >90% there were measurable regressions (~50%). This was very > consistent for different data types, etc. > > So the patch tracks number of matching tuples per 1000 probes, when it > exceeds 90% it switches to sampling. Only 1% of tuples gets probed in > the filter, and if the fraction drops <80%, all the tuples get probed > again. This is very simple, needs more thought. But for the purpose of > the testing it worked quite well. There still is a small regression > (~3%), which I assume is due to building the filter. > > > Aside from the issues with deciding if to use a filter at all, sizing > it, etc. - which are still valid (even with the adaptive thing), and > need to be solved, there's one more annoying issue specific to this new > push-down stuff. > > > Earlier, I mentioned the push-down happens in create_hashjoin_plan(). > Which means it happens *after* planning and costing. There are reasons > for that, but it has some unfortunate & annoying consequences. > > Ideally, we'd know about the filters when constructing the scan nodes, > so we'd have a chance to estimate how many tuples will be eliminated by > probing the filters (which is about the same thing as estimating the > join sizes). But we can't do that, because our planner works bottom-up. > When constructing the scan nodes we know which tables we'll join with, > but we have no idea which of the join algorithms we'll pick. > > We'll consider all three join types, and the scan node has no say which > of those will win. But the Bloom filter push-down is specific to hash > joins. So what should the scan node do? Either it can assume it's under > hash join (and set rows/cost as if there's a Bloom filter), or it can > set costs in a join-agnostic way (like now). > > The only "correct" way I can think of dealing with this in the bottom-up > world is having two sets of paths - one set for a hash join, one set for > other joins. But that's not just for scans. We'd need that for all > paths, and for different combinations of joins. For the query with 3 > joins, we'd end up with 2^3 combinations. That seems not great. > > > So I tend to see this as an opportunistic optimization. We do the > planning assuming there's no Bloom filter push-down, and then after the > fact we see if there's an opportunity after all. Which means we may not > pick a plan with hash joins, not realizing it might be made faster. > > But in my mind that's somewhat acceptable / defensible. > > The bigger issue for me is that it may make the EXPLAIN ANALYZE output > way harder to understand. The estimated "rows" are calculated before the > filter push-down happens, while the actual "rows" are with the filter > probing, of course. But it seems pretty easy to get confused by this, > and think it's just an incorrect estimate. > > > summary > ------- > > I like the idea of pushing filters down to the scan nodes (or perhaps > even to some other intermediate nodes). But maybe it's too incompatible > with our bottom-up planning, and the issues with costing and/or EXPLAIN > output may be impossible to solve. I wonder what others think. > > > Now that I revisited the older threads, I think it probably makes sense > with using Bloom filters in the hash join, at least in the two cases > mentioned in the first section. It doesn't have the issues with > bottom-up planning/costing, because it happens in the hash join. And the > issues with that (deciding what fractions are OK, sizing the filter, > ...) apply to both that simpler case, and to the push-down. >
Bloom filters have two rather different roles here. For a local Hash Join optimization, Bloom does not require any particular physical ordering of the heap. It can be useful simply when the join is selective enough, or when batching/spilling makes failed probes expensive: the Bloom filter rejects many outer tuples before a full hash-table probe or before writing them to temporary batches. But once we talk about pushing a runtime filter down to the scan/storage layer, the physical preconditions become crucial. To get more than a cheap per-row check, the scan must have something coarse-grained to skip: partitions, row groups, chunks, block ranges, dictionaries, min/max metadata, BRIN-like summaries, etc. Without that, the filter is still correct, but the benefit is mostly CPU/probe reduction rather than avoiding data production. So for me the most interesting part of this thread is not Bloom itself, but the architectural idea: pushing runtime knowledge down to the scan node, against the normal direction of data flow. The build side of a join produces compact knowledge about admissible keys, and lower layers may use it before rows are materialized and sent upward. I saw this in my own experiments with zone/chunk-oriented storage for Postgres: static predicates could prune zones nicely, but joins were the hard case because the useful filtering knowledge was produced above the scan. A runtime semi-join filter pushed from the Hash Join build side into the scan could turn join-derived knowledge into scan-level pruning. For example: SELECT sum(e.cost) FROM events e JOIN accounts a ON e.account_id = a.id WHERE a.region = 'NP'; -- Nepal The events scan does not know which account_id values are EU accounts. That knowledge is produced above it, on the build side of the join. A runtime semi-join filter pushed from the Hash Join build side down into the events scan could let the scan reject impossible account_id values before producing tuples. For a plain heap scan this may mostly save hash probes. But with zone/chunk-oriented storage, where chunks have dictionaries, min/max metadata, Bloom summaries, or tenant ranges, the same runtime filter can skip whole chunks. That is the part I find most interesting: turning join-derived knowledge into scan-level pruning, against the normal direction of data flow. Bloom is just one carrier for that knowledge. The real feature is a pluggable runtime-filter mechanism that heap, CustomScan, FDW, columnar/table AMs, partitioned storage, or chunk/cold storage can consume at the level they understand. This may be a topic for a separate thread, because it quickly becomes less about Hash Join Bloom filters and more about runtime knowledge pushdown into storage. > regards > > > [1] > https://www.postgresql.org/message-id/5670946E.8070705%402ndquadrant.com > > [2] > > https://www.postgresql.org/message-id/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com > > -- > Tomas Vondra >
