David Rowley <mailto:david.row...@2ndquadrant.com>
November 14, 2015 at 12:32 AM
The problem is that the optimizer is unable to use hash join or merge
joins when you have the IN() condition as the join condition, the
reason for this is that you're effectively saying to join on any of 3
conditions: settings.owner_id = mid.id1 OR settings.owner_id = mid.id2
OR settings.owner_id = mid.id3. If you think how a hash join works
then 1 hash table is no good here, as you effectively have 3 possible
keys for the hash table, the executor would have to build 3 tables to
make that possible, but we only ever build 1 in PostgreSQL. As you may
know, a hash table is a very efficient data structure for key value
lookups, but there can only be a single key. Merge join has the same
problem because it's only possible to have a single sort order.
Thanks, that's the key thing I was missing. I was expecting it to see
that there were exactly three conditions (as opposed to a variable
number) and evaluate "build 3 hash tables" as a possible execution plan.
Knowing that it doesn't do that completely explains the behavior I'm seeing.
It is puzzling that if, as suggested by someone else in the thread, I
expand IN(a,b,c) to (x = a OR x = b OR x = c) it gets substantially
faster, though still obviously falls afoul of the problem you describe
above (~4 seconds instead of ~6 seconds). Should those two be equivalent?
-Steve