On Oct 26, 2010, at 8:48 PM, Tom Lane wrote:

> Robert Haas <robertmh...@gmail.com> writes:
>> I'm also a bit suspicious of the fact that the hash condition has a
>> cast to text on both sides, which implies, to me anyway, that the
>> underlying data types are not text.  That might mean that the query
>> planner doesn't have very good statistics, which might mean that the
>> join selectivity estimates are wackadoo, which can apparently cause
>> this problem:
> 
> Um ... you're guilty of the same thing as the OP, ie not showing how
> you got this example.  But I'm guessing that it was something like
> 
> create table little as select * from generate_series(1,10) a;
> create table big as select * from generate_series(1,100000) a;
> ... wait for auto-analyze of big ...
> explain select * from little, big where little.a = big.a;
> 
> Here, big is large enough to prod autovacuum into analyzing it,
> whereas little isn't.  So when the planner runs, it sees
> 
> (1) big is known to have 100000 rows, and big.a is known unique;
> (2) little is estimated to have many fewer rows, but nothing is
>    known about the distribution of little.a.
> 
> In this situation, it's going to prefer to hash big, because hash join
> behaves pretty nicely when the inner rel is uniformly distributed and
> the outer not, but not nicely at all when it's the other way round.
> It'll change its mind as soon as you analyze little, but it doesn't
> like taking a chance on an unknown distribution.  See cost_hashjoin
> and particularly estimate_hash_bucketsize.

The type of hash table will make a difference in how it behaves with skew: 
open addressing versus chaining, etc.

> 
> I'm not convinced this explains Scott's results though --- the numbers
> he's showing don't seem to add up even if you assume a pretty bad
> distribution for the smaller rel.

Answering both messages partially:

The join is on varchar columns.  So no, the underlying types are not text; the 
columns are cast to ::text because the two sides have slightly different 
varchar declarations.
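
As a minimal illustration (invented names; the real schema differs), two 
varchar columns with different length limits compare through text equality, so 
the hash condition shows ::text on both sides:

    CREATE TABLE a (name varchar(32));
    CREATE TABLE b (name varchar(64));

    EXPLAIN SELECT * FROM a JOIN b ON a.name = b.name;
    -- expect something like: Hash Cond: ((a.name)::text = (b.name)::text)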

The small relation in this case is unique on the key, but I think postgres 
doesn't know that.  There is a unique index on (id, name), and the query 
filters for id = 12 in the example, leaving name unique within the result; 
postgres just does not identify this as a unique condition for the key. 
However, there are only about 150 distinct values of id, so even if 'name' 
collides sometimes across ids, no more than 150 rows can map to one key.
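
A sketch of that situation (hypothetical names and sizes, not the actual 
schema):

    CREATE TABLE small_rel (
        id      integer NOT NULL,      -- only ~150 distinct values
        name    varchar(64) NOT NULL,
        payload text
    );
    CREATE UNIQUE INDEX small_rel_id_name_idx ON small_rel (id, name);

    -- With this filter, name is unique within the result, but the planner
    -- does not deduce per-key uniqueness from the two-column index:
    SELECT * FROM small_rel WHERE id = 12;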

I gave a partial plan; the parent joins are sometimes anti-joins.  The general 
query form is two statements run against the temp table:
An update to a main table where the unique index keys match the temp table 
(inner join).
An insert into the main table where the unique index keys do not exist (NOT 
EXISTS query, which results in an anti-join).
The large relation is the one being inserted/updated into the main table.
Both have the same subplan that takes all the time in the middle -- a hash that 
hashes the large table and probes from the small side.  I am still completely 
confused about how the hashjoin is calculating costs.  In one of my examples it 
seems to be way off.
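
In other words, something of this shape (invented names; the real queries 
differ):

    -- Update rows whose unique-index keys already exist (inner join):
    UPDATE main_table m
    SET    payload = t.payload
    FROM   temp_table t
    WHERE  m.id = t.id AND m.name = t.name;

    -- Insert rows whose keys are absent; planned as an anti-join:
    INSERT INTO main_table (id, name, payload)
    SELECT t.id, t.name, t.payload
    FROM   temp_table t
    WHERE  NOT EXISTS (
        SELECT 1 FROM main_table m2
        WHERE  m2.id = t.id AND m2.name = t.name
    );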


Why does hashjoin behave poorly when the inner relation is not uniformly 
distributed and the outer is?  In particular, for an anti-join this should be 
the best-case scenario.

My assumption is this:  imagine the worst possible skew on the small table. 
Every row has the same key, so there are 20,000 entries in one list under one 
hash key.  The large table is the outer relation, probes for every element, 
and is uniformly distributed: 20 million rows, one thousand for each of 20,000 
keys.  (A toy reproduction of this setup is sketched after the anti-join cases 
below.)
Inner Join:
  The outer values that match the key join against the list.  There is no 
faster way to join once the list is identified; it is essentially a nested 
loop per matching key.  1000 matches each output 20,000 rows.  If the hashjoin 
were reversed, the inner relation would be the large table and the outer 
relation the small one, and there would be 20,000 matches that each output 
1000 rows.
Semi-Join:
  The same probing happens, but each outer row is emitted at most once, no 
matter how many inner rows share its key.  If the relation that is always kept 
is the large one, it makes more sense to hash the small one.
Anti-Join:
  Two relations; I'll call one "select" and the other "not exists".  The 
"select" relation is the one we are keeping, but only if its key does not 
match any key in the "not exists" relation.
Case 1:  The large uniform table is "select" and the small one is "not exists" 
(sketched below).
  It is always optimal to have the small one as the inner hashed relation, no 
matter what the skew.  If the inner relation is the "not exists" table, only 
the keys need to be kept in the hash; the values and number of matches per key 
do not matter, so skew is irrelevant and actually makes it faster than an even 
distribution of keys.
Case 2:  The small relation is the "select".
  Using the large "not exists" relation as the inner relation works well, 
since only the existence of a key needs to be kept.  Hashing the small 
"select" relation also works if it is small enough: whenever a key matches 
from the outer relation, remove it from the hash, then at the end take 
whatever remains in the hash as the result.  This is also generally immune to 
skew.
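
Going back to the worst-case skew scenario, a toy reproduction (names and 
sizes are invented; scale them down to taste):

    -- Small inner relation: 20,000 rows, all sharing one join key.
    CREATE TABLE small_skewed AS
        SELECT 1 AS key, g AS val FROM generate_series(1, 20000) g;
    -- Large uniform outer relation: 1,000 rows for each of 20,000 keys.
    CREATE TABLE big_uniform AS
        SELECT (g % 20000) + 1 AS key, g AS val
        FROM generate_series(1, 20000000) g;
    ANALYZE small_skewed;
    ANALYZE big_uniform;

    EXPLAIN SELECT * FROM big_uniform b JOIN small_skewed s USING (key);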
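
And Case 1 of the anti-join, which I would expect to plan as a hash anti-join 
with the small "not exists" relation hashed (again, invented names):

    CREATE TABLE big_select AS
        SELECT g AS key FROM generate_series(1, 1000000) g;
    CREATE TABLE small_not_exists AS
        SELECT g AS key FROM generate_series(1, 20000) g;
    ANALYZE big_select;
    ANALYZE small_not_exists;

    EXPLAIN SELECT * FROM big_select b
    WHERE NOT EXISTS (SELECT 1 FROM small_not_exists s WHERE s.key = b.key);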

Am I missing something?  Is the hash that is built storing each tuple in an 
open-addressed entry, and thus sensitive to key-value skew, or something like 
that?  If the hash has only one entry per key, with a linked list of the 
values that match the key, I don't see how skew is a factor for hashjoin.  I 
am probably missing something.


> 
>                       regards, tom lane

