I wrote:
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway.  The multiple
> would represent the rate of hash-code collisions we expect.
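In cost-model terms, that amounts to replacing the per-bucket-entry charge with
a per-expected-match charge, something like this (a sketch only; hashjointuples
and hash_qual_cost follow the names in final_cost_hashjoin, while
collision_multiple stands for whatever small constant we settle on):

	/* charge qual evaluation only for expected matches, scaled up by an
	 * assumed rate of hash-value collisions (a small constant multiple) */
	run_cost += hash_qual_cost.per_tuple * hashjointuples * collision_multiple;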

I tried the attached quick-hack patch on Stephen's example.  With
work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2574kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
 Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 9412kB
         ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
 Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.
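
For a rough sense of how the old and new charges compare at these row counts,
here's a toy calculation (illustrative only: cpu_operator_cost is the default
0.0025, and innerbucketsize is a made-up value, not whatever the planner
actually estimated for small_table):

	/*
	 * Illustrative comparison of the old and new hashquals charges, using
	 * the row counts from the first plan above.  innerbucketsize is made up
	 * for the example; cpu_operator_cost is the default GUC value.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		double	cpu_operator_cost = 0.0025;	/* default GUC value */
		double	outer_path_rows = 4272271;	/* big_table */
		double	inner_path_rows = 41176;	/* small_table (hashed) */
		double	innerbucketsize = 0.00012;	/* assumed, roughly 1/8192 */
		double	hashjointuples = 41176;		/* planner's join-size estimate */

		/* old: charge for every expected bucket entry visited, halved */
		double	old_charge = cpu_operator_cost * outer_path_rows *
			(inner_path_rows * innerbucketsize) * 0.5;

		/* new: charge only for expected matches, doubled for collisions */
		double	new_charge = cpu_operator_cost * hashjointuples * 2.0;

		printf("old hashquals charge: %.0f\n", old_charge);	/* ~26000 */
		printf("new hashquals charge: %.0f\n", new_charge);	/* ~200 */
		return 0;
	}

The point is only the orders of magnitude: the old charge scales with the 4.27M
outer probes times the expected bucket contents, while the new one scales with
the expected match count, so hashing the small table no longer gets penalized
for the size of the outer relation.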

I have not thought about how the calculation should be adjusted in the
semi/anti join case, nor about how we ought to repurpose the
bucket-size-variance calculations for checking whether work_mem will be
exceeded.  So this is a long way from being committable, but it seems
promising.

                        regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8d24902..48b21cb 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2661,2685 ****
  	else
  	{
  		/*
- 		 * The number of tuple comparisons needed is the number of outer
- 		 * tuples times the typical number of tuples in a hash bucket, which
- 		 * is the inner relation size times its bucketsize fraction.  At each
- 		 * one, we need to evaluate the hashjoin quals.  But actually,
- 		 * charging the full qual eval cost at each tuple is pessimistic,
- 		 * since we don't evaluate the quals unless the hash values match
- 		 * exactly.  For lack of a better idea, halve the cost estimate to
- 		 * allow for that.
- 		 */
- 		startup_cost += hash_qual_cost.startup;
- 		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
- 			clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
- 
- 		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
  	}
  
  	/*
--- 2661,2692 ----
  	else
  	{
  		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
+ 
+ 		/*
+ 		 * We assume that the hash equality operators will be applied to twice
+ 		 * as many join pairs as actually pass the hashquals, that is there's
+ 		 * about one hash-value collision per tuple.  This is probably a
+ 		 * substantial overestimate, but it seems wise to be conservative.
+ 		 * Lacking any good model of the effectiveness of the hash functions,
+ 		 * it's hard to do much better than a constant ratio.  (XXX perhaps we
+ 		 * ought to charge more than this for very large relations, since the
+ 		 * hash values are only 32 bits wide and hence will suffer
+ 		 * proportionally more collisions when there are many rows.)
+ 		 *
+ 		 * Note that we don't charge anything for visiting hash bucket entries
+ 		 * that don't have a matching hash value.  The cost per skipped bucket
+ 		 * entry is not really zero, but it's a lot smaller than the cost of
+ 		 * qual eval, so we ignore it.  If you like you can think of the
+ 		 * probable overestimate of the hash qual cost as partially accounting
+ 		 * for this.
+ 		 */
+ 		startup_cost += hash_qual_cost.startup;
+ 		run_cost += hash_qual_cost.per_tuple * hashjointuples * 2.0;
  	}
  
  	/*