I wrote:
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway.  The multiple
> would represent the rate of hash-code collisions we expect.
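Concretely (using the variable names in final_cost_hashjoin), that means replacing the current charge

	hash_qual_cost.per_tuple * outer_path_rows *
		clamp_row_est(inner_path_rows * innerbucketsize) * 0.5

with something like

	hash_qual_cost.per_tuple * hashjointuples * k

for some small constant k; the quick hack below uses k = 2.0.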
I tried the attached quick-hack patch on Stephen's example.  With work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2574kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
 Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 9412kB
         ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
 Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.  I have not thought about how the calculation should be adjusted in the semi/anti join case, nor about how we ought to repurpose the bucket-size-variance calculations for checking whether work_mem will be exceeded.  So this is a long way from being committable, but it seems promising.

			regards, tom lane
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8d24902..48b21cb 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** final_cost_hashjoin(PlannerInfo *root, H
*** 2661,2685 ****
  	else
  	{
  		/*
- 		 * The number of tuple comparisons needed is the number of outer
- 		 * tuples times the typical number of tuples in a hash bucket, which
- 		 * is the inner relation size times its bucketsize fraction.  At each
- 		 * one, we need to evaluate the hashjoin quals.  But actually,
- 		 * charging the full qual eval cost at each tuple is pessimistic,
- 		 * since we don't evaluate the quals unless the hash values match
- 		 * exactly.  For lack of a better idea, halve the cost estimate to
- 		 * allow for that.
- 		 */
- 		startup_cost += hash_qual_cost.startup;
- 		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
- 			clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
- 
- 		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
  	}
  
  	/*
--- 2661,2692 ----
  	else
  	{
  		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
+ 
+ 		/*
+ 		 * We assume that the hash equality operators will be applied to twice
+ 		 * as many join pairs as actually pass the hashquals, that is there's
+ 		 * about one hash-value collision per tuple.  This is probably a
+ 		 * substantial overestimate, but it seems wise to be conservative.
+ 		 * Lacking any good model of the effectiveness of the hash functions,
+ 		 * it's hard to do much better than a constant ratio.  (XXX perhaps we
+ 		 * ought to charge more than this for very large relations, since the
+ 		 * hash values are only 32 bits wide and hence will suffer
+ 		 * proportionally more collisions when there are many rows.)
+ 		 *
+ 		 * Note that we don't charge anything for visiting hash bucket entries
+ 		 * that don't have a matching hash value.  The cost per skipped bucket
+ 		 * entry is not really zero, but it's a lot smaller than the cost of
+ 		 * qual eval, so we ignore it.  If you like you can think of the
+ 		 * probable overestimate of the hash qual cost as partially accounting
+ 		 * for this.
+ 		 */
+ 		startup_cost += hash_qual_cost.startup;
+ 		run_cost += hash_qual_cost.per_tuple * hashjointuples * 2.0;
  	}
  
  	/*
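To illustrate why the old formula misbehaves, here is a small standalone C sketch (not part of the patch) that plugs the row counts from the example above into both charging formulas.  The helper functions, the bucketsize fractions, and the skew scenario are illustrative guesses, not the planner's actual estimates; in reality innerbucketsize comes from estimate_hash_bucketsize().

	#include <stdio.h>

	/* old model: half the qual cost for every bucket entry visited */
	static double
	old_charge(double per_tuple, double outer_rows,
			   double inner_rows, double innerbucketsize)
	{
		return per_tuple * outer_rows * (inner_rows * innerbucketsize) * 0.5;
	}

	/* new model: full qual cost for ~2x the tuples expected to pass */
	static double
	new_charge(double per_tuple, double hashjointuples)
	{
		return per_tuple * hashjointuples * 2.0;
	}

	int
	main(void)
	{
		double	per_tuple = 0.0025;			/* default cpu_operator_cost */
		double	outer_rows = 4272271.0;		/* big_table probes the hash */
		double	inner_rows = 41176.0;		/* small_table is hashed */
		double	hashjointuples = 41176.0;	/* planner's join size estimate */

		/* evenly spread keys: 41176 rows in 8192 buckets */
		printf("old, even buckets:   %.0f\n",
			   old_charge(per_tuple, outer_rows, inner_rows, 1.0 / 8192));
		/* hypothetical skewed estimate: bucketsize fraction of 0.01 */
		printf("old, skewed buckets: %.0f\n",
			   old_charge(per_tuple, outer_rows, inner_rows, 0.01));
		/* the new model doesn't look at bucketsize at all */
		printf("new:                 %.0f\n",
			   new_charge(per_tuple, hashjointuples));
		return 0;
	}

With even buckets the old charge comes out around 26843 cost units, but with the hypothetical skewed estimate it balloons to about 2.2 million, dwarfing even the forced plan's total cost of 150327; that kind of sensitivity to innerbucketsize would push the planner toward hashing the 4.2M-row table.  The new charge is about 206 either way, since it depends only on the join size estimate, so both join orientations are charged alike and the (correctly large) cost of building the big hash table can dominate the decision.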