On Fri, 2013-03-29 at 16:37 -0400, Tom Lane wrote: 
> Jeff Davis <pg...@j-davis.com> writes:
> > Yes, I have run into this issue (or something very similar). I don't
> > understand why the bucketsize even matters much -- assuming few hash
> > collisions, we are not actually evaluating the quals any more times than
> > necessary. So why all of the hashjoin-specific logic in determining the
> > number of qual evaluations? The only reason I can think of is to model
> > the cost of comparing the hashes themselves.
> I think the point is that there may *not* be few hash collisions ...

In Stephen's case the table was only 41KB, so something still seems off.
Maybe we should model the likelihood of a collision based on the
cardinalities (assuming a reasonably good hash function)?

Also, I think I found an important assumption that seems dubious (in
comment for estimate_hash_bucketsize()):

"If the other relation in the join has a key distribution similar to
this one's, then the most-loaded buckets are exactly those that will be
probed most often.  Therefore, the "average" bucket size for costing
purposes should really be taken as something close to the "worst case"
bucket size.  We try to estimate this by adjusting the fraction if there
are too few distinct data values, and then scaling up by the ratio of
the most common value's frequency to the average frequency."

But the key distribution is not necessarily similar at all... the large
table might have many more distinct values.

Stephen, do you think this could explain your problem?

        Jeff Davis

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to