Hi, Andrei!

On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepi...@gmail.com> wrote:
> On 7/8/24 19:45, Andrei Lepikhov wrote:
> > On 3/11/2023 23:43, Tomas Vondra wrote:
> >> On 9/11/23 10:04, Lepikhov Andrei wrote:
> >>
> >>     * Determine bucketsize fraction and MCV frequency for the inner
> >>     * relation.  We use the smallest bucketsize or MCV frequency
> >>     * estimated for any individual hashclause; this is undoubtedly
> >>     * conservative.
> >>
> >> I'm sure this may lead to inflated cost for "good" cases (where the
> >> actual bucket size really is a product), which may push the optimizer
> >> to use the less efficient/slower join method.
> > Yes, it was a contradictory idea, though.
> >> IMHO the only principled way forward is to get a better ndistinct
> >> estimate (which this implicitly does), perhaps by using extended
> >> statistics.  I haven't tried, but I guess it'd need to extract the
> >> clauses for the inner side, and call estimate_num_groups() on it.
> > And I've done it.  Sorry for the long response time.  This patch
> > employs extended statistics for estimation of the HashJoin
> > bucket_size.  In addition, I describe the idea in a more convenient
> > form here [1].  Obviously, it needs only ndistinct to make a
> > prediction, which allows us to reduce the computational cost of this
> > statistic.
> Minor change to make cfbot happier.

Thank you for your work on this subject.  I agree with the general
direction.  Conservative estimates have been in use for a long time, so
it's better to change them only when we're sure about it.  However, I'm
still not sure I get the conservatism here:

    if (innerbucketsize > thisbucketsize)
        innerbucketsize = thisbucketsize;
    if (innermcvfreq > thismcvfreq)
        innermcvfreq = thismcvfreq;

AFAICS, even in the worst case (all columns are totally correlated),
the overall bucket size should be the smallest bucket size among the
clauses (not the largest), and the same is true for the MCV frequency.
As a mental experiment, we can add a new clause to a hash join that is
always true because the columns on both sides have the same values.
In fact, it would have almost no influence except for the cost of
extracting the additional columns and executing the additional
operators.  But in the current model, this additional clause would
completely ruin thisbucketsize and thismcvfreq, making the hash join
extremely unappealing.  So, should we revise this to calculate the
minimum instead of the maximum?
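To make the mental experiment concrete, here is a hypothetical
illustration (the table and column names are made up):

    -- flag contains the single value 1 on both sides, so the extra
    -- clause below is always true and filters nothing.  Considered in
    -- isolation, though, its bucketsize and MCV frequency estimates
    -- are both 1.0.
    CREATE TABLE t1 AS SELECT i AS id, 1 AS flag
        FROM generate_series(1, 100000) i;
    CREATE TABLE t2 AS SELECT i AS id, 1 AS flag
        FROM generate_series(1, 100000) i;
    ANALYZE t1;
    ANALYZE t2;

    EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;
    EXPLAIN SELECT * FROM t1 JOIN t2
        ON t1.id = t2.id AND t1.flag = t2.flag;

Both queries return the same rows, but the second one carries a clause
whose standalone estimates are maximally pessimistic.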
I've slightly revised the patch: I've run pgindent over it and renamed
s/saveList/origin_rinfos/g for better readability.

Also, the patch badly needs regression test coverage.  We can't include
costs in expected outputs, but there could be some plans that previously
were reliably merge joins and now reliably become hash joins.
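For example, something along these lines might work (only a sketch; the
data shape would need experimentation to make the plan choice stable
across platforms):

    -- a and b each have 100 distinct values, while the pair (a, b) has
    -- 1000 distinct combinations: more than either column alone, yet
    -- far fewer than the 100 * 100 a naive product would suggest.  An
    -- ndistinct statistic captures the real number.
    CREATE TABLE tbl AS
        SELECT (i % 1000) % 100 AS a, (i % 1000) / 10 AS b
        FROM generate_series(0, 9999) i;
    CREATE STATISTICS tbl_ab (ndistinct) ON a, b FROM tbl;
    ANALYZE tbl;

    -- COSTS OFF keeps the expected output free of raw cost numbers
    EXPLAIN (COSTS OFF)
    SELECT * FROM tbl t1 JOIN tbl t2 ON t1.a = t2.a AND t1.b = t2.b;

With the ndistinct statistic, the combined bucket size estimate should
drop by an order of magnitude compared to the per-clause minimum, which
might be enough to flip a borderline plan from merge join to hash join.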
------
Regards,
Alexander Korotkov
Supabase

v2-0001-Use-extended-statistics-for-precise-estimation-of.patch