Hi, Andrei!

On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepi...@gmail.com> wrote:
> On 7/8/24 19:45, Andrei Lepikhov wrote:
> > On 3/11/2023 23:43, Tomas Vondra wrote:
> >> On 9/11/23 10:04, Lepikhov Andrei wrote:
> >>
> >>     * Determine bucketsize fraction and MCV frequency for the inner
> >>     * relation.  We use the smallest bucketsize or MCV frequency
> >>     * estimated for any individual hashclause; this is undoubtedly
> >>     * conservative.
> >>
> >> I'm sure this may lead to inflated cost for "good" cases (where the
> >> actual bucket size really is a product), which may push the optimizer
> >> to use the less efficient/slower join method.
> > Yes, it was a contradictory idea, though.
> >> IMHO the only principled way forward is to get a better ndistinct
> >> estimate (which this implicitly does), perhaps by using extended
> >> statistics.  I haven't tried, but I guess it'd need to extract the
> >> clauses for the inner side, and call estimate_num_groups() on it.
> > And I've done it.  Sorry for the long response time.  This patch
> > employs extended statistics for estimation of the HashJoin
> > bucket_size.  In addition, I describe the idea in a more convenient
> > form here [1].  Obviously, it needs only ndistinct to make a
> > prediction, which allows us to reduce the computational cost of this
> > statistic.
> Minor change to make cfbot happier.

Thank you for your work on this subject.  I agree with the general
direction.  Conservative estimates have been in use for a long time, so
it's better to change them only when we're sure about it.  However, I'm
still not sure I get the conservatism here:

    if (innerbucketsize > thisbucketsize)
        innerbucketsize = thisbucketsize;
    if (innermcvfreq > thismcvfreq)
        innermcvfreq = thismcvfreq;

AFAICS, even in the worst case (all columns are totally correlated),
the overall bucket size should be the smallest bucket size among the
clauses (not the largest), and the same is true for the MCV frequency.
As a mental experiment, we can add a new clause to a hash join that is
always true because the columns on both sides have the same values.
In fact, it would have almost no influence except for the cost of
extracting the additional columns and executing the additional
operators.  But in the current model, this additional clause would
completely ruin thisbucketsize and thismcvfreq, making the hash join
extremely unappealing.  So, should we revise this to calculate the
minimum instead of the maximum?
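To make the mental experiment concrete, here is a hypothetical
illustration (the table and column names are made up):

    -- flag contains the single value 1 on both sides, so the extra
    -- clause below is always true and filters nothing.  Considered in
    -- isolation, though, its bucketsize and MCV frequency estimates
    -- are both 1.0.
    CREATE TABLE t1 AS SELECT i AS id, 1 AS flag
        FROM generate_series(1, 100000) i;
    CREATE TABLE t2 AS SELECT i AS id, 1 AS flag
        FROM generate_series(1, 100000) i;
    ANALYZE t1;
    ANALYZE t2;

    EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;
    EXPLAIN SELECT * FROM t1 JOIN t2
        ON t1.id = t2.id AND t1.flag = t2.flag;

Both queries return the same rows, but the second one carries a clause
whose standalone estimates are maximally pessimistic.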
I've slightly revised the patch: I've run pgindent over it and renamed
s/saveList/origin_rinfos/g for better readability.

Also, the patch badly needs regression test coverage.  We can't include
costs in expected outputs, but there could be some plans that previously
were reliably merge joins and now reliably become hash joins.
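For example, something along these lines might work (only a sketch; the
data shape would need experimentation to make the plan choice stable
across platforms):

    -- a and b each have 100 distinct values, while the pair (a, b) has
    -- 1000 distinct combinations: more than either column alone, yet
    -- far fewer than the 100 * 100 a naive product would suggest.  An
    -- ndistinct statistic captures the real number.
    CREATE TABLE tbl AS
        SELECT (i % 1000) % 100 AS a, (i % 1000) / 10 AS b
        FROM generate_series(0, 9999) i;
    CREATE STATISTICS tbl_ab (ndistinct) ON a, b FROM tbl;
    ANALYZE tbl;

    -- COSTS OFF keeps the expected output free of raw cost numbers
    EXPLAIN (COSTS OFF)
    SELECT * FROM tbl t1 JOIN tbl t2 ON t1.a = t2.a AND t1.b = t2.b;

With the ndistinct statistic, the combined bucket size estimate should
drop by an order of magnitude compared to the per-clause minimum, which
might be enough to flip a borderline plan from merge join to hash join.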
------
Regards,
Alexander Korotkov
Supabase

v2-0001-Use-extended-statistics-for-precise-estimation-of.patch