Re: [HACKERS] pg9.6 segfault using simple query (related to use fk for join estimates)

Tomas Vondra Tue, 03 May 2016 07:10:58 -0700

Hi,

On 04/30/2016 07:35 PM, Tom Lane wrote:

Julien Rouhaud <[email protected]> writes:

On 29/04/2016 18:05, Tom Lane wrote:

Julien Rouhaud <[email protected]> writes:

The segfault is caused by quals_match_foreign_key() calling get_leftop()
and get_rightop() on a ScalarArrayOpExpr node.


I'm not sure that assuming this compatibility is the right way to fix
this though.

It certainly isn't.

Agreed. New attached patch handles explicitly each node tag.


No, this is completely nuts.  The problem is that quals_match_foreign_key
is simply assuming that every clause in the list it's given is an OpExpr,
which is quite wrong.  It should just ignore non-OpExpr quals, since it
cannot do anything useful with them anyway.  There's a comment claiming
that non-OpExpr quals were already rejected:

             * Here since 'usefulquals' only contains bitmap indexes for quals
             * of type "var op var" we can safely skip checking this.

but that doesn't appear to have anything to do with current reality.

Yes, I agree - there should be a check that the node is OpExpr in quals_match_foreign_key. This is clearly a bug :-(


While this in itself is about a two-line fix, after reviewing
137805f89acb3611 I'm pretty unhappy that it got committed at all.
I think this obvious bug is a good sign that it wasn't ready.
Other unfinished aspects like invention of an undocumented GUC
don't leave a good impression either.

The GUC is meant to make testing during development easier. I haven't realized it got committed, but I assume Simon plans to remove it before the final release. Can't check right now with Simon, though, as he's is out of office this week.

In any case, I do agree the GUC should have been documented better, and we should probably put it on the TODO so that it gets removed before the actual release.


Moreover, it looks to me like this will add quite a lot of overhead,
probably far more than is justified, because
clauselist_join_selectivity is basically O(N^2) in the
relation-footprint of the current join --- and not with a real small
multiplier, either, as the functions it calls contain about four
levels of nested loops themselves. Maybe that's unmeasurable on
trivial test cases but it's going to be disastrous in large queries,
or for relations with large numbers of foreign keys.

That depends on a lot of factors, I guess. Attached are two SQL scripts that create 5 tables (f1, f2, f3, f4, f5) and then create N foreign keys on each of them. The foreign keys reference other tables, which means the loops won't terminate early (after finding the first match) when joining the first 5 tables, which makes this the worst case. And then we join different numbers of tables (2, 3, 4 or 5) and do explain analyze to measure planning time.

The first script (fk-test.sql) does joins on 2 columns, fk-test-6.sql does joins on 6 columns (so that we also know how the number of columns

affects the planning time).

Sum of planning time for 100 queries (in milliseconds) with 2 columns, where "2/on" means "join on 2 tables with enable_fk_estimates=on":


N       2/on    2/off   3/on    3/off   4/on    4/off   5/on    5/off
10      6       4       9       8       22      20      77      68
100     15      11      26      19      64      36      177     93
1000    97      84      217     133     516     233     1246    342

and comparison of the timings:

        2       3       4       5
10      155%    115%    114%    113%
100     142%    138%    177%    190%
1000    116%    163%    221%    364%

And with the 6 columns:

        2/on    2/off   3/on    3/off   4/on    4/off   5/on    5/off
10      25      7       23      20      96      82      344     297
100     21      14      65      33      233     104     735     328
1000    151     99      492     153     1603    320     4627    593

and comparison:

        2       3       4       5
10      371%    120%    117%    116%
100     149%    196%    224%    224%
1000    152%    322%    502%    780%

So yes, the overhead is clearly measurable, no doubt about that.


I think this should be reverted and pushed out to 9.7 development.
It needs a significant amount of rewriting to fix the performance
issue, and now's not the time to be doing that.

There are probably a few reasonably simple things we could do - e.g. ignore foreign keys with just a single column, as the primary goal of the patch is improving estimates with multi-column foreign keys. I believe that covers a vast majority of foreign keys in the wild.

If that's deemed insufficient, we'll have to resort to a more complex improvement, perhaps something akin to the cache proposed in in the unijoin patch. But if that's required, that's 9.7 material.



regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

fk-test-6.sql
Description: application/sql

fk-test.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg9.6 segfault using simple query (related to use fk for join estimates)

Reply via email to