Re: estimation problems for DISTINCT ON with FDW

Tom Lane Thu, 02 Jul 2020 07:46:56 -0700

Etsuro Fujita <etsuro.fuj...@gmail.com> writes:
> On Wed, Jul 1, 2020 at 11:40 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Short of sending a whole second query to the remote server, it's
>> not clear to me how we could get the full table size (or equivalently
>> the target query's selectivity for that table).  The best we realistically
>> can do is to adopt pg_class.reltuples if there's been an ANALYZE of
>> the foreign table.  That case already works (and this proposal doesn't
>> break it).  The problem is what to do when pg_class.reltuples is zero
>> or otherwise badly out-of-date.


> In estimate_path_cost_size(), if use_remote_estimate is true, we
> adjust the rows estimate returned from the remote server, by factoring
> in the selectivity of the locally-checked quals.  I thought what I
> proposed above would be more consistent with that.

No, I don't think that would be very helpful.  There are really three
different numbers of interest here:

1. The actual total rowcount of the remote table.

2. The number of rows returned by the remote query (which is #1 times
the selectivity of the shippable quals).

3. The number of rows returned by the foreign scan (which is #2 times
the selectivity of the non-shippable quals)).

Clearly, rel->rows should be set to #3.  However, what we really want
for rel->tuples is #1.  That's because, to the extent that the planner
inspects rel->tuples at all, it's to adjust whole-table stats such as
we might have from ANALYZE.  What you're suggesting is that we use #2,
but I doubt that that's a big improvement.  In a decently tuned query
it's going to be a lot closer to #3 than to #1.

We could perhaps try to make our own estimate of the selectivity of the
shippable quals and then back into #1 from the value we got for #2 from
the remote server.  But that sounds mighty error-prone, so I doubt it'd
make for much of an improvement.  It also doesn't sound like something
I'd want to back-patch.

Another point here is that, to the extent we are relying on whole-table
stats from the last ANALYZE, pg_class.reltuples is actually the right
value to go along with that.  We could spend a lot of cycles doing
what I just suggested and end up with net-worse estimates.

In any case, the proposal I'm making is just to add a sanity-check
clamp to prevent the worst effects of not setting rel->tuples sanely.
It doesn't foreclose future improvements inside the FDW.

                        regards, tom lane

Re: estimation problems for DISTINCT ON with FDW

Reply via email to