Re: [HACKERS] Getting sorted data from foreign server

Robert Haas Tue, 13 Oct 2015 09:38:17 -0700

On Tue, Oct 13, 2015 at 3:29 AM, Ashutosh Bapat
<[email protected]> wrote:
>> - You consider pushing down ORDER BY if any prefix of the query
>> pathkeys satisfy is_foreign_expr(), but that doesn't seem right to me.
>> If the user says SELECT * FROM remotetab ORDER BY a, unsafe(a),
>> ordering the result set by a doesn't help us much.  We've talked a few
>> times about an incremental sort capability that would take a stream of
>> tuples already ordered by one or more columns and sort each group by
>> additional columns, but I don't think we have that currently.  Without
>> that capability, I don't think there's much benefit in sorting by a
>> prefix of the pathkeys.  I suspect that if we can't get them all, it's
>> not worth doing.
>
> I somehow thought, we are using output, which is ordered by prefix of
> pathkeys in Sort nodes. But as you rightly pointed out that's not the case.
> Only complete pathkeys are useful.


A truncated list of pathkeys is useful for merge joins, but not for
toplevel ordering.

>> - Right now, you have this code below the point where we bail out if
>> use_remote_estimate is not set.  If we keep it like that, the comment
>> needs updating.  But I suggest that we consider an ordered path even
>> if we are not using remote estimates.  Estimate the sorted path to
>> cost somewhat more than the unsorted path, so that we only choose that
>> path if the sort actually benefits us.  I don't know exactly how to
>> come up with a principled estimate given that we don't actually know
>> whether the remote side will need an extra sort or not, but maybe a
>> dumb estimate is still better than not trying a sorted path.
>
> I like that idea, although there are two questions
> 1. How can we estimate cost of getting the data sorted? If there is an
> appropriate index on foreign server we can get the data sorted at no extra
> cost. If there isn't the cost of sorting is proportionate to NlogN where N
> is the size of data. It seems unreasonable to arrive at the cost of sorting
> by multiplying with some constant multiplier. Also, the constant multiplier
> to the NlogN estimate depends heavily upon the properties of foreign server
> e.g. size of memory available for sorting, disk and CPU speed etc. The last
> two might have got factored into fdw_tuple_cost and fdw_startup_cost, so
> that's probably taken care of. If the estimate we come up turns out to be
> too pessimistic, we will not get sorted data even if that's the right thing
> to do. If too optimistic, we will incur heavy cost at the time of execution.
> Setting the cost estimate to some constant factor of NlogN would be too
> pessimistic if there is an appropriate index on foreign server. Otherway
> round if there isn't an appropriate index on foreign server.
>
> Even if we leave these arguments aside for a while, the question remains as
> to what should be the constant factor 10% or 20% or 50% or 100% or something
> else on top of the estimate for simple foreign table scan estimates (or
> NlogN of that)? I am unable to justify any of these factors myself. What do
> you say?

I think we want to estimate the cost in such a way that we'll tend to
pick the ordered path if it's useful, but skip it if it's not.  So,
say we pick 10%.  That's definitely enough that we won't pick a remote
sort when it's useless, but it's small enough that if a remote sort is
useful, we will probably choose to do it.  I think that's what we
want.  I believe we should err on the side of a small estimate because
it's generally better to do as much work as possible on the remote
side.  In some cases the sort may turn out to be free at execution
time because the remote server was going to generate the results in
that order anyway, and it may know that because of its own pathkeys,
and thus be able to skip the explicit ordering step.

>> - In the long run, we should probably either add some configuration so
>> that the local side can make better estimates even without
>> use_remote_estimate, or maybe have a way for the FDW to keep a cache
>> of data in the system catalogs that is updated by ANALYZE.  Then,
>> analyzing a foreign table could store information locally about
>> indexes and so forth, instead of starting each query planning cycle
>> with no knowledge about the remote side.  That's not a matter for this
>> patch, I don't think, but it seems like something we should do.
>
> To an extent knowing which indexes are available on the tables on foreign
> server will help. Now, I do understand that not every foreign server will
> have indexes like PostgreSQL, but as long as whatever they have can be
> translated into a language that PostgreSQL can understand it should be fine.
> From that point of view, will it help if we have declarative indexes on
> foreign tables similar to the declarative constraints? Obviously, we will be
> burdening user with extra work of maintaining the declarative indexes in
> sync like we do for constraints. But we might ease the burden when we get to
> fetch that information automatically from the foreign server.

I think fetching the information from the remote server automatically
is better than forcing the user to declare it.  Quite aside from the
administrative burden, there's no guarantee that the remote data
source's indexing capabilities are representable in our system catalog
structure.  For postgres_fdw they likely will be, unless the remote
server is a newer version with newer magical abilities, but other
systems might behave in ways that don't map onto our notions of what
an index looks like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Getting sorted data from foreign server

Reply via email to