On Thu, May 4, 2017 at 1:19 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 10:27 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> The logs above show that 34 seconds elapsed between starting to abort
>> the transaction and knowing that the foreign server is unreachable. It
>> looks like it took that much time for the local server to realise that
>> the foreign server is not reachable. Looking at PQcancel code, it
>> seems to be trying to connect to the foreign server to cancel the
>> query. But somehow it doesn't seem to honor connect_timeout setting.
>> Is that expected?
> Well, there's no code to do anything else.  Regular connections go
> through connectDBComplete() which uses non-blocking mode and timed
> waits to respect connection_timeout.  PQcancel() has no such handling.
> internal_cancel just opens a socket and, without setting any options
> on it, calls connect().  No loop, no timed waits, nada.  So it's going
> to take as long as it takes for the operating system to notice.
>> Irrespective of what PQcancel does, it looks like postgres_fdw should
>> just slam the connection if query is being aborted because of
>> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have
>> a way to know whether this ABORT is because of user's request to
>> cancel the query, statement timeout, an abort because of some other
>> error or a user requested abort. Except statement timeout (may be
>> user's request to cancel the query?), it should try to keep the
>> connection around to avoid any future reconnection. But I am not able
>> to see how can we provide that information to pgfdw_xact_callback().
> I don't think we can.  In general, PostgreSQL has no facility for
> telling error cleanup handlers why the abort happened, and it wouldn't
> really be the right thing anyway.  The fact that statement_timeout
> fired doesn't necessarily mean that the connection is dead; it could
> equally well mean that the query ran for a long time.  I think we
> basically have two choices.  One is to bound the amount of time we
> spend performing error cleanup, and the other is just to always drop
> the connection.  A problem with the latter is that it might do the
> wrong thing if we're aborting a subtransaction but not the whole
> transaction.  In that case, we need to undo changes since the relevant
> savepoint, but not the ones made before that; closing the connection
> amounts to a full rollback.
> Therefore, I think the patch you proposed in
> https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com
> isn't correct, because if the cancel fails it will slam the connection
> shut regardless of whether we're in pgfdw_xact_callback or
> pgfdw_subxact_callback.
> It looks to me like there's a fairly lengthy list of conceptually
> separate but practically related problems here:
> 1. PQcancel() ignores the keepalive parameters, because it makes no
> attempt to set the relevant socket options before connecting.
> 2. PQcancel() ignores connection_timeout, because it doesn't use
> non-blocking mode and has no wait loop.
> 3. There is no asynchronous equivalent of PQcancel(), so we can't use
> a loop to allow responding to interrupts while the cancel is
> outstanding (see pgfdw_get_result for an example).
> 4. pgfdw_xact_callback() and pgfdw_subxact_callback() use PQexec()
> rather than the asynchronous interfaces for sending queries and
> checking for results, so the SQL commands they send are not
> interruptible.
> 5. pgfdw_xact_callback() and pgfdw_subxact_callback() try to get the
> connection back to a good state by using ABORT TRANSACTION for the
> former and ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT for the latter.
> But if it doesn't work, then they just emit a WARNING and continue on
> as if they'd succeeded.  That seems highly likely to make the next use
> of that connection fail in unpredictable ways.

In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails
for any reason, then I think it will close the connection.  The
relevant code is:
    if (PQstatus(entry->conn) != CONNECTION_OK ||
        PQtransactionStatus(entry->conn) != PQTRANS_IDLE)

Basically, if ABORT TRANSACTION fails, then the transaction status
won't be PQTRANS_IDLE.  Also, does it matter if pgfdw_subxact_callback
fails during execution of "ROLLBACK TO SAVEPOINT"?  The user anyway
needs to execute a ROLLBACK TO SAVEPOINT command before proceeding,
and shouldn't that be good enough?

About the patch:

+ /* Rollback all remote subtransactions during abort */
+ snprintf(sql, sizeof(sql),
+          "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
+          curlevel, curlevel);
+ if (!pgfdw_exec_cleanup_query(entry->conn, sql))
+     abort_cleanup_failure = true;

If the above execution fails for any reason, then it won't allow even
committing the main transaction, which I am not sure is the right
thing to do.  I have tried it by editing the above command in a
debugger, and the result is as below:

Create a foreign table and try the scenario below.

postgres=# begin;
postgres=# savepoint s1;
postgres=# insert into foreign_test values(8);
postgres=# savepoint s2;
postgres=# insert into foreign_test values(generate_series(1,1000000));
Cancel request sent
WARNING:  syntax error at or near "OOLLBACK"
ERROR:  canceling statement due to user request

-- In the debugger, I have changed ROLLBACK to mimic the failure.

postgres=# Rollback to savepoint s2;
postgres=# commit;
ERROR:  connection to server "foreign_server" was lost

Considering the above, I am not sure we should treat all failures
during subtransaction abort as indicating a pending-cleanup state and
then disallow further commits.

 #include "utils/memutils.h"
+#include "utils/syscache.h"

This include looks like a spurious addition.

> 6. postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we
> converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use
> pgfdw_exec_query() or something like it rather than PQexec() to submit
> queries, they might still block if we fail to send the query, as the
> comments for pgfdw_exec_query() explain.  This is possible but not
> particularly likely to happen for queries sent from error-handling
> paths, because odds are good that the connection was idle just
> before.  However, if the user is pressing ^C because the remote
> server isn't responding, it's quite probable that we'll run into this
> exact issue.
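The hazard point 6 describes can be shown without libpq at all: on a non-blocking descriptor whose buffer is full, a write fails with EAGAIN instead of blocking, and that is exactly the case the caller must be prepared to handle (PQsetnonblocking arranges the equivalent for libpq's own output buffer). A minimal, self-contained demonstration using a pipe; the helper name is made up:

```c
/*
 * Fill a pipe's buffer via non-blocking writes; returns 1 if we
 * eventually observed EAGAIN/EWOULDBLOCK, i.e. the point at which a
 * blocking descriptor would instead have hung.
 */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int
fill_until_eagain(int wfd)
{
	char		buf[4096] = {0};

	if (fcntl(wfd, F_SETFL, O_NONBLOCK) < 0)
		return 0;
	for (;;)
	{
		ssize_t		n = write(wfd, buf, sizeof(buf));

		if (n < 0)
			return (errno == EAGAIN || errno == EWOULDBLOCK);
	}
}
```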
> 7. postgres_fdw never considers slamming the connection shut as a
> response to trouble.  It seems pretty clear that this is only a safe
> response if we're aborting the toplevel transaction.  If we did it for
> a subtransaction, we'd end up reconnecting if the server were accessed
> again, which would at the very least change the snapshot (note that we
> use at least REPEATABLE READ on the remote side regardless of the
> local isolation level) and might discard changes made on the remote
> side at outer transaction levels.  Even for a top-level transaction,
> it might not always be the right way to proceed, as it incurs some
> overhead to reconnect, but it seems worth considering at least in the
> case where we've already got some indication that the connection is
> unhealthy (e.g. a previous attempt to clean up the connection state
> failed, as in #5, or PQcancel failed, as in Ashutosh's proposed
> patch).

> 8. Before 9.6, PQexec() is used in many more places, so many more
> queries are entirely non-interruptible.
> It seems pretty clear to me that we are not going to fix all of these
> issues in one patch.  Here's a sketch of an idea for how to start
> making things better:
> - Add an in_abort_cleanup flag to ConnCacheEntry.
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin abort
> cleanup, check whether the flag is set.  If so, slam the connection
> shut unless that's already been done; furthermore, if the flag is set
> and we're in pgfdw_xact_callback (i.e. this is a toplevel abort),
> forget about the connection entry entirely.  On the other hand, if the
> flag is not set, set the flag and attempt abort cleanup.  If we
> succeed, clear the flag.
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin
> pre-commit processing, check whether the flag is set.  If so, throw an
> ERROR, so that we switch over to abort processing.
> - Change uses of PQexec() in the abort path to use pgfdw_exec_query()
> instead.  If we exit pgfdw_exec_query() with an error, we'll re-enter
> the abort path, but now in_abort_cleanup will be set, so we'll just
> drop the connection (and force any outer transaction levels to abort
> as well).
> - For bonus points, give pgfdw_exec_query() an optional timeout
> argument, and set it to 30 seconds or so when we're doing abort
> cleanup.  If the timeout expires, it errors out (which again
> re-enters the abort path, dropping the connection and forcing outer
> transaction levels to abort as well).
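To make sure I follow the flag logic being proposed, here is a toy model of it. The struct fields, function names, and the mocked cleanup call are all illustrative stand-ins, not postgres_fdw's actual code:

```c
/* Toy model of the proposed in_abort_cleanup state machine. */
#include <stdbool.h>

typedef struct ConnCacheEntry
{
	bool		conn_open;			/* stands in for entry->conn != NULL */
	bool		in_abort_cleanup;
} ConnCacheEntry;

/* Stand-in for sending ABORT TRANSACTION; a test can flip this. */
static bool cleanup_succeeds = true;

static bool
exec_abort_cleanup(ConnCacheEntry *entry)
{
	(void) entry;
	return cleanup_succeeds;
}

/*
 * Abort path: if the flag is already set, a prior cleanup attempt
 * failed (or errored out partway), so just slam the connection shut;
 * on a toplevel abort, also forget the entry.  Otherwise set the flag,
 * attempt cleanup, and clear the flag only on success.
 */
void
pgfdw_abort(ConnCacheEntry *entry, bool toplevel)
{
	if (entry->in_abort_cleanup)
	{
		entry->conn_open = false;			/* drop the connection */
		if (toplevel)
			entry->in_abort_cleanup = false;	/* forget the entry */
		return;
	}
	entry->in_abort_cleanup = true;
	if (exec_abort_cleanup(entry))
		entry->in_abort_cleanup = false;	/* cleanup worked; keep conn */
}

/* Pre-commit check: a still-set flag means cleanup is pending. */
bool
pgfdw_precommit_ok(const ConnCacheEntry *entry)
{
	return !entry->in_abort_cleanup;
}
```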

Why do we need this 30-second timeout for abort cleanup?  Isn't it
sufficient to propagate the error only on failures?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)