On Thu, May 4, 2017 at 1:19 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 10:27 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> The logs above show that 34 seconds elapsed between starting to abort
>> the transaction and knowing that the foreign server is unreachable. It
>> looks like it took that much time for the local server to realise that
>> the foreign server is not reachable. Looking at PQcancel code, it
>> seems to be trying to connect to the foreign server to cancel the
>> query. But somehow it doesn't seem to honor connect_timeout setting.
>> Is that expected?
> Well, there's no code to do anything else.  Regular connections go
> through connectDBComplete() which uses non-blocking mode and timed
> waits to respect connection_timeout.  PQcancel() has no such handling.
> internal_cancel just opens a socket and, without setting any options
> on it, calls connect().  No loop, no timed waits, nada.  So it's going
> to take as long as it takes for the operating system to notice.
>> Irrespective of what PQcancel does, it looks like postgres_fdw should
>> just slam the connection if query is being aborted because of
>> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have
>> a way to know whether this ABORT is because of user's request to
>> cancel the query, statement timeout, an abort because of some other
>> error or a user requested abort. Except statement timeout (may be
>> user's request to cancel the query?), it should try to keep the
>> connection around to avoid any future reconnection. But I am not able
>> to see how can we provide that information to pgfdw_xact_callback().
> I don't think we can.  In general, PostgreSQL has no facility for
> telling error cleanup handlers why the abort happened, and it wouldn't
> really be the right thing anyway.  The fact that statement_timeout
> fired doesn't necessarily mean that the connection is dead; it could
> equally well mean that the query ran for a long time.  I think we
> basically have two choices.  One is to bound the amount of time we
> spend performing error cleanup, and the other is just to always drop
> the connection.  A problem with the latter is that it might do the
> wrong thing if we're aborting a subtransaction but not the whole
> transaction.  In that case, we need to undo changes since the relevant
> savepoint, but not the ones made before that; closing the connection
> amounts to a full rollback.
> Therefore, I think the patch you proposed in
> https://www.postgresql.org/message-id/CAFjFpRdcWw4h0a-zrL-EiaekkPj8O0GR2M1FwZ1useSRfRm3-g%40mail.gmail.com
> isn't correct, because if the cancel fails it will slam the connection
> shut regardless of whether we're in pgfdw_xact_callback or
> pgfdw_subxact_callback.
> It looks to me like there's a fairly lengthy list of conceptually
> separate but practically related problems here:
> 1. PQcancel() ignores the keepalive parameters, because it makes no
> attempt to set the relevant socket options before connecting.
> 2. PQcancel() ignores connection_timeout, because it doesn't use
> non-blocking mode and has no wait loop.
> 3. There is no asynchronous equivalent of PQcancel(), so we can't use
> a loop to allow responding to interrupts while the cancel is
> outstanding (see pgfdw_get_result for an example).
> 4. pgfdw_xact_callback() and pgfdw_subxact_callback() use PQexec()
> rather than the asynchronous interfaces for sending queries and
> checking for results, so the SQL commands they send are not
> interruptible.
> 5. pgfdw_xact_callback() and pgfdw_subxact_callback() try to get the
> connection back to a good state by using ABORT TRANSACTION for the
> former and ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT for the latter.
> But if it doesn't work, then they just emit a WARNING and continue on
> as if they'd succeeded.  That seems highly likely to make the next use
> of that connection fail in unpredictable ways.

In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails
for any reason, then I think it will close the connection.  The
relevant code is:
    if (PQstatus(entry->conn) != CONNECTION_OK ||
        PQtransactionStatus(entry->conn) != PQTRANS_IDLE)

Basically, if ABORT TRANSACTION fails, then the transaction status
won't be PQTRANS_IDLE.  Also, does it matter if pgfdw_subxact_callback
fails during execution of "ROLLBACK TO SAVEPOINT"?  The user anyway
needs to execute a ROLLBACK TO SAVEPOINT command before proceeding,
and shouldn't that be good enough?

About the patch:

+ /* Rollback all remote subtransactions during abort */
+ snprintf(sql, sizeof(sql),
+          "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
+          curlevel, curlevel);
+ if (!pgfdw_exec_cleanup_query(entry->conn, sql))
+     abort_cleanup_failure = true;

If the above execution fails for any reason, then it won't allow even
committing the main transaction, which I am not sure is the right
thing to do.  I have tried it by editing the above command in a
debugger, and the result is as below:

Create a foreign table and try the scenario below.

postgres=# begin;
postgres=# savepoint s1;
postgres=# insert into foreign_test values(8);
postgres=# savepoint s2;
postgres=# insert into foreign_test values(generate_series(1,1000000));
Cancel request sent
WARNING:  syntax error at or near "OOLLBACK"
ERROR:  canceling statement due to user request

-- In the debugger, I have changed ROLLBACK to mimic the failure.

postgres=# Rollback to savepoint s2;
postgres=# commit;
ERROR:  connection to server "foreign_server" was lost

Considering the above, I am not sure we should treat all failures
during subtransaction abort as indicating a pending-cleanup state and
then disallow further commits.

 #include "utils/memutils.h"
+#include "utils/syscache.h"

This include looks like a spurious addition.

> 6. postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we
> converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use
> pgfdw_exec_query() or something like it rather than PQexec() to submit
> queries, they might still block if we fail to send the query, as the
> comments for pgfdw_exec_query() explain.  This is possible but not
> particularly likely to happen for queries sent from error-handling
> paths, because odds are good that the connection was idle just
> before.  However, if the user is pressing ^C because the remote
> server isn't responding, it's quite probable that we'll run into this
> exact issue.
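The hazard point 6 describes can be shown without libpq at all: on a non-blocking descriptor whose buffer is full, a write fails with EAGAIN instead of blocking, and that is exactly the case the caller must be prepared to handle (PQsetnonblocking arranges the equivalent for libpq's own output buffer). A minimal, self-contained demonstration using a pipe; the helper name is made up:

```c
/*
 * Fill a pipe's buffer via non-blocking writes; returns 1 if we
 * eventually observed EAGAIN/EWOULDBLOCK, i.e. the point at which a
 * blocking descriptor would instead have hung.
 */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int
fill_until_eagain(int wfd)
{
	char		buf[4096] = {0};

	if (fcntl(wfd, F_SETFL, O_NONBLOCK) < 0)
		return 0;
	for (;;)
	{
		ssize_t		n = write(wfd, buf, sizeof(buf));

		if (n < 0)
			return (errno == EAGAIN || errno == EWOULDBLOCK);
	}
}
```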
> 7. postgres_fdw never considers slamming the connection shut as a
> response to trouble.  It seems pretty clear that this is only a safe
> response if we're aborting the toplevel transaction.  If we did it for
> a subtransaction, we'd end up reconnecting if the server were accessed
> again, which would at the very least change the snapshot (note that we
> use at least REPEATABLE READ on the remote side regardless of the
> local isolation level) and might discard changes made on the remote
> side at outer transaction levels.  Even for a top-level transaction,
> it might not always be the right way to proceed, as it incurs some
> overhead to reconnect, but it seems worth considering at least in the
> case where we've already got some indication that the connection is
> unhealthy (e.g. a previous attempt to clean up the connection state
> failed, as in #5, or PQcancel failed, as in Ashutosh's proposed
> patch).

> 8. Before 9.6, PQexec() is used in many more places, so many more
> queries are entirely non-interruptible.
> It seems pretty clear to me that we are not going to fix all of these
> issues in one patch.  Here's a sketch of an idea for how to start
> making things better:
> - Add an in_abort_cleanup flag to ConnCacheEntry.
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin abort
> cleanup, check whether the flag is set.  If so, slam the connection
> shut unless that's already been done; furthermore, if the flag is set
> and we're in pgfdw_xact_callback (i.e. this is a toplevel abort),
> forget about the connection entry entirely.  On the other hand, if the
> flag is not set, set the flag and attempt abort cleanup.  If we
> succeed, clear the flag.
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin
> pre-commit processing, check whether the flag is set.  If so, throw an
> ERROR, so that we switch over to abort processing.
> - Change uses of PQexec() in the abort path to use pgfdw_exec_query()
> instead.  If we exit pgfdw_exec_query() with an error, we'll re-enter
> the abort path, but now in_abort_cleanup will be set, so we'll just
> drop the connection (and force any outer transaction levels to abort
> as well).
> - For bonus points, give pgfdw_exec_query() an optional timeout
> argument, and set it to 30 seconds or so when we're doing abort
> cleanup.  If the timeout expires, it errors out (which again
> re-enters the abort path, dropping the connection and forcing outer
> transaction levels to abort as well).
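To make sure I follow the flag logic being proposed, here is a toy model of it. The struct fields, function names, and the mocked cleanup call are all illustrative stand-ins, not postgres_fdw's actual code:

```c
/* Toy model of the proposed in_abort_cleanup state machine. */
#include <stdbool.h>

typedef struct ConnCacheEntry
{
	bool		conn_open;			/* stands in for entry->conn != NULL */
	bool		in_abort_cleanup;
} ConnCacheEntry;

/* Stand-in for sending ABORT TRANSACTION; a test can flip this. */
static bool cleanup_succeeds = true;

static bool
exec_abort_cleanup(ConnCacheEntry *entry)
{
	(void) entry;
	return cleanup_succeeds;
}

/*
 * Abort path: if the flag is already set, a prior cleanup attempt
 * failed (or errored out partway), so just slam the connection shut;
 * on a toplevel abort, also forget the entry.  Otherwise set the flag,
 * attempt cleanup, and clear the flag only on success.
 */
void
pgfdw_abort(ConnCacheEntry *entry, bool toplevel)
{
	if (entry->in_abort_cleanup)
	{
		entry->conn_open = false;			/* drop the connection */
		if (toplevel)
			entry->in_abort_cleanup = false;	/* forget the entry */
		return;
	}
	entry->in_abort_cleanup = true;
	if (exec_abort_cleanup(entry))
		entry->in_abort_cleanup = false;	/* cleanup worked; keep conn */
}

/* Pre-commit check: a still-set flag means cleanup is pending. */
bool
pgfdw_precommit_ok(const ConnCacheEntry *entry)
{
	return !entry->in_abort_cleanup;
}
```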

Why do we need this 30-second timeout for abort cleanup?  Isn't it
sufficient to propagate the error only on failures?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)