Re: [HACKERS] Hot Standby conflict resolution handling

Pavan Deolasee Tue, 04 Dec 2012 04:32:55 -0800

On Tue, Dec 4, 2012 at 1:44 PM, Andres Freund <[email protected]> wrote:


>
> >
> > After max_standby_streaming_delay, the standby starts cancelling the
> > queries. I get an error like this on the standby:
> > postgres=# explain verbose select count(b) from test WHERE a > 100000;
> > FATAL:  terminating connection due to conflict with recovery
> > DETAIL:  User query might have needed to see row versions that must be
> > removed.
> > HINT:  In a moment you should be able to reconnect to the database and
> > repeat your command.
> > server closed the connection unexpectedly
> >     This probably means the server terminated abnormally
> >     before or while processing the request.
> > The connection to the server was lost. Attempting reset: Succeeded.
> >
> > So I have a couple of questions/concerns here:
> >
> > 1. Why throw a FATAL error here? A plain ERROR should be enough to
> > abort the transaction. There are four places in ProcessInterrupts() where
> > we throw these kinds of errors and three of them are FATAL.
>
> The problem here is that we're in IDLE IN TRANSACTION in this case, which
> currently cannot be cancelled (i.e. pg_cancel_backend() just won't do
> anything).
>
> There are two problems making this non-trivial. For one, while we're in
> IDLE IN TXN the client doesn't expect a response on a protocol level, so
> we can't simply ereport() at that time.
> For another, when we're in IDLE IN TXN we're potentially inside openssl
> so we can't jump out of there anyway because that would quite likely
> corrupt the internal state of openssl.
>
> I tried to fix this before (cf. "Idle in transaction cancellation" or
> similar) but while I had some kind of fix for the first issue (I saved
> the error and reported it later when the protocol state allows it) I
> missed the jumping out of openssl bit. I think it's not that hard to
> solve though. I remember having something preliminary but I never had
> the time to finish it. If I remember correctly the trick was to set
> openssl into non-blocking mode temporarily and return to the caller
> inside be-secure.c:my_sock_read.
>

Thanks Andres. I also read the original thread and I now understand why we
are using FATAL here, at least until we have a better solution. Obviously
the connection reset is no good either because, as someone commented in the
original discussion, it made me think I was seeing a server crash when it
was not one.
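
To make the "save the error and report it later" idea quoted above concrete,
here is a toy C model of it. It is only a sketch under stated assumptions:
the Session struct and the raise_conflict() and begin_command() functions are
hypothetical names invented for illustration, not PostgreSQL code, and the
openssl problem is not modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state; not a PostgreSQL structure. */
typedef struct {
    bool client_expects_response;  /* true while a command is in progress */
    const char *pending_error;     /* error saved for later, or NULL */
} Session;

/*
 * A conflict is detected. If the client is waiting for a response we can
 * report right away; otherwise (idle in transaction) we only save it.
 * Returns the message to send now, or NULL if it was deferred.
 */
static const char *raise_conflict(Session *s, const char *msg)
{
    if (s->client_expects_response)
        return msg;
    s->pending_error = msg;
    return NULL;
}

/*
 * The next client command arrives, so the protocol state now allows a
 * response. Returns the saved error (if any) and clears the slot.
 */
static const char *begin_command(Session *s)
{
    const char *msg = s->pending_error;

    s->client_expects_response = true;
    s->pending_error = NULL;
    return msg;
}
```

In this model a conflict raised while idle returns NULL, and the saved
message comes back from the next begin_command() call, which is the point at
which the client can accept an error.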



>
> >
> > AFAICS the first of these should be ereport(ERROR). Otherwise,
> > irrespective of whether RecoveryConflictRetryable is true or false, we
> > will always ereport(FATAL).
>
> Which is fine, because we're below if (ProcDiePending). Note there's a
> separate path for QueryCancelPending. We go on to kill connections once
> the normal conflict handling has tried several times.
>
>
Ok. Understood. I now see that every path below if (ProcDiePending) will
ereport(FATAL), albeit with different error codes. That explains the current
code.
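
As a rough illustration of that control flow, here is a toy C model. It is a
simplified sketch, not the actual ProcessInterrupts() source: the
classify_interrupt() function, the Level enum and the reason strings are
hypothetical, the strings stand in for the real error codes, and the
separate QueryCancelPending path is modelled as a plain ERROR, which is a
simplification.

```c
#include <stdbool.h>

typedef enum { LEVEL_NONE, LEVEL_ERROR, LEVEL_FATAL } Level;

typedef struct {
    Level level;
    const char *reason;  /* illustrative label, not a real error code */
} Outcome;

/* Decide what a pending interrupt would turn into in this toy model. */
static Outcome classify_interrupt(bool proc_die_pending,
                                  bool query_cancel_pending,
                                  bool recovery_conflict_pending,
                                  bool recovery_conflict_retryable)
{
    if (proc_die_pending)
    {
        /* Every branch under the die flag ends the session with FATAL;
         * only the reported reason differs. */
        if (recovery_conflict_pending && recovery_conflict_retryable)
            return (Outcome){LEVEL_FATAL, "retryable recovery conflict"};
        if (recovery_conflict_pending)
            return (Outcome){LEVEL_FATAL, "non-retryable recovery conflict"};
        return (Outcome){LEVEL_FATAL, "other termination request"};
    }

    /* Separate path: cancel the query but keep the connection. */
    if (query_cancel_pending)
        return (Outcome){LEVEL_ERROR, "query cancelled"};

    return (Outcome){LEVEL_NONE, "nothing pending"};
}
```

With proc_die_pending set, the result is FATAL whether or not the conflict
is retryable, which matches the behaviour I asked about; an ERROR only comes
out of the query-cancel path.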



>
> I think we desperately need to improve *all* of these messages with
> significantly more detail (cause for cancellation, relation, current
> xid, conflicting xid, current/last query).
>
>
I agree.

Thanks,
Pavan


-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
