On Tue, Dec 4, 2012 at 1:44 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > After max_standby_streaming_delay, the standby starts cancelling the
> > queries. I get an error like this on the standby:
> >
> > postgres=# explain verbose select count(b) from test WHERE a > 100000;
> > FATAL: terminating connection due to conflict with recovery
> > DETAIL: User query might have needed to see row versions that must be
> > removed.
> > HINT: In a moment you should be able to reconnect to the database and
> > repeat your command.
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Succeeded.
> >
> > So I have a couple of questions/concerns here
> >
> > 1. Why throw a FATAL error here? A plain ERROR should be enough to
> > abort the transaction. There are four places in ProcessInterrupts() where
> > we throw these kinds of errors and three of them are FATAL.
>
> The problem here is that we're in IDLE IN TRANSACTION in this case, which
> currently cannot be cancelled (i.e. pg_cancel_backend() just won't do
> anything).
>
> There are two problems making this non-trivial. For one, while we're in
> IDLE IN TXN the client doesn't expect a response on a protocol level, so
> we can't simply ereport() at that time.
> For another, when we're in IDLE IN TXN we're potentially inside openssl
> so we can't jump out of there anyway because that would quite likely
> corrupt the internal state of openssl.
>
> I tried to fix this before (cf. "Idle in transaction cancellation" or
> similar) but while I had some kind of fix for the first issue (I saved
> the error and reported it later when the protocol state allows it) I
> missed the jumping out of openssl bit. I think it's not that hard to
> solve though. I remember having something preliminary but I never had
> the time to finish it. If I remember correctly the trick was to set
> openssl into non-blocking mode temporarily and return to the caller
> inside be-secure.c:my_sock_read.
>

Thanks Andres. I also read the original thread and I now understand why we
are using FATAL here, at least until we have a better solution. Obviously
the connection reset is no good either because, as someone commented in the
original discussion, it looked to me like a server crash when it was not.

> >
> > AFAICS the first of these should be ereport(ERROR). Otherwise irrespective
> > of whether RecoveryConflictRetryable is true or false, we will always
> > ereport(FATAL).
>
> Which is fine, because we're below if (ProcDiePending). Note there's a
> separate path for QueryCancelPending. We go on to kill connections once
> the normal conflict handling has tried several times.
>

Ok. Understood. I now see that every path below if (ProcDiePending) raises
FATAL, albeit with different error codes. That explains the current code.

> I think we desperately need to improve *all* of these messages with
> significantly more detail (cause for cancellation, relation, current
> xid, conflicting xid, current/last query).
>

I agree.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
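
A minimal sketch of the "save the error and report it later" idea Andres describes above, written as a toy Python model rather than backend code. Everything in it is hypothetical (the ToySession class, its fields, and the message strings); it also assumes that the deferred error is delivered in place of the reply to the next command and that the transaction is then treated as aborted, which the thread does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for a backend session; not PostgreSQL code."""
    in_transaction: bool = False
    pending_error: Optional[str] = None            # error saved while idle
    sent: List[str] = field(default_factory=list)  # messages sent to the client

    def cancel_while_idle(self, reason: str) -> None:
        # The client is not expecting a message right now, so nothing is
        # sent. The error is only remembered for later.
        self.pending_error = reason

    def receive(self, query: str) -> None:
        # A command has arrived, so the client now expects exactly one reply.
        if self.pending_error is not None:
            # Deliver the saved error instead of running the command
            # (assumption: the open transaction is aborted at this point).
            self.sent.append(f"ERROR: {self.pending_error}")
            self.pending_error = None
            self.in_transaction = False
            return
        if query.strip().upper() == "BEGIN":
            self.in_transaction = True
        self.sent.append(f"OK: {query}")


session = ToySession()
session.receive("BEGIN")                       # now idle in transaction
session.cancel_while_idle("conflict with recovery")
session.receive("SELECT 1")                    # the saved error surfaces here
print(session.sent)
```

The point of the model is only the ordering: nothing is written to the client between the cancel and the next command, so the protocol state stays consistent. It says nothing about the second problem in the thread (being inside openssl when the interrupt arrives).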