Robert Haas <robertmh...@gmail.com> writes:
> On Thu, Feb 11, 2016 at 11:34 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> The problem here is that when the deadlock detector kills s8's
>> transaction, s7a8 is also left free to proceed, so there is a race
>> condition as to which query completion will get back to
>> isolationtester first.
>> 
>> One grotty way to handle that would be something like
>> 
>> -step "s7a8"    { LOCK TABLE a8; }
>> +step "s7a8"    { LOCK TABLE a8; SELECT pg_sleep(5); }
>> 
>> Or we could simplify the locking structure enough so that no other
>> transactions are released by the deadlock failure.  I do not know
>> exactly what you had in mind to be testing here?

> Basically just that the deadlock actually got detected.   That may
> sound a bit weak, but considering we had no test for it at all before
> this...

I tried fixing it as shown above, and was dismayed to find out that
it didn't work, ie, there was still a difference between the regular
output and the results with CLOBBER_CACHE_ALWAYS.  In the latter case
the printout makes it appear that s7a8 completed before s8a1, which
is nonsensical.

Investigation showed that there are a couple of reasons.  First,
isolationtester's is-it-waiting query takes an insane amount of
time under CLOBBER_CACHE_ALWAYS --- over half a second on my
reasonably new server.  Probing the state of half a dozen blocked
sessions thus takes a while.  Second, once s8 has been booted out
of its transaction, s7 is no longer "blocked" according to
isolationtester's definition (it's doing the pg_sleep query
instead).  Therefore, when we're rechecking all the other blocked
steps after detecting that s8 has become blocked, two things
happen: enough time elapses for the deadlock detector to fire,
and then when we get around to checking s7, we don't see it as
blocked and therefore wait until it finishes.  So s7a8 is reported
first despite the pg_sleep, and would be no matter how large a pg_sleep
delay is used.

We could possibly fix this by using a deadlock timeout even higher than
5 seconds, but that way madness lies.

Instead, what I propose we do about this is to change isolationtester
so that once it's decided that a given step is blocked, it no longer
issues the is-it-waiting query for that step; it just assumes that the
step should be treated as blocked.  So all we need do for "backlogged"
steps is check PQisBusy/PQconsumeInput.  That both greatly reduces the
number of is-it-waiting queries that are needed and avoids any flappy
behavior of the answer.
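
To make the effect concrete, here is a toy model of the polling decision,
written in Python rather than isolationtester's C.  It is only a sketch:
Step, Tester, poll and replay are hypothetical names, the "busy" flag
stands in for what PQconsumeInput/PQisBusy would report, and
"lock_waiting" stands in for the answer the is-it-waiting query would
give.  With sticky=False it models the current recheck behavior; with
sticky=True it models the proposal.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    busy: bool = True           # stand-in for PQisBusy() after PQconsumeInput()
    lock_waiting: bool = False  # what the is-it-waiting query would answer
    backlogged: bool = False    # set once the tester has judged the step blocked


class Tester:
    def __init__(self, sticky: bool):
        self.sticky = sticky      # True models the proposed behavior
        self.waiting_queries = 0  # counts the expensive is-it-waiting probes

    def poll(self, step: Step) -> str:
        """Classify a step as 'done', 'blocked', or 'running'."""
        if not step.busy:
            return "done"
        if self.sticky and step.backlogged:
            return "blocked"      # proposed: never re-ask about a backlogged step
        self.waiting_queries += 1
        if step.lock_waiting:
            step.backlogged = True
            return "blocked"
        return "running"          # the real tester would now wait for completion


def replay(sticky: bool):
    """Replay the s7a8 scenario: blocked on a lock, then inside pg_sleep."""
    tester = Tester(sticky)
    s7a8 = Step("s7a8", lock_waiting=True)
    first = tester.poll(s7a8)     # s7 is still waiting for its lock
    s7a8.lock_waiting = False     # s8 was killed; s7 is now running pg_sleep
    second = tester.poll(s7a8)    # the recheck after s8 is seen as blocked
    return first, second, tester.waiting_queries
```

In this model replay(False) yields ("blocked", "running", 2): the recheck
flips s7a8 to "running", which is the point where the tester would sit
and wait for it.  replay(True) yields ("blocked", "blocked", 1): the
answer stays put and one probe is saved.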

Comments?

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
