11.07.2018 17:01, Ashutosh Bapat wrote:
> The patch still applies and it's part of this commitfest.
>
> On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura <y.soko...@postgrespro.ru> wrote:
>> On 2017-10-03 17:30, Sokolov Yura wrote:
>>>
>>> Good day, hackers.
>>>
>>> Under heavy workload, a process sometimes reaches the deadlock
>>> timeout even though no real deadlock occurred. It is easily
>>> reproducible with pg_advisory_xact_lock on the same value plus some
>>> time-consuming operation (an update) and many clients.
>>>
>>> When a backend reaches the deadlock timeout, it calls CheckDeadlock,
>>> which locks all partitions of the lock hash in exclusive mode to
>>> walk the wait-for graph and search for a deadlock.
>>>
>>> If hundreds of backends reach this timeout while trying to acquire
>>> an advisory lock on the same value, the system stalls hard for many
>>> seconds, because they all traverse the same huge lock graph under
>>> an exclusive lock. During this stall no meaningful operation is
>>> possible (no new transaction can begin).
>>>
>>> The attached patch makes CheckDeadlock do two passes:
>>> - The first pass takes LW_SHARED on the partitions of the lock
>>>   hash. DeadLockCheck is called with a "readonly" flag, so it does
>>>   not modify anything.
>>> - If a possible "soft" or "hard" deadlock is detected, i.e. if the
>>>   lock graph may need modification, the partitions are relocked
>>>   with LW_EXCLUSIVE and DeadLockCheck is called again.
>>>
>>> This fixes the hard stall: backends walk the lock graph under a
>>> shared lock and find that there is no real deadlock.
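To make the control flow concrete, here is a toy Python sketch of the two-pass pattern described above. It is only an illustration of the locking discipline, not the actual proc.c code; `RWLock`, `check_deadlock`, and the callbacks are all invented names.

```python
import threading

class RWLock:
    """Minimal readers-writer lock, standing in for an LWLock partition."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

NUM_LOCK_PARTITIONS = 16
partitions = [RWLock() for _ in range(NUM_LOCK_PARTITIONS)]

def check_deadlock(deadlock_check, resolve):
    # Pass 1: read-only scan of the wait-for graph under shared locks.
    # Many backends can run this pass concurrently.
    for p in partitions:
        p.acquire_shared()
    need_fix = deadlock_check(readonly=True)
    for p in reversed(partitions):
        p.release_shared()
    if not need_fix:
        return False  # common case: timeout fired, but no real deadlock

    # Pass 2: only if the graph may need modification, re-check and
    # resolve under exclusive locks.
    for p in partitions:
        p.acquire_exclusive()
    if deadlock_check(readonly=False):
        resolve()
    for p in reversed(partitions):
        p.release_exclusive()
    return True
```

The point of the pattern is that the expensive exclusive relock happens only when the cheap shared-mode pass reports a possible deadlock; spurious timeouts never serialize on the exclusive locks.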
>>>
>>> Test on a 4-socket Xeon machine:
>>> pgbench_zipf -s 10 -c 800 -j 100 -M prepared -T 450 -f
>>> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5
>>>
>>> ycsb_read_zipf.sql:
>>> \set i random_zipfian(1, 100000 * :scale, 1.01)
>>> SELECT abalance FROM pgbench_accounts WHERE aid = :i;
>>> ycsb_update_lock2_zipf.sql:
>>> \set i random_zipfian(1, 100000 * :scale, 1.01)
>>> select lock_and_update( :i );
>>>
>>> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
>>>  RETURNS void
>>>  LANGUAGE sql
>>> AS $function$
>>>   SELECT pg_advisory_xact_lock( $1 );
>>>   UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
>>> $function$
>>>
>>> Without the attached patch:
>>>
>>> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
>>> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
>>> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
>>> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
>>> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
>>> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
>>> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
>>> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
>>> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
>>> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
>>> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
>>> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
>>>
>>> (autovacuum triggered the deadlock timeout,
>>> and therefore the stall)
>>>
>>> Patched:
>>>
>>> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
>>> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
>>> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
>>> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
>>> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
>>> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
>>> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
>>> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
>>> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
>>> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
>>> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
>>> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
>>> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
>>> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482
>
> I am confused about interpreting these numbers. Does it mean that
> without the patch at the end of 70s, we have completed 40553.2 * 70
> transactions and with the patch there are 70 * 38387.9 transactions
> completed in 70 seconds?
It is regular pgbench output: the tps numbers are per-interval, i.e.
the number of transactions completed in that particular 5-second
interval. That is why there are 0 tps and 1 tps intervals without the
patch; such values could not appear if tps were an average since the
start of the run.

It means that without the patch about
46k+51k+52k+51k+47k+42k+12k+0k+0k+0k+0k+27k+39k+41k = 408 thousand
transactions completed, and with the patch about
56k+55k+51k+51k+48k+45k+40k+14k+30k+39k+41k+38k+40k+38k = 586 thousand
transactions completed.

> I agree that with the patch there is smoother degradation when a lot
> of backends enter deadlock detection phase. That's nice, but if my
> interpretation of numbers is correct, that smoothness comes with a
> cost of decreased TPS. So, it would make sense to have a GUC
> controlling this behaviour.

Your interpretation is not correct.

> Talking about GUCs, deadlock_timeout [1] value decides when the
> deadlock check kicks in. In such cases probably increasing it would
> suffice and we may not need the patch.

Increasing deadlock_timeout does avoid this issue, but increasing the
deadlock timeout is a very bad decision in the presence of real
deadlocks.

>> Excuse me, corrected version is in attach.
>>
>
> About the patch itself, I think releasing the shared lock and taking
> exclusive lock on lock partitions, may lead to starvation. If there
> are multiple backends that enter deadlock detection simultaneously,
> all of them will find that there's a deadlock and one of them will
> succeed in resolving it only after all of them release their
> respective locks. Also there's a comment at the end about the order in
> which to release the lock

It is no worse than the behavior without the patch. Without the patch,
when a real deadlock happens and multiple backends enter deadlock
detection, all of them (except the first) wait for the first backend
to release all its EXCLUSIVE locks, only to lock them again and then
find that the deadlock has already been eliminated.
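These totals can be re-derived from the progress output; a minimal Python check, summing all fourteen rounded per-interval tps values from each run:

```python
# Rounded per-interval tps (in thousands), transcribed from the pgbench
# progress lines above. Each progress interval is 5 seconds, so absolute
# transaction counts are about 5x these sums; the comparison is
# unaffected by that scaling.
unpatched = [46, 51, 52, 51, 47, 42, 12, 0, 0, 0, 0, 27, 39, 41]
patched = [56, 55, 51, 51, 48, 45, 40, 14, 30, 39, 41, 38, 40, 38]

print(sum(unpatched))  # 408
print(sum(patched))    # 586
```

So the patched run completes roughly 45% more work over the same window, in addition to avoiding the multi-second 0 tps stalls.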
Given N backends in a deadlock, the total time without the patch is
1*X + (N-1)*Y = Q, where X is the time to detect and resolve the
deadlock, and Y is the time to re-check for a deadlock after it has
been resolved. With the patch, backends do the first pass in parallel,
so the total time is Z*(1+e) + Q, where Z is the time of the first
fast imprecise check and e is the drift due to process scheduling.

Note that in reality all of these times are usually much smaller than
the 1 second default for deadlock_timeout, so backends wait for
deadlock_timeout much longer than the additional Z*(1+e) needed for
the first inexact pass. And if some backends start deadlock detection
a bit later, i.e. after one backend has already taken the exclusive
locks, they will not fully contribute to the (N-1)*Y term, because
they will detect that the deadlock has been eliminated using shared
locks, i.e. in parallel.

>
> /*
>  * And release locks.  We do this in reverse order for two reasons: (1)
>  * Anyone else who needs more than one of the locks will be trying to lock
>  * them in increasing order; we don't want to release the other process
>  * until it can get all the locks it needs. (2) This avoids O(N^2)
>  * behavior inside LWLockRelease.
>  */
> check_done:
>     for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
>         LWLockRelease(LockHashPartitionLockByIndex(i));
>
> Is that scenario applicable when the shared locks are released in the
> first phase?

I release the shared locks in the same order as described in this
comment, so I don't understand what you mean.

With regards,
Sokolov Yura