Fellow Bucardans,

We've run into an issue where replication can lag badly. One minute we're 
seeing < 1s of lag, the next it's up to 20s. This causes problems for tables 
with non-primary unique constraints: customers create usernames and submit 
them twice within 20s, and because each submission has an equal chance of 
hitting either of our two masters, the long lag leads to unique constraint 
violations that kill the syncs.

I've worked out a method to resolve unique conflicts in custom code, but I 
would also like to see if we can fix the lag. The problem is that the planner 
suddenly switches from one plan to another, and the new, bad plan kills 
performance. The code in question is in this function:

  
https://github.com/bucardo/bucardo/blob/acc21e807ea135b08b4502e5973c31d403b97f8a/bucardo.schema#L1499
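
In case it helps anyone reproduce this: one way to catch the plan flip in the 
act is the auto_explain module, which logs the plan of any statement that runs 
longer than a threshold. A session-level sketch (the 5s cutoff is arbitrary; 
for the Bucardo backends you'd set these in postgresql.conf instead, with 
auto_explain in session_preload_libraries):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '5s';
    SET auto_explain.log_analyze = on;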

We managed to replicate the issue with this query:

    select (select count(*) from bucardo.delta_config_users) d,
           (select count(*) from bucardo.track_config_users) t;

When it's working fine, the plan looks like this:

    Limit  (cost=11158.29..15550.30 rows=1 width=0) (actual time=129.648..129.648 rows=0 loops=1)
      ->  Merge Anti Join  (cost=11158.29..15550.30 rows=1 width=0) (actual time=129.647..129.647 rows=0 loops=1)
            Merge Cond: (d.txntime = t.txntime)
            ->  Index Only Scan using dex1_config_users on delta_config_users d  (cost=0.29..2697.87 rows=98307 width=8) (actual time=0.074..10.932 rows=98380 loops=1)
                  Heap Fetches: 4322
            ->  Sort  (cost=11157.96..11316.00 rows=63218 width=8) (actual time=102.265..104.908 rows=67198 loops=1)
                  Sort Key: t.txntime
                  Sort Method: quicksort  Memory: 6222kB
                  ->  Seq Scan on track_config_users t  (cost=0.00..6116.94 rows=63218 width=8) (actual time=0.053..89.108 rows=67198 loops=1)
                        Filter: ((target = 'dbgroup configuration_dbs'::text) OR (target ~ '^T:'::text))
                        Rows Removed by Filter: 188175

  https://explain.depesz.com/s/X5i4

But then it suddenly shifts to:

    Limit  (cost=0.00..2575.59 rows=1 width=0) (actual time=42928.330..42928.330 rows=0 loops=1)
      ->  Nested Loop Anti Join  (cost=0.00..10845806.73 rows=4211 width=0) (actual time=42928.327..42928.327 rows=0 loops=1)
            Join Filter: (d.txntime = t.txntime)
            Rows Removed by Join Filter: 429203342
            ->  Seq Scan on delta_config_users d  (cost=0.00..491.82 rows=31682 width=8) (actual time=0.003..4.954 rows=31750 loops=1)
            ->  Materialize  (cost=0.00..3196.22 rows=23733 width=8) (actual time=0.000..0.485 rows=13519 loops=31750)
                  ->  Seq Scan on track_config_users t  (cost=0.00..3077.55 rows=23733 width=8) (actual time=0.019..29.834 rows=25109 loops=1)
                        Filter: ((target = 'dbgroup configuration_dbs'::text) OR (target ~ '^T:'::text))
                        Rows Removed by Filter: 65218

  https://explain.depesz.com/s/X5i4

The difference is the switch from a Merge Anti Join to a Nested Loop Anti 
Join. Vacuuming both delta_config_users and track_config_users fixed the 
problem for some reason, but only for a while.
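
Since a manual vacuum helps temporarily, I'm tempted to make autovacuum much 
more aggressive on these two tables so the planner's picture of them never 
drifts far. Something like this (the scale factors are guesses, not tested 
values):

    ALTER TABLE bucardo.delta_config_users SET (
        autovacuum_vacuum_scale_factor  = 0.01,
        autovacuum_analyze_scale_factor = 0.01);
    ALTER TABLE bucardo.track_config_users SET (
        autovacuum_vacuum_scale_factor  = 0.01,
        autovacuum_analyze_scale_factor = 0.01);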

Anyone run into something like this before? Can we adjust the table statistics 
to fix the problem?

Oh, and why are there so many records in the deltas table? I thought bucardo 
cleaned records out once replication completed.

Thanks,

David

_______________________________________________
Bucardo-general mailing list
[email protected]
https://bucardo.org/mailman/listinfo/bucardo-general
