Has anyone seen this before?

We're using PostgreSQL 9.1.5 on Solaris 10, Perl 5.8.4, and Bucardo 4.99.7. We have two boxes in a master-master swap sync, conflict_strategy=latest, comprising around 39 tables in 3 databases, roughly 50k rows in all. We have it divided into 3 dbgroups and 4 syncs total for convenience. With some data in both boxes' databases, mostly in sync, and our app layer turned off (no other db clients), we can start bucardo and end up with locks looking something like this:

   # psql -Upostgres -c "select pid,count(pid) from pg_locks group by
   pid order by count(pid)"
     pid  | count
   ------+-------
     7404 |     2
     6947 |    16
     6942 |    32
     6930 |    74
     6855 |    84
     6940 |  1678
   (6 rows)

That's over 1600 locks just for bucardo. Notably this is only on one box; the bucardo on the other box is using around 400 locks to perform the same sync. Looking for culprit queries, pid 6940 above is mostly idle, holding locks on different tables with no current query:

   # psql -Upostgres -c "select database, relation, pid, mode,
   current_query  from pg_locks join  pg_stat_activity on (pid=procpid)"
    database | relation | pid  |      mode       | current_query
   ----------+----------+------+-----------------+---------------
       27512 |    28671 | 6940 | SIReadLock      | <IDLE>
       27512 |    28524 | 6940 | SIReadLock      | <IDLE>
       27512 |    28562 | 6940 | SIReadLock      | <IDLE>
       27512 |    28671 | 6940 | SIReadLock      | <IDLE>
       27512 |    28726 | 6940 | SIReadLock      | <IDLE>
       27512 |    28634 | 6940 | SIReadLock      | <IDLE>
   .... etc ....
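
To see how much of a pid's count is SIReadLock (the predicate locks that SERIALIZABLE transactions take in 9.1) versus ordinary locks, a per-mode tally might be useful. The following is only a sketch: the helper name tally_lock_modes is hypothetical, and it assumes psql's unaligned, tuples-only output (-At) with the default "|" field separator. It reads "pid|mode" lines on stdin, so it should also work on snapshots saved to a file while the count is climbing.

```shell
# Hypothetical helper (not part of bucardo or PostgreSQL):
# tally lock modes per pid from "pid|mode" lines on stdin and
# print "pid|mode|count", sorted by pid and mode.
tally_lock_modes() {
  awk -F'|' 'NF >= 2 { n[$1 "|" $2]++ }
             END { for (k in n) print k "|" n[k] }' | LC_ALL=C sort
}

# Intended use against the live server (needs a running database):
#   psql -Upostgres -At -c "select pid, mode from pg_locks" | tally_lock_modes
```

The same breakdown could of course be done in SQL with a group by on pid and mode; the shell version is just convenient for comparing saved snapshots from the two boxes.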

Now, if I start our app, which will begin doing writes on both boxes, we see transient activity from our queries and from bucardo doing its COPY ... STDIN work, but those queries do their work and go away. The lock count rises over several minutes, up to 20,000 or 30,000 on one box, in a similar manner. Around that point we start to hit this:

   (28706) [Fri Apr 12 18:28:54 2013] KID New kid, sync "o1_sync"
   alive=1 Parent=27035 PID=28706 kicked=1
   (28706) [Fri Apr 12 18:28:54 2013] KID DBD::Pg::db pg_result failed:
   ERROR:  out of shared memory HINT:  You might need to increase
   max_pred_locks_per_transaction. at /tpapp/tpdb/lib/perl5/Bucardo.pm
   line 3140. Line: 4801 Main DB state: ? Error: none DB source_ossgw
   state: 53200 Error: 7 DB target_ossgw state: ? Error: none
   (28706) [Fri Apr 12 18:28:54 2013] KID Kid 28706 exiting at
   cleanup_kid. Sync "o1_sync" public.commonobjects Reason: DBD::Pg::db
   pg_result failed: ERROR:  out of shared memory HINT:  You might need
   to increase max_pred_locks_per_transaction. at
   /tpapp/tpdb/lib/perl5/Bucardo.pm line 3140. Line: 4801 Main DB
   state: ? Error: none DB source_ossgw state: 53200 Error: 7 DB
   target_ossgw state: ? Error: none

At that point things start failing on random tables and queries everywhere, for all clients. We've tried bumping the pg lock settings to no avail. We'd like to understand the lock usage, which we're assuming is the root cause here.

Any ideas appreciated.  Thanks!








_______________________________________________
Bucardo-general mailing list
[email protected]
https://mail.endcrypt.com/mailman/listinfo/bucardo-general
