Hello,

I set up replication with Slony between two PostgreSQL databases running on
two separate FreeBSD servers. Replication itself works without any problem:
every operation I perform on the master is applied to the slave database
immediately. After a while, however, CPU and I/O load starts climbing on the
master. When I examined the processes, I realized that a single process
running on the master was causing the problem. While this process keeps
running, other queries against the database stall and responses sometimes
arrive very late. When I kill this process the server goes back to normal,
but then replication starts throwing errors. The process in question is
shown below.

Some background:

master server IP: 10.0.10.231
slave server IP : 10.0.10.232



sippy=# SELECT procpid, client_addr, current_query, query_start
sippy-# FROM pg_stat_activity WHERE current_query NOT LIKE '<IDLE>';
 procpid | client_addr |    current_query    |          query_start
---------+-------------+---------------------+-------------------------------
   84329 | 10.0.10.232 | fetch 100 from LOG; | 2009-03-25 13:05:57.007739+02



The client_addr column shows that the process above is a request coming from
the slave server. When I searched the Slony FAQ for the ( fetch 100 from
LOG; ) query, it says this problem can stem from a few causes:

>>>>>>>>>>> 

14. Some nodes start consistently falling behind

I have been running Slony-I on a node for a while, and am seeing system 
performance suffering.

I'm seeing long running queries of the form:

        fetch 100 from LOG;

This can be characteristic of pg_listener (which is the table containing NOTIFY 
data) having plenty of dead tuples in it. That makes NOTIFY events take a long 
time, and causes the affected node to gradually fall further and further behind.

You quite likely need to do a VACUUM FULL on pg_listener, to vigorously clean 
it out, and need to vacuum pg_listener really frequently. Once every five 
minutes would likely be AOK.

Slon daemons already vacuum a bunch of tables, and cleanup_thread.c contains a 
list of tables that are frequently vacuumed automatically. In Slony-I 1.0.2, 
pg_listener is not included. In 1.0.5 and later, it is regularly vacuumed, so 
this should cease to be a direct issue.

There is, however, still a scenario where this will still "bite." Under MVCC, 
vacuums cannot delete tuples that were made "obsolete" at any time after the 
start time of the eldest transaction that is still open. Long running 
transactions will cause trouble, and should be avoided, even on subscriber 
nodes.

<<<<<<<<<<<<<< 
As far as I understand, NOTIFY data related to replication is constantly
written to and updated in the pg_listener table, so dead tuples accumulate
there. That is why the FAQ says this table should be vacuumed roughly every
five minutes, but it also says that Slony 1.0.5 and later versions do this
themselves. The Slony version I am running is 1.2.11, so this cause seems
very unlikely.
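Even so, pg_listener can be vacuumed from cron exactly as the FAQ suggests,
just to rule this out. A minimal sketch of such a crontab entry (the database
name, the pgsql user, and the psql path are assumptions for this FreeBSD
setup, not something taken from the FAQ):

```
# /etc/crontab sketch: VACUUM pg_listener every 5 minutes as the pgsql user
*/5  *  *  *  *  pgsql  /usr/local/bin/psql -d sippy -c "VACUUM pg_listener;"
```

Slon's own cleanup thread should already cover this on 1.2.x, so this is
only a belt-and-braces measure while diagnosing.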


sippy=# SELECT * from pg_listener ;
   relname    | listenerpid | notification
--------------+-------------+--------------
 _ssp_Restart |       85920 |            0
(1 row)




pkg_version -v |grep slony
slony1-1.2.11                       <   needs updating (port has 1.2.15)



pkg_version -v |grep postgre*
postgresql-client-8.2.5_1           <   needs updating (port has 8.2.12)
postgresql-server-8.2.5_2           <   needs updating (port has 8.2.12)



As a second cause, the following possibility is mentioned:



>>>>>>>>>>>>>>>
25. Replication has been slowing down, I'm seeing FETCH 100 FROM LOG queries 
running for a long time, sl_log_1 is growing, and performance is, well, 
generally getting steadily worse. 


There are actually a number of possible causes for this sort of thing. There is 
a question involving similar pathology where the problem is that pg_listener 
grows because it is not vacuumed. 

Another " proximate cause " for this growth is for there to be a connection 
connected to the node that sits IDLE IN TRANSACTION for a very long time. 

That open transaction will have multiple negative effects, all of which will 
adversely affect performance:

  - Vacuums on all tables, including pg_listener, will not clear out dead 
    tuples from before the start of the idle transaction.

  - The cleanup thread will be unable to clean out entries in sl_log_1 and 
    sl_seqlog, with the result that these tables will grow, ceaselessly, 
    until the transaction is closed.

You can monitor for this condition inside the database only if the PostgreSQL 
postgresql.conf parameter stats_command_string is set to true. If that is set, 
then you may submit the query select * from pg_stat_activity where 
current_query like '%IDLE% in transaction'; which will find relevant activity.


You should also be able to search for " idle in transaction " in the process 
table to find processes that are thus holding on to an ancient transaction. 


It is also possible (though rarer) for the problem to be a transaction that is, 
for some other reason, being held open for a very long time. The query_start 
time in pg_stat_activity may show you some query that has been running way too 
long. 


There are plans for PostgreSQL to have a timeout parameter, 
open_idle_transaction_timeout , which would cause old transactions to time out 
after some period of disuse. 

Buggy connection pool logic is a common culprit for this sort of thing. There 
are plans for pgpool to provide a better alternative, eventually, where 
connections would be shared inside a connection pool implemented in C. You may 
have some more or less buggy connection pool in your Java or PHP application; 
if a small set of real connections are held in pgpool, that will hide from the 
database the fact that the application imagines that numerous of them are left 
idle in transaction for hours at a time.

<<<<<<<<<<<<<<<<<<<<


This scenario says the problem may be caused by IDLE IN TRANSACTION sessions
that stay open for a long time. When I checked the database with this in
mind, I saw that there really are a lot of these sessions. However, the
open_idle_transaction_timeout parameter mentioned above, which would let me
expire them, does not exist in postgresql.conf yet.
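Since that timeout parameter does not exist, the age of these stuck
transactions can at least be inspected directly. A sketch of such a query
(column names are the ones the 8.2 pg_stat_activity view uses; the
`open_for` alias is just illustrative):

```sql
-- List idle-in-transaction backends, oldest first, together with
-- how long each one has been holding its transaction open.
SELECT procpid,
       client_addr,
       now() - query_start AS open_for
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
ORDER BY query_start;
```

Anything that has been open for days, like the sessions below, is exactly
what blocks the Slony cleanup thread.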


sippy=# SELECT procpid, client_addr, current_query, query_start
sippy-# FROM pg_stat_activity WHERE current_query NOT LIKE '<IDLE>';
 procpid | client_addr |     current_query     |          query_start
---------+-------------+-----------------------+-------------------------------
     918 | 10.0.10.232 | <IDLE> in transaction | 2009-03-25 12:06:15.382207+02
     779 |             | <IDLE> in transaction | 2009-03-23 11:03:19.764812+02
     780 |             | <IDLE> in transaction | 2009-03-23 11:03:19.767793+02
     781 |             | <IDLE> in transaction | 2009-03-23 11:03:19.770713+02
     782 |             | <IDLE> in transaction | 2009-03-23 11:03:19.773612+02
     783 |             | <IDLE> in transaction | 2009-03-23 11:03:19.776419+02
     887 |             | <IDLE> in transaction | 2009-03-23 11:03:23.910591+02
   52027 |             | SELECT procpid, client_addr, current_query, query_start FROM pg_stat_activity WHERE current_query NOT LIKE '<IDLE>'; | 2009-03-25 12:06:38.773785+02
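PostgreSQL 8.2 has no pg_terminate_backend() and no idle-transaction
timeout, so the only immediate workaround I can see is to signal the
offending backends from the shell. A hypothetical sketch (the database
name, the one-day threshold, and running it as the pgsql user are all
assumptions; the real fix is whatever client is leaving these transactions
open):

```shell
#!/bin/sh
# Sketch: SIGTERM backends that have been '<IDLE> in transaction'
# for more than one day. Run as the pgsql user on the master.
for pid in $(psql -d sippy -t -A -c \
    "SELECT procpid FROM pg_stat_activity
     WHERE current_query = '<IDLE> in transaction'
       AND query_start < now() - interval '1 day';")
do
    echo "terminating backend $pid"
    kill -TERM "$pid"   # backend exits, rolling back its open transaction
done
```

The clients on the other end will see their connections drop, so this only
buys time until the application's connection handling is fixed.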



What can I do to solve this? The second FAQ entry mentions pgpool, a
connection-pooling tool for PostgreSQL. I have looked into it a bit, but
before trying it I wanted to ask you first. I may also not have fully
understood the solutions suggested in that second entry. I would appreciate
your ideas. Thanks in advance.
