OK, I was wrong.

 

It turns out that there *were* two slon processes for the same node.
How did I miss this?  Because they were running on two different
machines.  I assumed that all slon processes were running on the same
server (the master DB machine), so I figured a "ps afx | grep slon" on
that server would be enough.  Of course, that command completely missed
the slon process running on the other machine.
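
In case it helps anyone else avoid the same mistake, here's a rough
sketch of the check I should have run: look for slon on every machine
that might be running it, not just the master.  (The host list below is
just our three boxes, and it assumes you can ssh to each of them from
wherever you run this.)

for host in query-1.colo.redfin.com query-2.colo.redfin.com query-4.colo.redfin.com; do
    echo "== $host =="
    ssh "$host" 'ps afx | grep [s]lon'   # [s]lon keeps grep from matching itself
done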

 

How did I discover this?  The following query was helpful...

 

stingray_prod=# select nl.*, psa.client_addr
                  from _stingray_cluster.sl_nodelock nl
                  join pg_stat_activity psa on nl.nl_backendpid = psa.procpid;
 nl_nodeid | nl_conncnt | nl_backendpid | client_addr
-----------+------------+---------------+-------------
         2 |         11 |         14774 | (query-2)
         2 |         15 |         20569 | (query-2)
         1 |          0 |         30549 | (query-1)
         4 |         20 |         30566 | (query-1)
         4 |         21 |         30571 | (query-1)
(5 rows)

 

What are slon processes doing on query-2???  Ah-ha!

 

It just goes to show that it's not enough to double-check and
quadruple-check; sometimes you have to go further.

 

--S

 

________________________________

From: Shahaf Abileah 
Sent: Friday, August 15, 2008 2:35 PM
To: slony1-general@lists.slony.info
Subject: duplicate key value violates unique constraint
"sl_nodelock-pkey"

 

Hello list,

 

I've been using Slony in production for months and things have worked
out pretty well so far.  But this past week I ran into problems when we
switched from Postgres 8.2 to Postgres 8.3 (still using Slony version
1.2.13).  I don't think the error results from the version upgrade, but
I could be wrong.

 

I set up the master database (query-1) and then proceeded to set up
replication to two slaves (query-2 and query-4).  I kicked off the slon
processes for all three simultaneously.  Replication to query-4 got
started right away and succeeded - it took 2 hours for the initial bulk
copy of data, and soon after that it was caught up with incremental
changes.  However, replication to query-2 died right away.  The slon
process for query-2 quit immediately and I saw the following message in
the log:

 

2008-08-15 13:57:01 PDT FATAL  localListenThread: "select
"_stingray_cluster".cleanupNodelock(); insert into
"_stingray_cluster".sl_nodelock values (    2, 0,
"pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates
unique constraint "sl_nodelock-pkey"

 

Now, I searched the interweb for solutions to this problem and I found
several suggestions, but none of them have helped so far...

 

1.      People suggest that this happens when you try to start multiple
slon processes for the same node at the same time.  I'm quite certain
that this is not the case.  I've checked and double-checked and
quadruple-checked.  For what it's worth, remember that I've successfully
run Slony replication to multiple slaves for months.

 

Here are the ones that are running:

 

25537 pts/3    S      0:00 slon -d 2 stingray_cluster host=query-1.colo.redfin.com dbname=my_db user=my_user password=my_password
25541 pts/3    Sl     0:00  \_ slon -d 2 stingray_cluster host=query-1.colo.redfin.com dbname=my_db user=my_user password=my_password
25544 pts/3    S      0:00 slon -d 2 stingray_cluster host=query-4.colo.redfin.com dbname=my_db user=my_user password=my_password
25545 pts/3    Sl     0:00  \_ slon -d 2 stingray_cluster host=query-4.colo.redfin.com dbname=my_db user=my_user password=my_password

 

            And the one that starts and then immediately dies has
query-2 as the host.

 

2.      Some people suggest that this happens when a slon process dies
and leaves behind a stale connection, one so deep in the network stack
that neither Slony nor Postgres is aware of it.  For this, people
suggest either waiting about two hours for some timeout to kick in, or
else having the admin kill -2 the zombie process/connection.
Apparently this happens more often if you try to run Slony across a
WAN.

 

Well, our slon processes all run on the master node, and the other
nodes are right there on the same LAN.  (We did once try to run Slony
across a WAN, but that never worked.)

 

This duplicate key issue first happened more than 12 hours ago, and it
continues to happen.  Each time I try to start a slon process for
query-2, it dies right away with the same error.

 

So, I don't think there's a zombie connection, but I also don't know
exactly how to check for one.  I can say that all the procs in
pg_stat_activity have an xact_start that is less than an hour old.
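
(For what it's worth, this is roughly the query I used to eyeball those
connections.  procpid, xact_start, and client_addr are the 8.3 column
names; I'm still not sure what a stale connection would actually look
like in this output.)

stingray_prod=# select procpid, usename, client_addr, backend_start, xact_start
                  from pg_stat_activity
                 order by backend_start;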

 

3.      One person mentioned having to go into the sl_nodelock table
and manually whack the bad row.  I'm a little nervous about that kind
of surgery because I don't know the sl_nodelock table, so I don't know
what the consequences of whacking a row would be (especially if I
accidentally whack the wrong one).

 

Here's what I see in that table:

 

stingray_prod=# select * from _stingray_cluster.sl_nodelock;
 nl_nodeid | nl_conncnt | nl_backendpid
-----------+------------+---------------
         2 |         11 |         14774
         2 |         15 |         20569
         1 |          0 |         25551
         4 |         16 |         25568
         4 |         17 |         25570
(5 rows)

 

What do these columns mean exactly?  Is there documentation on this
table somewhere?  Which one is likely to be the offending row?  Is it
crazy to whack the row manually?  Would I need to also whack
corresponding rows in other tables?
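
(If manual surgery turns out to be the answer, I'm guessing it would be
a delete along these lines for the stuck node, but that's purely a
guess on my part; I'd want confirmation that it's safe, and that
nothing else references these rows, before running it.)

stingray_prod=# delete from _stingray_cluster.sl_nodelock where nl_nodeid = 2;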

 

Any help with this would be much appreciated.

 

On a related note - is there a psql command (like \d) that I can use to
list all the objects in my _stingray_cluster schema?
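
(I'm guessing the pattern form of the \d commands is what I'm after,
something like the lines below, but I haven't confirmed that the schema
wildcard does what I expect.)

stingray_prod=# \dt _stingray_cluster.*
stingray_prod=# \df _stingray_cluster.*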

 

Thanks,

 

--S

 

