Re: [Slony1-general] slony1 drop node failure
Checked the logs for our other replicas. Details below. With a 6-7 second gap, it’s less clear how the problem could have come from those. Reproducing this would probably be hard; this is the first time it’s ever happened for us. On the theory that there is a race condition, one way to reproduce would be to hack the code to delay the SYNC processing thread at just the wrong time, so the SYNC doesn’t land until disableNode processing is complete (a DB-side sketch of that idea follows at the end of this message).

2018-02-21 19:19:47 UTC [32652] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 0.479 seconds
...
2018-02-21 19:19:50 UTC [32652] CONFIG disableNode: no_id=8

2018-02-21 19:19:46 UTC [9360] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 3.979 seconds
...
2018-02-21 19:19:52 UTC [9360] CONFIG disableNode: no_id=8

2018-02-21 19:19:48 UTC [7420] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 2.445 seconds
...
2018-02-21 19:19:55 UTC [7420] CONFIG disableNode: no_id=8

Tom(

On 2/28/18, 10:28 PM, "Steve Singer" wrote:

On Mon, 26 Feb 2018, Tignor, Tom wrote:

> In the slony1 log of our primary host (the same one which later showed the bug) we had:
>
> 2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
> 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>
> The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41.
> To me it looks like there is a race condition: despite the order of the log
> events, the DROP NODE was processed first and the SYNC was accepted later,
> and thus created an sl_event record. Of course, that’s simply a guess.

This might be possible. If the node being dropped is 8 and the event node for
the DROP NODE is 4, some third node, say 3, might have a sequence like this:

remoteWorkerThread_8: starts processing the SYNC
remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
remoteWorkerThread_8: finishes processing the SYNC and COMMITs

I'd feel more comfortable if we had a way of reproducing this, though. One
option might be to have the remoteWorkerThread recheck the runtime config
before the COMMIT, to make sure that the node is still active.

> Tom(
>
> [...]
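[Editor's sketch] A possible way to create that delay from the database side, without patching slon: an untested sketch. It assumes that logApplySaveStats() writes to sl_apply_stats (so the lock stalls the SYNC transaction after its sl_event insert but before its COMMIT) and it reuses the node IDs and cluster name from this thread.

  -- Untested repro sketch. Session 1 (psql on the node that later
  -- shows the leftover row):
  BEGIN;
  -- EXCLUSIVE mode conflicts with the ROW EXCLUSIVE lock that
  -- logApplySaveStats() needs, so the in-flight SYNC transaction
  -- stalls here after it has already inserted its sl_event row.
  LOCK TABLE "_ams_cluster".sl_apply_stats IN EXCLUSIVE MODE;
  -- While the SYNC is stalled, submit the DROP NODE via slonik and
  -- wait for "disableNode: no_id=8" in this node's slon log.
  -- dropNode_int's DELETE on sl_event cannot see the still-uncommitted
  -- SYNC row, so it deletes nothing for it.
  COMMIT;  -- release the lock; the stalled SYNC now commits, and its
           -- sl_event row materializes after the drop: the leftover.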
Re: [Slony1-general] slony1 drop node failure
On Mon, 26 Feb 2018, Tignor, Tom wrote:

> In the slony1 log of our primary host (the same one which later showed the bug) we had:
>
> 2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
> 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>
> The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41.
> To me it looks like there is a race condition: despite the order of the log
> events, the DROP NODE was processed first and the SYNC was accepted later,
> and thus created an sl_event record. Of course, that’s simply a guess.

This might be possible. If the node being dropped is 8 and the event node for
the DROP NODE is 4, some third node, say 3, might have a sequence like this:

remoteWorkerThread_8: starts processing the SYNC
remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
remoteWorkerThread_8: finishes processing the SYNC and COMMITs

I'd feel more comfortable if we had a way of reproducing this, though. One
option might be to have the remoteWorkerThread recheck the runtime config
before the COMMIT, to make sure that the node is still active.

> Tom(
>
> [...]
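[Editor's sketch] A minimal sketch of what that pre-COMMIT recheck could query, assuming sl_node's no_active flag reflects disableNode; this illustrates the idea only and is not the actual bug375 patch:

  -- Hypothetical pre-COMMIT guard in the remote worker's transaction:
  -- is the origin of the in-flight SYNC (node 8 here) still active?
  SELECT no_active
    FROM "_ams_cluster".sl_node
   WHERE no_id = 8;
  -- Zero rows, or no_active = false: the node was dropped or disabled
  -- while the SYNC was being applied, so the worker should ROLLBACK
  -- rather than COMMIT and re-create rows dropNode_int just deleted.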
Re: [Slony1-general] slony1 drop node failure
In the slony1 log of our primary host (the same one which later showed the bug) we had:

2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8

The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41. To me it looks like there is a race condition: despite the order of the log events, the DROP NODE was processed first and the SYNC was accepted later, and thus created an sl_event record. Of course, that’s simply a guess.

Tom(

On 2/26/18, 12:01 PM, "Steve Singer" wrote:

On Mon, 26 Feb 2018, Tignor, Tom wrote:

> Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
> constraint violation appearing in the errors was for sl_event. Do we
> expect these changes to fully remove all state, including sl_event? The
> checkNodeDeleted function doesn’t look at sl_event.

It could be that this is a different issue, then. That function (the
dropNode_int) should have removed the rows from sl_event. The question then
becomes: did it not remove them for some reason, or did they get added back
later, and if so, how?

> Tom(
>
> [...]
Re: [Slony1-general] slony1 drop node failure
On Mon, 26 Feb 2018, Tignor, Tom wrote:

> Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
> constraint violation appearing in the errors was for sl_event. Do we
> expect these changes to fully remove all state, including sl_event? The
> checkNodeDeleted function doesn’t look at sl_event.

It could be that this is a different issue, then. That function (the
dropNode_int) should have removed the rows from sl_event. The question then
becomes: did it not remove them for some reason, or did they get added back
later, and if so, how?

> Tom(
>
> [...]
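[Editor's sketch] One way to probe the "added back later" half of that question is a diagnostic on PostgreSQL's xmin system column (hypothetical, and only a rough ordering hint given xid wraparound):

  -- Which transaction inserted the leftover row? If its xmin is newer
  -- than the transaction that ran dropNode_int (compare against the
  -- xmin of a row the drop is known to have written), the row was
  -- re-created after the DELETE, which points at the race.
  SELECT xmin AS inserting_xid, ev_origin, ev_seqno, ev_timestamp
    FROM "_ams_cluster".sl_event
   WHERE ev_origin = 8;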
Re: [Slony1-general] slony1 drop node failure
Thanks. I see the deletes added for sl_seqlog and sl_log_script. The constraint violation appearing in the errors was for sl_event. Do we expect these changes to fully remove all state, including sl_event? The checkNodeDeleted function doesn’t look at sl_event.

Tom(

On 2/26/18, 11:03 AM, "Steve Singer" wrote:

On Mon, 26 Feb 2018, Tignor, Tom wrote:

You can get it from the github branch (latest commit) at
https://github.com/ssinger/slony1-engine/tree/bug375

> Steve,
> The patch link actually goes to a page which says “Bugzilla is down for
> maintenance.” Is there a way to see the patch currently? Does it exist, or
> is it scheduled for some Slony-I release?
>
> [...]
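[Editor's sketch] For completeness, an audit in the spirit of checkNodeDeleted but covering sl_event as well. This is a sketch: the con_* columns appear earlier in this thread, while seql_origin is assumed from the Slony-I 2.2 schema.

  -- Hypothetical post-DROP NODE audit: every count should be zero once
  -- node 8's state is fully gone.
  SELECT 'sl_event'   AS tab, count(*) AS leftover
    FROM "_ams_cluster".sl_event   WHERE ev_origin = 8
  UNION ALL
  SELECT 'sl_confirm', count(*)
    FROM "_ams_cluster".sl_confirm WHERE con_origin = 8 OR con_received = 8
  UNION ALL
  SELECT 'sl_seqlog',  count(*)
    FROM "_ams_cluster".sl_seqlog  WHERE seql_origin = 8
  UNION ALL
  SELECT 'sl_node',    count(*)
    FROM "_ams_cluster".sl_node    WHERE no_id = 8;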
Re: [Slony1-general] slony1 drop node failure
On Mon, 26 Feb 2018, Tignor, Tom wrote:

You can get it from the github branch (latest commit) at
https://github.com/ssinger/slony1-engine/tree/bug375

> Steve,
> The patch link actually goes to a page which says “Bugzilla is down for
> maintenance.” Is there a way to see the patch currently? Does it exist, or
> is it scheduled for some Slony-I release?
>
> Tom(
>
> [...]
Re: [Slony1-general] slony1 drop node failure
Steve,
The patch link actually goes to a page which says “Bugzilla is down for maintenance.” Is there a way to see the patch currently? Does it exist, or is it scheduled for some Slony-I release?

Tom(

On 2/22/18, 6:06 PM, "Steve Singer" wrote:

On Thu, 22 Feb 2018, Tignor, Tom wrote:

Looks like this?
http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html

I can't remember if that was what prompted
http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html

https://github.com/ssinger/slony1-engine/tree/bug375

I can't seem to find a reason why this wasn't committed.

> Hello slony1 community,
>
> [...]
Re: [Slony1-general] slony1 drop node failure
Wow. Yes, this looks a lot like the tmblue 9/10/16 event. I’ll be very interested to take a look at that patch. I haven’t forgotten the 2.2.6 patches I have in flight; I need to finish some other work and pull those off the back burner…

Thanks,

Tom(

On 2/22/18, 6:06 PM, "Steve Singer" wrote:

On Thu, 22 Feb 2018, Tignor, Tom wrote:

Looks like this?
http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html

I can't remember if that was what prompted
http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html

https://github.com/ssinger/slony1-engine/tree/bug375

I can't seem to find a reason why this wasn't committed.

> Hello slony1 community,
>
> [...]
Re: [Slony1-general] slony1 drop node failure
On Thu, 22 Feb 2018, Tignor, Tom wrote:

Looks like this?
http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html

I can't remember if that was what prompted
http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html

https://github.com/ssinger/slony1-engine/tree/bug375

I can't seem to find a reason why this wasn't committed.

> Hello slony1 community,
>
> We have a head scratcher here. It appears a DROP NODE command was not fully
> processed. The command was issued and confirmed on all our nodes at
> approximately 2018-02-21 19:19:50 UTC. When we went to restore it over two
> hours later, all replication stopped on an sl_event constraint violation.
>
> [...]
[Slony1-general] slony1 drop node failure
Hello slony1 community,

We have a head scratcher here. It appears a DROP NODE command was not fully processed. The command was issued and confirmed on all our nodes at approximately 2018-02-21 19:19:50 UTC. When we went to restore it over two hours later, all replication stopped on an sl_event constraint violation. Investigation showed a SYNC event for the dropped node with a timestamp of just a few seconds before the drop. I believe this is a first for us. The DROP NODE command is supposed to remove all state for the dropped node. Is that right? Is there a potential race condition somewhere which could leave behind state?

Thanks in advance,

master log replication freeze error

2018-02-21 21:38:52 UTC [5775] ERROR  remoteWorkerThread_8: "insert into "_ams_cluster".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot, ev_type ) values ('8', '5002075962', '2018-02-21 19:19:41.958719+00', '87044110:87044110:', 'SYNC'); insert into "_ams_cluster".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (8, 1, '5002075962', now()); select "_ams_cluster".logApplySaveStats('_ams_cluster', 8, '0.139 s'::interval); commit transaction;" PGRES_FATAL_ERROR ERROR:  duplicate key value violates unique constraint "sl_event-pkey"
DETAIL:  Key (ev_origin, ev_seqno)=(8, 5002075962) already exists.
2018-02-21 21:38:52 UTC [13649] CONFIG slon: child terminated signal: 9; pid: 5775, current worker pid: 5775
2018-02-21 21:38:52 UTC [13649] CONFIG slon: restart of worker in 10 seconds

master log replication freeze error

master DB leftover event

a...@ams6.cmb.netmgmt:~$ psql -U akamai -d ams
psql (9.1.24)
Type "help" for help.

ams=# select * from sl_event_bak;
 ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
-----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
         8 | 5002075962 | 2018-02-21 19:19:41.958719+00 | 87044110:87044110: | SYNC    |          |          |          |          |          |          |          |
(1 row)

ams=#

master DB leftover event

master log drop node record

2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=4 li_receiver=1 li_provider=4
2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=7 li_receiver=1 li_provider=7
2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: update provider configuration
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 4 terminated
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: disconnecting from data provider 4
2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 7 terminated

master log drop node record

replica log drop node record

2018-02-21 19:19:51 UTC [22650] WARN   remoteWorkerThread_1: got DROP NODE for local node ID
NOTICE:  Slony-I: Please drop schema "_ams_cluster"
2018-02-21 19:19:53 UTC [22650] INFO   remoteWorkerThread_7: SYNC 5001868819 done in 2.153 seconds
NOTICE:  drop cascades to 243 other objects
DETAIL:  drop cascades to table _ams_cluster.sl_node
drop cascades to table _ams_cluster.sl_nodelock
drop cascades to table _ams_cluster.sl_set
drop cascades to table _ams_cluster.sl_setsync
drop cascades to table _ams_cluster.sl_table
drop cascades to table _ams_cluster.sl_sequence

replica log drop node record

Tom☺

___
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general
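[Editor's sketch] A general sweep for the "leave behind state" question, usable on any node. This sketch relies only on the sl_event and sl_node columns shown above and lists events whose origin node no longer exists, which is exactly the kind of state a DROP NODE should have removed.

  -- Find sl_event rows whose origin node is gone from sl_node.
  SELECT e.ev_origin,
         count(*)            AS leftover_events,
         min(e.ev_timestamp) AS oldest,
         max(e.ev_timestamp) AS newest
    FROM "_ams_cluster".sl_event e
    LEFT JOIN "_ams_cluster".sl_node n ON n.no_id = e.ev_origin
   WHERE n.no_id IS NULL
   GROUP BY e.ev_origin;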