Re: [Slony1-general] failover failure and mysterious missing paths

2017-07-06 Thread Tignor, Tom

Hi Steve,
Your diagrams and description would make sense. In our failover, 
though, we selected node 3 (“failover (id=1, backup node=3)”). The output I 
provided (see below) seems to show we failed because we were missing paths 
2<->4 and 2<->5. With slony1-2.2.4, I can actually reproduce this by deleting 
those paths, and sometimes (not always) the failover will self-correct if I add 
them back after a prolonged delay. It seems in the problem case, 2 is farthest 
ahead at failover time, and while I might have hoped 2 would only have to feed 
3, it seems to have to feed everybody.
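For reference, in my 2.2.4 repro what un-sticks the waiting FAILOVER is 
simply re-adding the deleted paths with a small slonik script while the 
failover is still pending. A rough sketch of what I run (cluster name taken 
from our logs; the hostnames and conninfo values are placeholders, not our 
real ones):

cluster name = ams_cluster;

# admin conninfo for the nodes the script touches (placeholder values)
node 2 admin conninfo = 'dbname=ams host=node2.example.com user=slony';
node 4 admin conninfo = 'dbname=ams host=node4.example.com user=slony';
node 5 admin conninfo = 'dbname=ams host=node5.example.com user=slony';

# put back the missing paths in both directions so node 2 can feed 4 and 5
store path (server = 2, client = 4, conninfo = 'dbname=ams host=node2.example.com user=slony');
store path (server = 4, client = 2, conninfo = 'dbname=ams host=node4.example.com user=slony');
store path (server = 2, client = 5, conninfo = 'dbname=ams host=node2.example.com user=slony');
store path (server = 5, client = 2, conninfo = 'dbname=ams host=node5.example.com user=slony');
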
Sorry to hear about the Bugzilla spam. We are pretty well invested in 
slony1 at this point, so if there is a way I can contribute to this or other 
efforts, certainly let me know.
Thanks,

Tom(


On 7/5/17, 9:53 PM, "Steve Singer"  wrote:

On Wed, 5 Jul 2017, Tignor, Tom wrote:

>
>   Interesting. Of course the behavior evident on inspection indicated 
something like this must be happening.
>   It seems the doc could be improved on the subject of required paths. I 
recall some sections indicate it is not harmful to have a path from each node 
to each other node. What seems not to be spelled out is that for the service to 
be highly available, to have the ability to failover, each node is *required* 
to have a path to each other node.
>   On a related point, it would be a lot more convenient if we could give 
each node a default path instead of re-specifying the same IP for each new 
subscriber, and a new line of conninfo for every slonik script.
>   Would either of these items be worth writing up in bug tracking and/or 
providing the solution? If so, could I get that link?

You don't need a path to EVERY other node; you just need a path to the nodes 
that might be providers as part of the failover.

For example

1-->2-->3
|
V
4

If that is the direction of the replication flow, and those are the configured 
paths (plus the reverse paths), then node 2 is the only viable failover 
candidate for node 1.  There isn't a reason why nodes 3 and 4 need to have 
paths between each other.

However

1--->2
|\
V \
3  4

means that any of nodes 2, 3, or 4 might be failover candidates, and if node 3 
becomes the new origin then there would need to be paths between 3 and both 
2 and 4.

I tried to capture a lot of these rules in the sl_failover_targets view.


We had to take the slony bugzilla instance offline because of excessive spam.


Steve


>   Tom(
>
>
> On 7/2/17, 9:30 PM, "Steve Singer"  wrote:
>
>On Wed, 28 Jun 2017, Tignor, Tom wrote:
>
>>
>>  Hi Steve,
>>  Thanks for the info. I was able to repro this problem in 
testing and saw as soon as I added the missing path back the still-in-process 
failover op continued on and completed successfully.
>>  We do issue DROP NODEs in the event we need to restore a 
replica from scratch, which did occur. However, the restore workflow also 
should issue store paths to/from the new replica node and every other node. 
Still investigating this.
>>  What still confuses me is the recurring “remoteWorkerThread_X: 
SYNC” output, despite the fact of not having a configured path. If the path is 
missing, how does slon continue to get SYNC events?
>
>Slon can get events including SYNC from nodes other than the event 
>origin if it has a path to that node.  However, a slon can only 
>replicate the data from a node it has a path to.
>
>
>Steve
>
>
>
>>
>>  Tom(
>>
>>
>> On 6/27/17, 5:04 PM, "Steve Singer"  wrote:
>>
>>On 06/27/2017 11:59 AM, Tignor, Tom wrote:
>>
>>
>>The disableNode() in the log makes it look like someone did a DROP NODE
>>
>>If the only issue is that you're missing active paths in sl_path you can
>>add/update the paths with slonik.
>>
>>
>>
>>
>>>
>>> Hello Slony-I community,
>>>
>>>  Hoping someone can advise on a strange and serious 
problem.
>>> We performed a slony service failover yesterday. For the first 
time
>>> ever, our slony service FAILOVER op errored out. We recently 
expanded
>>> our cluster to 7 consumers from a single provider. There are no 
load
>>> issues during normal operations. As the error output below 
shows,
>>> though, our node 4 and node 5 consumers never got the events 
they
>>> needed. Here’s where it gets weird: closer inspection has shown 
that
>>> node 2->4 and 

Re: [Slony1-general] failover failure and mysterious missing paths

2017-07-05 Thread Steve Singer

On Wed, 5 Jul 2017, Tignor, Tom wrote:



Interesting. Of course the behavior evident on inspection indicated 
something like this must be happening.
It seems the doc could be improved on the subject of required paths. I 
recall some sections indicate it is not harmful to have a path from each node 
to each other node. What seems not to be spelled out is that for the service to 
be highly available, to have the ability to failover, each node is *required* 
to have a path to each other node.
On a related point, it would be a lot more convenient if we could give 
each node a default path instead of re-specifying the same IP for each new 
subscriber, and a new line of conninfo for every slonik script.
Would either of these items be worth writing up in bug tracking and/or 
providing the solution? If so, could I get that link?


You don't need a path to EVERY other node; you just need a path to the nodes 
that might be providers as part of the failover.


For example

1-->2-->3
|
V
4

If that is the direction of the replication flow, and those are the configured 
paths (plus the reverse paths), then node 2 is the only viable failover 
candidate for node 1.  There isn't a reason why nodes 3 and 4 need to have 
paths between each other.


However

1--->2
|\
V \
3  4

means that any of nodes 2, 3, or 4 might be failover candidates, and if node 3 
becomes the new origin then there would need to be paths between 3 and both 
2 and 4.
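
So in that second layout, before promoting node 3 you would want path entries 
in both directions between 3 and each of 2 and 4. A sketch only (the cluster 
name and conninfo strings below are placeholders):

cluster name = example_cluster;

node 2 admin conninfo = 'dbname=mydb host=node2 user=slony';
node 3 admin conninfo = 'dbname=mydb host=node3 user=slony';
node 4 admin conninfo = 'dbname=mydb host=node4 user=slony';

# paths between the candidate origin (3) and the other surviving nodes
store path (server = 3, client = 2, conninfo = 'dbname=mydb host=node3 user=slony');
store path (server = 2, client = 3, conninfo = 'dbname=mydb host=node2 user=slony');
store path (server = 3, client = 4, conninfo = 'dbname=mydb host=node3 user=slony');
store path (server = 4, client = 3, conninfo = 'dbname=mydb host=node4 user=slony');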


I tried to capture a lot of these rules in the sl_failover_targets view.


We had to take the slony bugzilla instance offline because of excessive spam.


Steve



Tom(


On 7/2/17, 9:30 PM, "Steve Singer"  wrote:

   On Wed, 28 Jun 2017, Tignor, Tom wrote:

   >
   > Hi Steve,
   > Thanks for the info. I was able to repro this problem in testing 
and saw as soon as I added the missing path back the still-in-process failover op 
continued on and completed successfully.
   > We do issue DROP NODEs in the event we need to restore a replica 
from scratch, which did occur. However, the restore workflow also should issue 
store paths to/from the new replica node and every other node. Still investigating 
this.
   > What still confuses me is the recurring “remoteWorkerThread_X: 
SYNC” output, despite the fact of not having a configured path. If the path is 
missing, how does slon continue to get SYNC events?

   Slon can get events including SYNC from nodes other than the event origin if
   it has a path to that node.  However, a slon can only replicate the data
   from a node it has a path to.


   Steve



   >
   > Tom(
   >
   >
   > On 6/27/17, 5:04 PM, "Steve Singer"  wrote:
   >
   >On 06/27/2017 11:59 AM, Tignor, Tom wrote:
   >
   >
   >The disableNode() in the log makes it look like someone did a DROP NODE
   >
   >If the only issue is that you're missing active paths in sl_path you can
   >add/update the paths with slonik.
   >
   >
   >
   >
   >>
   >> Hello Slony-I community,
   >>
   >>  Hoping someone can advise on a strange and serious 
problem.
   >> We performed a slony service failover yesterday. For the first time
   >> ever, our slony service FAILOVER op errored out. We recently expanded
   >> our cluster to 7 consumers from a single provider. There are no load
   >> issues during normal operations. As the error output below shows,
   >> though, our node 4 and node 5 consumers never got the events they
   >> needed. Here’s where it gets weird: closer inspection has shown that
   >> node 2->4 and node 2->5 path data went missing out of the service at
   >> some point. It seems clear that’s the main issue, but in spite of 
that,
   >> both node 4 and node 5 continued to find and process node 2 SYNC 
events
   >> for a full week! The logs show this happened in spite of multiple 
restarts.
   >>
   >> How can this happen? If missing path data stymies the failover, 
wouldn’t
   >> it also prevent normal SYNC processing?
   >>
   >> In the case where a failover is begun with inadequate path data, 
what’s
   >> the best resolution? Can path data be quickly applied to allow 
failover
   >> to succeed?
   >>
   >>  Thanks in advance for any insights.
   >>
   >>  failover error 
   >>
   >> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: 
NOTICE:
   >> calling restart node 1
   >>
   >> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
   >> 2017-06-26 18:33:02
   >>
   >> executing preFailover(1,1) on 2
   >>
   >> executing preFailover(1,1) on 3
   >>
   >> executing preFailover(1,1) on 4
   >>
   >> executing preFailover(1,1) on 5
   >>
   >> executing preFailover(1,1) on 6
   >>
   >> executing preFailover(1,1) on 7
   >>
   

Re: [Slony1-general] failover failure and mysterious missing paths

2017-07-05 Thread Tignor, Tom

Interesting. Of course the behavior evident on inspection indicated 
something like this must be happening. 
It seems the doc could be improved on the subject of required paths. I 
recall some sections indicate it is not harmful to have a path from each node 
to each other node. What seems not to be spelled out is that for the service to 
be highly available, to have the ability to failover, each node is *required* 
to have a path to each other node. 
On a related point, it would be a lot more convenient if we could give 
each node a default path instead of re-specifying the same IP for each new 
subscriber, and a new line of conninfo for every slonik script. 
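For now, something like a shared preamble that every script includes would at 
least cut down the duplication; a rough sketch using slonik's include/define 
(the file name, hostnames and conninfo values are just placeholders):

# preamble.slonik -- shared by all of our slonik scripts
cluster name = ams_cluster;
define node2_conninfo 'dbname=ams host=node2.example.com user=slony';
define node4_conninfo 'dbname=ams host=node4.example.com user=slony';
node 2 admin conninfo = @node2_conninfo;
node 4 admin conninfo = @node4_conninfo;

# any-maintenance-script.slk
include <preamble.slonik>;
store path (server = 2, client = 4, conninfo = @node2_conninfo);
store path (server = 4, client = 2, conninfo = @node4_conninfo);

That still is not a true per-node default path, but it avoids re-typing the 
same IPs into every new script.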
Would either of these items be worth writing up in bug tracking and/or 
providing the solution? If so, could I get that link?

Tom(


On 7/2/17, 9:30 PM, "Steve Singer"  wrote:

On Wed, 28 Jun 2017, Tignor, Tom wrote:

>
>   Hi Steve,
>   Thanks for the info. I was able to repro this problem in testing and 
saw as soon as I added the missing path back the still-in-process failover op 
continued on and completed successfully.
>   We do issue DROP NODEs in the event we need to restore a replica from 
scratch, which did occur. However, the restore workflow also should issue store 
paths to/from the new replica node and every other node. Still investigating 
this.
>   What still confuses me is the recurring “remoteWorkerThread_X: SYNC” 
output, despite the fact of not having a configured path. If the path is 
missing, how does slon continue to get SYNC events?

Slon can get events including SYNC from nodes other than the event origin if 
it has a path to that node.  However, a slon can only replicate the data 
from a node it has a path to.


Steve



>
>   Tom(
>
>
> On 6/27/17, 5:04 PM, "Steve Singer"  wrote:
>
>On 06/27/2017 11:59 AM, Tignor, Tom wrote:
>
>
>The disableNode() in the log makes it look like someone did a DROP NODE
>
>If the only issue is that you're missing active paths in sl_path you can
>add/update the paths with slonik.
>
>
>
>
>>
>> Hello Slony-I community,
>>
>>  Hoping someone can advise on a strange and serious 
problem.
>> We performed a slony service failover yesterday. For the first time
>> ever, our slony service FAILOVER op errored out. We recently expanded
>> our cluster to 7 consumers from a single provider. There are no load
>> issues during normal operations. As the error output below shows,
>> though, our node 4 and node 5 consumers never got the events they
>> needed. Here’s where it gets weird: closer inspection has shown that
>> node 2->4 and node 2->5 path data went missing out of the service at
>> some point. It seems clear that’s the main issue, but in spite of 
that,
>> both node 4 and node 5 continued to find and process node 2 SYNC 
events
>> for a full week! The logs show this happened in spite of multiple 
restarts.
>>
>> How can this happen? If missing path data stymies the failover, 
wouldn’t
>> it also prevent normal SYNC processing?
>>
>> In the case where a failover is begun with inadequate path data, 
what’s
>> the best resolution? Can path data be quickly applied to allow 
failover
>> to succeed?
>>
>>  Thanks in advance for any insights.
>>
>>  failover error 
>>
>> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: 
NOTICE:
>> calling restart node 1
>>
>> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
>> 2017-06-26 18:33:02
>>
>> executing preFailover(1,1) on 2
>>
>> executing preFailover(1,1) on 3
>>
>> executing preFailover(1,1) on 4
>>
>> executing preFailover(1,1) on 5
>>
>> executing preFailover(1,1) on 6
>>
>> executing preFailover(1,1) on 7
>>
>> executing preFailover(1,1) on 8
>>
>> NOTICE: executing "_ams_cluster".failedNode2 on node 2
>>
>> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: 
waiting
>> for event (2,561664).  node 8 only on event 561654, node 4 
only
>> on event 561654, node 5 only on event 561655, node 3 only on
>> event 561662, node 6\
>>
>>   only on event 561654, node 7 only on event 561656
>>
>> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: 
waiting
>> for event (2,561664).  node 4 only on event 561657, node 5 
only
>> on event 561663, 

Re: [Slony1-general] failover failure and mysterious missing paths

2017-07-02 Thread Steve Singer

On Wed, 28 Jun 2017, Tignor, Tom wrote:



Hi Steve,
Thanks for the info. I was able to repro this problem in testing and 
saw as soon as I added the missing path back the still-in-process failover op 
continued on and completed successfully.
We do issue DROP NODEs in the event we need to restore a replica from 
scratch, which did occur. However, the restore workflow also should issue store 
paths to/from the new replica node and every other node. Still investigating 
this.
What still confuses me is the recurring “remoteWorkerThread_X: SYNC” 
output, despite the fact of not having a configured path. If the path is 
missing, how does slon continue to get SYNC events?


Slon can get events including SYNC from nodes other than the event origin if 
it has a path to that node.  However, a slon can only replicate the data 
from a node it has a path to.



Steve





Tom(


On 6/27/17, 5:04 PM, "Steve Singer"  wrote:

   On 06/27/2017 11:59 AM, Tignor, Tom wrote:


   The disableNode() in the log makes it look like someone did a DROP NODE

   If the only issue is that you're missing active paths in sl_path you can
   add/update the paths with slonik.




   >
   > Hello Slony-I community,
   >
   >  Hoping someone can advise on a strange and serious problem.
   > We performed a slony service failover yesterday. For the first time
   > ever, our slony service FAILOVER op errored out. We recently expanded
   > our cluster to 7 consumers from a single provider. There are no load
   > issues during normal operations. As the error output below shows,
   > though, our node 4 and node 5 consumers never got the events they
   > needed. Here’s where it gets weird: closer inspection has shown that
   > node 2->4 and node 2->5 path data went missing out of the service at
   > some point. It seems clear that’s the main issue, but in spite of that,
   > both node 4 and node 5 continued to find and process node 2 SYNC events
   > for a full week! The logs show this happened in spite of multiple restarts.
   >
   > How can this happen? If missing path data stymies the failover, wouldn’t
   > it also prevent normal SYNC processing?
   >
   > In the case where a failover is begun with inadequate path data, what’s
   > the best resolution? Can path data be quickly applied to allow failover
   > to succeed?
   >
   >  Thanks in advance for any insights.
   >
   >  failover error 
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
   > calling restart node 1
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
   > 2017-06-26 18:33:02
   >
   > executing preFailover(1,1) on 2
   >
   > executing preFailover(1,1) on 3
   >
   > executing preFailover(1,1) on 4
   >
   > executing preFailover(1,1) on 5
   >
   > executing preFailover(1,1) on 6
   >
   > executing preFailover(1,1) on 7
   >
   > executing preFailover(1,1) on 8
   >
   > NOTICE: executing "_ams_cluster".failedNode2 on node 2
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 8 only on event 561654, node 4 only
   > on event 561654, node 5 only on event 561655, node 3 only on
   > event 561662, node 6\
   >
   >   only on event 561654, node 7 only on event 561656
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561657, node 5 only
   > on event 561663, node 3 only on event 561663, node 6 only on
   > event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663, node 6 only on event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663
   >
   > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
   > for event (2,561664).  node 4 only on event 561663, node 5 only
   > on event 561663
   >
   > 

Re: [Slony1-general] failover failure and mysterious missing paths

2017-06-28 Thread Tignor, Tom

Hi Steve,
Thanks for the info. I was able to repro this problem in testing and 
saw that, as soon as I added the missing path back, the still-in-process 
failover op continued on and completed successfully.
We do issue DROP NODEs in the event we need to restore a replica from 
scratch, which did occur. However, the restore workflow also should issue store 
paths to/from the new replica node and every other node. Still investigating 
this.
What still confuses me is the recurring “remoteWorkerThread_X: SYNC” 
output, despite the fact that there is no configured path. If the path is 
missing, how does slon continue to get SYNC events?

Tom(


On 6/27/17, 5:04 PM, "Steve Singer"  wrote:

On 06/27/2017 11:59 AM, Tignor, Tom wrote:


The disableNode() in the log makes it look like someone did a DROP NODE

If the only issue is that you're missing active paths in sl_path you can 
add/update the paths with slonik.




>
> Hello Slony-I community,
>
>  Hoping someone can advise on a strange and serious problem.
> We performed a slony service failover yesterday. For the first time
> ever, our slony service FAILOVER op errored out. We recently expanded
> our cluster to 7 consumers from a single provider. There are no load
> issues during normal operations. As the error output below shows,
> though, our node 4 and node 5 consumers never got the events they
> needed. Here’s where it gets weird: closer inspection has shown that
> node 2->4 and node 2->5 path data went missing out of the service at
> some point. It seems clear that’s the main issue, but in spite of that,
> both node 4 and node 5 continued to find and process node 2 SYNC events
> for a full week! The logs show this happened in spite of multiple 
restarts.
>
> How can this happen? If missing path data stymies the failover, wouldn’t
> it also prevent normal SYNC processing?
>
> In the case where a failover is begun with inadequate path data, what’s
> the best resolution? Can path data be quickly applied to allow failover
> to succeed?
>
>  Thanks in advance for any insights.
>
>  failover error 
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
> calling restart node 1
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
> 2017-06-26 18:33:02
>
> executing preFailover(1,1) on 2
>
> executing preFailover(1,1) on 3
>
> executing preFailover(1,1) on 4
>
> executing preFailover(1,1) on 5
>
> executing preFailover(1,1) on 6
>
> executing preFailover(1,1) on 7
>
> executing preFailover(1,1) on 8
>
> NOTICE: executing "_ams_cluster".failedNode2 on node 2
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 8 only on event 561654, node 4 only
> on event 561654, node 5 only on event 561655, node 3 only on
> event 561662, node 6\
>
>   only on event 561654, node 7 only on event 561656
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561657, node 5 only
> on event 561663, node 3 only on event 561663, node 6 only on
> event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663, node 6 only on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 

Re: [Slony1-general] failover failure and mysterious missing paths

2017-06-27 Thread Steve Singer
On 06/27/2017 11:59 AM, Tignor, Tom wrote:


The disableNode() in the log makes it look like someone did a DROP NODE

If the only issue is that you're missing active paths in sl_path you can 
add/update the paths with slonik.
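
For example, something along these lines re-adds a missing path (the conninfo 
strings are placeholders); as far as I recall, STORE PATH will also just 
update the conninfo if the path row already exists:

cluster name = ams_cluster;
node 2 admin conninfo = 'dbname=ams host=node2.example.com user=slony';
node 4 admin conninfo = 'dbname=ams host=node4.example.com user=slony';

store path (server = 2, client = 4, conninfo = 'dbname=ams host=node2.example.com user=slony');
store path (server = 4, client = 2, conninfo = 'dbname=ams host=node4.example.com user=slony');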




>
> Hello Slony-I community,
>
>  Hoping someone can advise on a strange and serious problem.
> We performed a slony service failover yesterday. For the first time
> ever, our slony service FAILOVER op errored out. We recently expanded
> our cluster to 7 consumers from a single provider. There are no load
> issues during normal operations. As the error output below shows,
> though, our node 4 and node 5 consumers never got the events they
> needed. Here’s where it gets weird: closer inspection has shown that
> node 2->4 and node 2->5 path data went missing out of the service at
> some point. It seems clear that’s the main issue, but in spite of that,
> both node 4 and node 5 continued to find and process node 2 SYNC events
> for a full week! The logs show this happened in spite of multiple restarts.
>
> How can this happen? If missing path data stymies the failover, wouldn’t
> it also prevent normal SYNC processing?
>
> In the case where a failover is begun with inadequate path data, what’s
> the best resolution? Can path data be quickly applied to allow failover
> to succeed?
>
>  Thanks in advance for any insights.
>
>  failover error 
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
> calling restart node 1
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
> 2017-06-26 18:33:02
>
> executing preFailover(1,1) on 2
>
> executing preFailover(1,1) on 3
>
> executing preFailover(1,1) on 4
>
> executing preFailover(1,1) on 5
>
> executing preFailover(1,1) on 6
>
> executing preFailover(1,1) on 7
>
> executing preFailover(1,1) on 8
>
> NOTICE: executing "_ams_cluster".failedNode2 on node 2
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 8 only on event 561654, node 4 only
> on event 561654, node 5 only on event 561655, node 3 only on
> event 561662, node 6\
>
>   only on event 561654, node 7 only on event 561656
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561657, node 5 only
> on event 561663, node 3 only on event 561663, node 6 only on
> event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663, node 6 only on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
> /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
> for event (2,561664).  node 4 only on event 561663, node 5 only
> on event 561663
>
>  node 4 log archive 
>
> bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
> pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
>
> 2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart
> notification
>
> 2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4
> pa_conninfo="dbname=ams
>
> 2017-06-15 15:53:00 UTC [8431] INFO   

[Slony1-general] failover failure and mysterious missing paths

2017-06-27 Thread Tignor, Tom

Hello Slony-I community,
Hoping someone can advise on a strange and serious problem. We 
performed a slony service failover yesterday. For the first time ever, our 
slony service FAILOVER op errored out. We recently expanded our cluster to 7 
consumers from a single provider. There are no load issues during normal 
operations. As the error output below shows, though, our node 4 and node 5 
consumers never got the events they needed. Here’s where it gets weird: closer 
inspection has shown that node 2->4 and node 2->5 path data went missing out of 
the service at some point. It seems clear that’s the main issue, but in spite 
of that, both node 4 and node 5 continued to find and process node 2 SYNC 
events for a full week! The logs show this happened in spite of multiple 
restarts.
How can this happen? If missing path data stymies the failover, wouldn’t it 
also prevent normal SYNC processing?
In the case where a failover is begun with inadequate path data, what’s the 
best resolution? Can path data be quickly applied to allow failover to succeed?
Thanks in advance for any insights.


 failover error 

/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:  
calling restart node 1
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55: 2017-06-26 
18:33:02
executing preFailover(1,1) on 2
executing preFailover(1,1) on 3
executing preFailover(1,1) on 4
executing preFailover(1,1) on 5
executing preFailover(1,1) on 6
executing preFailover(1,1) on 7
executing preFailover(1,1) on 8
NOTICE: executing "_ams_cluster".failedNode2 on node 2
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 8 only on event 561654, node 4 only on event 
561654, node 5 only on event 561655, node 3 only on event 561662, 
node 6\
 only on event 561654, node 7 only on event 561656
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561657, node 5 only on event 
561663, node 3 only on event 561663, node 6 only on event 561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663, node 6 only on event 561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for 
event (2,561664).  node 4 only on event 561663, node 5 only on event 
561663


 node 4 log archive 

bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 
pa_client=4|restart notification' prod4/node4-pathconfig.out
2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart 
notification
2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4 
pa_conninfo="dbname=ams
2017-06-15 15:53:00 UTC [8431] INFO   localListenThread: got restart 
notification
2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2 pa_client=4 
pa_conninfo="dbname=ams
2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2 pa_client=4 
pa_conninfo="dbname=ams
2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4 
pa_conninfo="dbname=ams
2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
2017-06-19 15:11:45