Re: [users] Opensaf-users Digest, Vol 22, Issue 2

Shu Wang Mon, 23 Mar 2015 15:23:50 -0700

My comments are marked with <SHU>.

Shu Wang | Senior Analyst | +1(407)708-5117 or x3917| www.NetCracker.com
Proven Partner to Communications Service Providers



-----Original Message-----
From: [email protected] 
[mailto:[email protected]]
Sent: Friday, March 20, 2015 7:15 AM
To: [email protected]
Subject: Opensaf-users Digest, Vol 22, Issue 2

Send Opensaf-users mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.sourceforge.net/lists/listinfo/opensaf-users
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific than "Re: 
Contents of Opensaf-users digest..."


Today's Topics:

   1. Re: amf-adm question ... (Johnson, Charles)
   2. Re: amf-adm question ... (praveen malviya)
   3. Service Units are in Terminating State (Shu Wang)
   4. Re: Service Units are in Terminating State (A V Mahesh)
   5. Re: Service Units are in Terminating State (praveen malviya)


----------------------------------------------------------------------

Message: 1
Date: Tue, 10 Mar 2015 23:26:25 +0000
From: "Johnson, Charles" <[email protected]>
Subject: Re: [users] amf-adm question ...
To: praveen malviya <[email protected]>,
        "[email protected]"
        <[email protected]>
Message-ID:
        <138f607845c24246a37786deda499d7a0614d...@g6w2495.americas.hpqcorp.net>

Content-Type: text/plain; charset="us-ascii"

So, that works and it scales out, and I encountered yet another issue ...

I scaled this out on ten Ethernet-connected native nodes, running 8 to 10 
service processes per node in 2N groups and got it working by changing the xml 
for the payload-only nodes using your changes.

Thank you very much, Praveen !!

I was only able to get this working by installing all the software (OpenSAF and 
my services) on every node, bringing up the vanilla imm.xml for the native 
OpenSAF services, and then, once OpenSAF was up and running, issuing one immcfg 
and four amf-adm commands for each service group (42 service groups) in serial 
order from a generated bash script.
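As an illustration of the per-group bring-up described above, here is a minimal sketch of a generator for such a script. It is not the script actually used: the DN pattern, the per-group XML file name, and the choice and order of the four amf-adm operations (unlock-in, then unlock, on each of two SUs) are assumptions, and every function name is hypothetical.

```python
def bringup_commands(app, sg, su_names, config_file):
    """Return the shell commands for one service group, in serial order.

    Assumed sequence: load the group's objects with immcfg, then
    unlock-in every SU, then unlock every SU.
    """
    dns = [f"safSu={su},safSg={sg},safApp={app}" for su in su_names]
    cmds = [f"immcfg -f {config_file}"]
    cmds += [f"amf-adm unlock-in {dn}" for dn in dns]
    cmds += [f"amf-adm unlock {dn}" for dn in dns]
    return cmds


def generate_script(groups):
    """Render a bash script that brings the groups up one after another."""
    lines = ["#!/bin/bash", "set -e  # stop at the first failing command"]
    for group in groups:
        lines += bringup_commands(**group)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example group; names borrowed from later in this digest.
    example = [{
        "app": "myApp",
        "sg": "amfSG2",
        "su_names": ["amfSU2.1", "amfSU2.2"],
        "config_file": "amfSG2.xml",
    }]
    print(generate_script(example), end="")
```

With two SUs per 2N group this yields one immcfg and four amf-adm lines per group, which matches the counts mentioned above.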

Separately, although I was able to execute the immxml-merge command 
successfully (using the --ignore-variants flag to get rid of the errors from 
common objects in each 2N xml file for each service) and thus produce a valid 
and totally inclusive imm.xml for opensaf + my services (in the exact same 
object order that the commands from the bash file would add them at runtime), 
when I tried to bring opensaf up (sudo service opensafd start) on the two 
controller nodes, or just one of them, it either fails to start or crashes the 
controller node. I checked, and all the merged imm.xml files are identical on 
all nodes, and all the software is installed identically for both the unmerged 
and merged cases.

My thought is that OpenSAF cannot orchestrate the startup of the cluster with 
that many services (outside of its own well-orchestrated startup sequence for 
the native opensaf service taxonomy) and gets in a traffic jam internally, 
hangs or crashes, but does not start.

There appears to have been a service called SCAP sometime in the past, where 
you would modify a file called NCSSystemBOM.xml to add your service to get 
started when OpenSAF first comes up, but that seems not to be the case anymore. 
Did that framework for startup disappear, or get replaced by something else?

Or is there some magic thing I need to do when I do immxml-merge 
--ignore-variants, to allow OpenSAF to come up with that merged imm.xml without 
hanging? If you could not do that magic thing, there would be no use for 
immxml-merge, and it would probably not exist for long, so there must be a 
magic thing, I reckon!

Charlie ...

-----Original Message-----
From: praveen malviya [mailto:[email protected]]
Sent: Sunday, February 15, 2015 8:12 PM
To: Johnson, Charles; [email protected]
Subject: Re: [users] amf-adm question ...



On 14-Feb-15 12:40 AM, Johnson, Charles wrote:
>
> Interesting. When I try to move the AmfDemo and run it from PL-4 and PL-3 
> instead of SC-1 and SC-2, it fails to load.
>
> What I did was to change the AppConfig-2N.xml file doing those
> substitutions, the text is included below (it's not long.)
>
> The log states that it cannot find the script which is in the same place it 
> was on all the nodes (/opt/amf_demo/amf_demo_script), or that it is corrupt, 
> which it is not.
>
> Works fine in the controller nodes, not in the payload nodes: am I missing 
> some limitation regarding Amf?
Please configure the attribute below in the SU object to host it on the desired 
node. The sample configuration does not set this attribute.
<attr>
        <name>saAmfSUHostNodeOrNodeGroup</name>
        <value>safAmfNode=SC-1,safAmfCluster=myAmfCluster</value>
</attr>
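For readers who need to apply this to many SUs, the following is a minimal sketch that sets saAmfSUHostNodeOrNodeGroup on a single SU object with Python's standard xml.etree.ElementTree. The `<object class="SaAmfSU">` wrapper, the function name, and the example DN are assumptions for illustration; only the attribute name and the value format come from the snippet above.

```python
import xml.etree.ElementTree as ET

ATTR = "saAmfSUHostNodeOrNodeGroup"


def set_su_host(su_xml, node, cluster="myAmfCluster"):
    """Return su_xml with the SU pinned to the given AMF node.

    su_xml is assumed to be one <object class="SaAmfSU"> element.
    An existing host attribute is overwritten; otherwise one is added.
    """
    root = ET.fromstring(su_xml)
    value = f"safAmfNode={node},safAmfCluster={cluster}"
    for attr in root.iter("attr"):
        if attr.findtext("name") == ATTR:
            val = attr.find("value")
            if val is None:
                val = ET.SubElement(attr, "value")
            val.text = value
            break
    else:
        # No host attribute configured yet (as in the sample config).
        attr = ET.SubElement(root, "attr")
        ET.SubElement(attr, "name").text = ATTR
        ET.SubElement(attr, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    su = ('<object class="SaAmfSU">'
          "<dn>safSu=amfSU2.1,safSg=amfSG2,safApp=myApp</dn>"
          "</object>")
    print(set_su_host(su, "PL-3"))
```

Calling it once per SU with PL-3 or PL-4 as the node would move a 2N pair from the controllers to the payload nodes, under the assumptions stated above.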

See below the configuration with changes.

Thanks
Praveen



------------------------------

Message: 2
Date: Fri, 13 Mar 2015 10:52:31 +0530
From: praveen malviya <[email protected]>
Subject: Re: [users] amf-adm question ...
To: "Johnson, Charles" <[email protected]>,
        "[email protected]"
        <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252; format=flowed



On 11-Mar-15 4:56 AM, Johnson, Charles wrote:
> There appears to have been a service called SCAP sometime in the past, where 
> you would modify a file called NCSSystemBOM.xml to add your service to get 
> started when OpenSAF first comes up, but that seems not to be the case 
> anymore. Did that framework for startup disappear, or get replaced by 
> something else?
>
> Or is there some magic thing I need to do when I do immxml-merge 
> --ignore-variants, to allow OpenSAF to come up with that merged imm.xml 
> without hanging? If you could not do that magic thing, there would be no use 
> for immxml-merge, and it would probably not exist for long, so there must be 
> a magic thing, I reckon!
>
I have never used immxml-merge --ignore-variants.
The startup framework is still in place: any AMF-modeled application will come
up during cluster start, after the cluster startup timer expires, provided all
AMF model objects are valid.
Please share the error messages/syslog, and also the imm.xml if possible.

Thanks.
Praveen



------------------------------

Message: 3
Date: Thu, 19 Mar 2015 19:55:24 +0000
From: Shu Wang <[email protected]>
Subject: [users] Service Units are in Terminating State
To: "[email protected]"
        <[email protected]>
Cc: David S Thompson <[email protected]>,   Lisa Ann
        Lentz-Liddell <[email protected]>, William R  Elliott
        <[email protected]>
Message-ID:
        <3bd0b3dd1eb0044ebb3242d42126621e1bb80...@planetdb3.netcracker.com>
Content-Type: text/plain; charset="us-ascii"

We have a scenario in which nodes lost contact for 10 seconds and then 
rejoined, after which some service units ended up in the TERMINATING state.

For example, the following message was seen from /var/log/messages:
NO Lost contact with 'appbox'

We saw that some service units on the same box were disabled. We then performed 
lock and lock-in on a disabled service unit:
amf-adm lock safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
amf-adm lock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp

Then we tried the following commands:
amf-adm repaired safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
amf-adm unlock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp

For either repaired or unlock-in, we got the following error:
error - command timed out (alarm)

SU state stayed as:
safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
         saAmfSUAdminState=LOCKED-INSTANTIATION(3)
         saAmfSUOperState=ENABLED(1)
         saAmfSUPresenceState=TERMINATING(4)
         saAmfSUReadinessState=OUT-OF-SERVICE(1)
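To spot SUs in this condition across a larger cluster, a state dump in the layout shown above can be scanned mechanically. The following is a minimal sketch: it assumes the text has exactly this layout (a DN line followed by NAME=VALUE(code) lines), and the function names are hypothetical.

```python
import re

STATE_LINE = re.compile(r"(saAmf\w+)=([A-Z-]+)\((\d+)\)")

SAMPLE = """\
safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
         saAmfSUAdminState=LOCKED-INSTANTIATION(3)
         saAmfSUOperState=ENABLED(1)
         saAmfSUPresenceState=TERMINATING(4)
         saAmfSUReadinessState=OUT-OF-SERVICE(1)
"""


def parse_su_states(text):
    """Parse a state dump into {su_dn: {attribute: (name, code)}}."""
    sus, current = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = STATE_LINE.fullmatch(stripped)
        if match and current is not None:
            attr, name, code = match.groups()
            sus[current][attr] = (name, int(code))
        elif stripped.startswith("safSu="):
            current = stripped
            sus[current] = {}
    return sus


def stuck_terminating(sus):
    """Return the DNs whose presence state is TERMINATING."""
    return [dn for dn, states in sus.items()
            if states.get("saAmfSUPresenceState", ("", 0))[0] == "TERMINATING"]


if __name__ == "__main__":
    for dn in stuck_terminating(parse_su_states(SAMPLE)):
        print(dn)
```

The sample text is the state block from this message; in practice the input would be whatever state listing the cluster produces.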

Eventually we had to stop and restart the node to bring things back to normal.

Why is the disabled service unit stuck in the TERMINATING state, and what 
caused it to get stuck there?
If a node is lost for a little while, what effect does that loss of contact 
have on the cluster?
How do we repair the damage caused by the node loss?

Thanks!

Shu Wang | Senior Analyst | +1(407)708-5117 or x3917| www.NetCracker.com
Proven Partner to Communications Service Providers




________________________________
The information transmitted herein is intended only for the person or entity to 
which it is addressed and may contain confidential, proprietary and/or 
privileged material. Any review, retransmission, dissemination or other use of, 
or taking of any action in reliance upon, this information by persons or 
entities other than the intended recipient is prohibited. If you received this 
in error, please contact the sender and delete the material from any computer.


------------------------------

Message: 4
Date: Fri, 20 Mar 2015 11:26:36 +0530
From: A V Mahesh <[email protected]>
Subject: Re: [users] Service Units are in Terminating State
To: [email protected], [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Shu Wang,

It seems you are using the TCP transport. Please provide the following data.


On 3/20/2015 1:25 AM, Shu Wang wrote:
> Then nodes lost contact for 10 seconds and rejoined

Can you please share your observations on why the faulty node didn't go for a
reboot?
Have you customized any of the /etc/opensaf/dtmd.conf configuration?
Please share /var/log/messages from SC-1 and SC-2.

<SHU> They went for reboot, but the reboot was not successful, and they 
eventually went to the disabled state.

We had a payload reboot and an active controller reboot, and the standby 
controller became active.
We then tried to repair the cluster by stopping and starting the disabled nodes.
The payload node with the terminating SUs was not in the disabled state. 
However, stopping and starting that node got the terminating SUs back to the 
uninstantiated state, and we were then able to unlock-instantiation and unlock 
the SUs to bring things back to normal.
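The recovery sequence just described can be summarized as a small decision helper. This is only a sketch of what was reported in this thread, not AMF's documented behaviour: the mapping from states to steps is an assumption drawn from the steps above, and the function name is hypothetical.

```python
def next_step(dn, admin_state, presence_state):
    """Suggest the next manual step for one SU, per the sequence in this thread.

    Returns a command or instruction string, or None if nothing is pending.
    """
    if presence_state == "TERMINATING":
        # Reported here: repaired/unlock-in timed out in this state, and only
        # stopping and starting the node cleared it.
        return "stop and start OpenSAF on the node hosting " + dn
    if admin_state == "LOCKED-INSTANTIATION" and presence_state == "UNINSTANTIATED":
        return "amf-adm unlock-in " + dn
    if admin_state == "LOCKED":
        return "amf-adm unlock " + dn
    return None


if __name__ == "__main__":
    su = "safSu=amfSU2.1,safSg=amfSG2,safApp=myApp"
    print(next_step(su, "LOCKED-INSTANTIATION", "TERMINATING"))
    print(next_step(su, "LOCKED-INSTANTIATION", "UNINSTANTIATED"))
    print(next_step(su, "LOCKED", "INSTANTIATED"))
```

Checking the presence state first reflects the observation that admin operations made no progress while the SU was still TERMINATING.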

/etc/opensaf/dtmd.conf has been customized with:
DTM_TCP_KEEPALIVE_PROBES=5
DTM_SOCK_SND_RCV_BUF_SIZE=761856

On 3/20/2015 1:25 AM, Shu Wang wrote:
> For example, the following message was seen from /var/log/messages:
> NO Lost contact with 'appbox'

In the default configuration the `node_name` should be SC-1, SC-2, PL-3,
etc.
Have you customized imm.xml with your own `node_name`s?

<SHU> Of course.

-AVM





------------------------------

Message: 5
Date: Fri, 20 Mar 2015 16:44:34 +0530
From: praveen malviya <[email protected]>
Subject: Re: [users] Service Units are in Terminating State
To: Shu Wang <[email protected]>,
        "[email protected]"
        <[email protected]>
Cc: Lisa Ann Lentz-Liddell <[email protected]>,
        David S Thompson <[email protected]>, William R Elliott
        <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi,

Please see questions inline:

On 20-Mar-15 1:25 AM, Shu Wang wrote:
> We have a scenario when nodes lost contact for 10 seconds and rejoined, some 
> service units ended up in Terminating state.
>
> For example, the following message was seen from /var/log/messages:
> NO Lost contact with 'appbox'
>
> We saw some service units on the same box disabled. Then we performed lock 
> and lock-in on the disabled service unit:
> amf-adm lock safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
> amf-adm lock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>
> Then we tried the following commands:
> amf-adm repaired safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
> amf-adm unlock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>
> For either repaired or unlock-in, we got the following error:
> error - command timed out (alarm)
>
> SU state stayed as:
> safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>           saAmfSUAdminState=LOCKED-INSTANTIATION(3)
>           saAmfSUOperState=ENABLED(1)
>           saAmfSUPresenceState=TERMINATING(4)
>           saAmfSUReadinessState=OUT-OF-SERVICE(1)
>

Which OpenSAF release are you using?

<SHU> We use OpenSAF 4.4 RC2

What is the recovery policy of the SU? Do you see any fault reported on
any component of this SU by AMF in the syslog? (like SU failover?)

<SHU> It is SU failover

Also note that link flaps are not supported.
Assuming this is a scenario where the link was brought down (for example,
interface down or cable pulled out, each of which leads to loss of the socket
connection with that node), the ideal behaviour is that the node leaves the
cluster and cannot rejoin without a restart of OpenSAF (including
re-establishment of the network connection).


Thanks,
Praveen




------------------------------

------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/

------------------------------

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users


End of Opensaf-users Digest, Vol 22, Issue 2
********************************************


