Re: [users] Opensaf-users Digest, Vol 22, Issue 2

praveen malviya Mon, 23 Mar 2015 21:33:07 -0700

Hi,
Please see my responses inline, marked [Praveen].

Thanks,
Praveen


On 24-Mar-15 3:53 AM, Shu Wang wrote:
> My comments started with <SHU>
>
> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917| www.NetCracker.com
> Proven Partner to Communications Service Providers
>
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]]
> Sent: Friday, March 20, 2015 7:15 AM
> To: [email protected]
> Subject: Opensaf-users Digest, Vol 22, Issue 2
>
> Send Opensaf-users mailing list submissions to
>          [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>          https://lists.sourceforge.net/lists/listinfo/opensaf-users
> or, via email, send a message with subject or body 'help' to
>          [email protected]
>
> You can reach the person managing the list at
>          [email protected]
>
> When replying, please edit your Subject line so it is more specific than "Re: 
> Contents of Opensaf-users digest..."
>
>
> Today's Topics:
>
>     1. Re: amf-adm question ... (Johnson, Charles)
>     2. Re: amf-adm question ... (praveen malviya)
>     3. Service Units are in Terminating State (Shu Wang)
>     4. Re: Service Units are in Terminating State (A V Mahesh)
>     5. Re: Service Units are in Terminating State (praveen malviya)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 10 Mar 2015 23:26:25 +0000
> From: "Johnson, Charles" <[email protected]>
> Subject: Re: [users] amf-adm question ...
> To: praveen malviya <[email protected]>,
>          "[email protected]"
>          <[email protected]>
> Message-ID:
>          
> <138f607845c24246a37786deda499d7a0614d...@g6w2495.americas.hpqcorp.net>
>
> Content-Type: text/plain; charset="us-ascii"
>
> So, that works and it scales out, and I encountered yet another issue ...
>
> I scaled this out on ten Ethernet-connected native nodes, running 8 to 10 
> service processes per node in 2N groups and got it working by changing the 
> xml for the payload-only nodes using your changes.
>
> Thank you very much, Praveen !!
>
> Now, I was only able to get this working, by installing all the software on 
> every node (opensaf and my services) and bringing up the vanilla imm.xml for 
> opensaf native services, and then doing the one immcfg and four  amf-adm 
> commands for each service group (42 service groups) in serial order, from a 
> generated bash script file, after opensaf was up and running.
>
> Separately, although I was able to execute the immxml-merge command 
> successfully (using the --ignore-variants flag to get rid of the errors from 
> common objects in each 2N xml file for each service) and thus produce a valid 
> and totally inclusive imm.xml for opensaf + my services (in the exact same 
> object order that the commands from the bash file would add them at runtime), 
> when I tried to bring opensaf up (sudo service opensafd start) on the two 
> controller nodes, or just one of them, it either fails to start or crashes 
> the controller node. I checked, and all the merged imm.xml files are 
> identical on all nodes, and all the software is installed identically for 
> both the unmerged and merged cases.
>
> My thought is that OpenSAF cannot orchestrate the startup of the cluster with 
> that many services (outside of its own well-orchestrated startup sequence for 
> the native opensaf service taxonomy) and gets in a traffic jam internally, 
> hangs or crashes, but does not start.
>
> There appears to have been a service called SCAP sometime in the past, where 
> you would modify a file called NCSSystemBOM.xml to add your service to get 
> started when OpenSAF first comes up, but that seems not to be the case 
> anymore. Did that framework for startup disappear, or get replaced by 
> something else?
>
> Or is there some magic thing I need to do when I do immxml-merge 
> --ignore-variants, to allow OpenSAF to come up with that merged imm.xml 
> without hanging? If you could not do that magic thing, there would be no use 
> for immxml-merge, and it would probably not exist for long, so there must be 
> a magic thing, I reckon!
>
> Charlie ...
>
> -----Original Message-----
> From: praveen malviya [mailto:[email protected]]
> Sent: Sunday, February 15, 2015 8:12 PM
> To: Johnson, Charles; [email protected]
> Subject: Re: [users] amf-adm question ...
>
>
>
> On 14-Feb-15 12:40 AM, Johnson, Charles wrote:
>>
>> Interesting. When I try to move the AmfDemo and run it from PL-4 and PL-3 
>> instead of SC-1 and SC-2, it fails to load.
>>
>> What I did was to change the AppConfig-2N.xml file doing those
>> substitutions, the text is included below (it's not long.)
>>
>> The log states that it cannot find the script which is in the same place it 
>> was on all the nodes (/opt/amf_demo/amf_demo_script), or that it is corrupt, 
>> which it is not.
>>
>> Works fine in the controller nodes, not in the payload nodes: am I missing 
>> some limitation regarding Amf?
> Please configure the attribute below in the SU object to host it on the 
> desired node. In the sample configuration this attribute is not configured.
> <attr>
>          <name>saAmfSUHostNodeOrNodeGroup</name>
>          <value>safAmfNode=SC-1,safAmfCluster=myAmfCluster</value>
> </attr>
>
> See below the configuration with changes.
>
> Thanks
> Praveen
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 13 Mar 2015 10:52:31 +0530
> From: praveen malviya <[email protected]>
> Subject: Re: [users] amf-adm question ...
> To: "Johnson, Charles" <[email protected]>,
>          "[email protected]"
>          <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
>
>
> On 11-Mar-15 4:56 AM, Johnson, Charles wrote:
>> So, that works and it scales out, and I encountered yet another issue ...
>>
>> I scaled this out on ten Ethernet-connected native nodes, running 8 to 10 
>> service processes per node in 2N groups and got it working by changing the 
>> xml for the payload-only nodes using your changes.
>>
>> Thank you very much, Praveen !!
>>
>> Now, I was only able to get this working, by installing all the software on 
>> every node (opensaf and my services) and bringing up the vanilla imm.xml for 
>> opensaf native services, and then doing the one immcfg and four  amf-adm 
>> commands for each service group (42 service groups) in serial order, from a 
>> generated bash script file, after opensaf was up and running.
>>
>> Separately, although I was able to execute the immxml-merge command 
>> successfully (using the --ignore-variants flag to get rid of the errors from 
>> common objects in each 2N xml file for each service) and thus produce a 
>> valid and totally inclusive imm.xml for opensaf + my services (in the exact 
>> same object order that the commands from the bash file would add them at 
>> runtime), when I tried to bring opensaf up (sudo service opensafd start) on 
>> the two controller nodes, or just one of them, it either fails to start or 
>> crashes the controller node. I checked, and all the merged imm.xml files are 
>> identical on all nodes, and all the software is installed identically for 
>> both the unmerged and merged cases.
>>
>> My thought is that OpenSAF cannot orchestrate the startup of the cluster 
>> with that many services (outside of its own well-orchestrated startup 
>> sequence for the native opensaf service taxonomy) and gets in a traffic jam 
>> internally, hangs or crashes, but does not start.
>>
>> There appears to have been a service called SCAP sometime in the past, where 
>> you would modify a file called NCSSystemBOM.xml to add your service to get 
>> started when OpenSAF first comes up, but that seems not to be the case 
>> anymore. Did that framework for startup disappear, or get replaced by 
>> something else?
>>
>> Or is there some magic thing I need to do when I do immxml-merge 
>> --ignore-variants, to allow OpenSAF to come up with that merged imm.xml 
>> without hanging? If you could not do that magic thing, there would be no use 
>> for immxml-merge, and it would probably not exist for long, so there must be 
>> a magic thing, I reckon!
>>
> I have never used immxml-merge --ignore-variants.
> The framework is in place. Any AMF-modeled application will come up during
> cluster start, after the cluster startup timer expires, provided all AMF model
> objects are valid.
> Please share the error messages/syslog, and the imm.xml too if possible.
>
> Thanks.
> Praveen
>> Charlie ...
>>
>> -----Original Message-----
>> From: praveen malviya [mailto:[email protected]]
>> Sent: Sunday, February 15, 2015 8:12 PM
>> To: Johnson, Charles; [email protected]
>> Subject: Re: [users] amf-adm question ...
>>
>>
>>
>> On 14-Feb-15 12:40 AM, Johnson, Charles wrote:
>>>
>>> Interesting. When I try to move the AmfDemo and run it from PL-4 and PL-3 
>>> instead of SC-1 and SC-2, it fails to load.
>>>
>>> What I did was to change the AppConfig-2N.xml file doing those
>>> substitutions, the text is included below (it's not long.)
>>>
>>> The log states that it cannot find the script which is in the same place it 
>>> was on all the nodes (/opt/amf_demo/amf_demo_script), or that it is 
>>> corrupt, which it is not.
>>>
>>> Works fine in the controller nodes, not in the payload nodes: am I missing 
>>> some limitation regarding Amf?
>> Please configure the attribute below in the SU object to host it on the 
>> desired node. In the sample configuration this attribute is not configured.
>> <attr>
>>           <name>saAmfSUHostNodeOrNodeGroup</name>
>>           <value>safAmfNode=SC-1,safAmfCluster=myAmfCluster</value>
>> </attr>
>>
>> See below the configuration with changes.
>>
>> Thanks
>> Praveen
>>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 Mar 2015 19:55:24 +0000
> From: Shu Wang <[email protected]>
> Subject: [users] Service Units are in Terminating State
> To: "[email protected]"
>          <[email protected]>
> Cc: David S Thompson <[email protected]>,
>          Lisa Ann Lentz-Liddell <[email protected]>,
>          William R Elliott <[email protected]>
> Message-ID:
>          <3bd0b3dd1eb0044ebb3242d42126621e1bb80...@planetdb3.netcracker.com>
> Content-Type: text/plain; charset="us-ascii"
>
> We have a scenario when nodes lost contact for 10 seconds and rejoined, some 
> service units ended up in Terminating state.
>
> For example, the following message was seen from /var/log/messages:
> NO Lost contact with 'appbox'
>
> We saw some service units on the same box disabled. Then we performed lock 
> and lock-in on the disabled service unit:
> amf-adm lock safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
> amf-adm lock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>
> Then we tried the following commands:
> amf-adm repaired safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
> amf-adm unlock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>
> For either repaired or unlock-in, we got the following error:
> error - command timed out (alarm)
>
> SU state stayed as:
> safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>           saAmfSUAdminState=LOCKED-INSTANTIATION(3)
>           saAmfSUOperState=ENABLED(1)
>           saAmfSUPresenceState=TERMINATING(4)
>           saAmfSUReadinessState=OUT-OF-SERVICE(1)
>
> Eventually we had to stop the node and restart the node to bring things back 
> to normal.
>
> Why disabled service unit stuck at TERMINATING state?  What made a service 
> unit stuck at TERMINATING state?
> If a node is lost for a little while, what are the effects of the node lost 
> contact in the cluster?
> How to repair the damage caused by the node lost?
>
> Thanks!
>
> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917| www.NetCracker.com
> Proven Partner to Communications Service Providers
>
>
>
>
> ________________________________
> The information transmitted herein is intended only for the person or entity 
> to which it is addressed and may contain confidential, proprietary and/or 
> privileged material. Any review, retransmission, dissemination or other use 
> of, or taking of any action in reliance upon, this information by persons or 
> entities other than the intended recipient is prohibited. If you received 
> this in error, please contact the sender and delete the material from any 
> computer.
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 20 Mar 2015 11:26:36 +0530
> From: A V Mahesh <[email protected]>
> Subject: Re: [users] Service Units are in Terminating State
> To: [email protected], [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> Hi Shu Wang,
>
> It seems you are using TCP transport; please provide the following data.
>
>
> On 3/20/2015 1:25 AM, Shu Wang wrote:
>> Then nodes lost contact for 10 seconds and rejoined
>
> Can you please share your observations on why the faulty node didn't go for
> reboot?
> Have you customized any of the /etc/opensaf/dtmd.conf configuration?
> Please share /var/log/messages from SC-1 and SC-2.
>
> <SHU> They went for reboot, but it was not successful; they eventually went 
> to disabled.
>
> We had a payload reboot and an active controller reboot, and the standby 
> controller became active.
> Then we tried to repair the cluster by stopping and starting the disabled 
> nodes.
> The payload node with the terminating SUs was not in disabled state. However, 
> stopping and starting the node got the terminating SUs back to uninstantiated 
> state, and we were able to unlock instantiation and unlock the SUs to bring 
> things back to normal.
>
> /etc/opensaf/dtmd.conf has been customized with:
> DTM_TCP_KEEPALIVE_PROBES=5
> DTM_SOCK_SND_RCV_BUF_SIZE=761856
>
> On 3/20/2015 1:25 AM, Shu Wang wrote:
>> For example, the following message was seen from /var/log/messages:
>> NO Lost contact with 'appbox'
>
> In the default configuration the `node_name` should be SC-1, SC-2, PL-3,
> etc.
> Have you customized the imm.xml with your own node names?
>
> <SHU> Of course.
>
> -AVM
>
> On 3/20/2015 1:25 AM, Shu Wang wrote:
>> We have a scenario when nodes lost contact for 10 seconds and rejoined, some 
>> service units ended up in Terminating state.
>>
>> For example, the following message was seen from /var/log/messages:
>> NO Lost contact with 'appbox'
>>
>> We saw some service units on the same box disabled. Then we performed lock 
>> and lock-in on the disabled service unit:
>> amf-adm lock safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>> amf-adm lock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>>
>> Then we tried the following commands:
>> amf-adm repaired safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>> amf-adm unlock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>>
>> For either repaired or unlock-in, we got the following error:
>> error - command timed out (alarm)
>>
>> SU state stayed as:
>> safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>>            saAmfSUAdminState=LOCKED-INSTANTIATION(3)
>>            saAmfSUOperState=ENABLED(1)
>>            saAmfSUPresenceState=TERMINATING(4)
>>            saAmfSUReadinessState=OUT-OF-SERVICE(1)
>>
>> Eventually we had to stop the node and restart the node to bring things back 
>> to normal.
>>
>> Why disabled service unit stuck at TERMINATING state?  What made a service 
>> unit stuck at TERMINATING state?
>> If a node is lost for a little while, what are the effects of the node lost 
>> contact in the cluster?
>> How to repair the damage caused by the node lost?
>>
>> Thanks!
>>
>> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917| www.NetCracker.com
>> Proven Partner to Communications Service Providers
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming The Go Parallel Website, 
>> sponsored
>> by Intel and developed in partnership with Slashdot Media, is your hub for 
>> all
>> things parallel software development, from weekly thought leadership blogs to
>> news, videos, case studies, tutorials and more. Take a look and join the
>> conversation now. http://goparallel.sourceforge.net/
>> _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Fri, 20 Mar 2015 16:44:34 +0530
> From: praveen malviya <[email protected]>
> Subject: Re: [users] Service Units are in Terminating State
> To: Shu Wang <[email protected]>,
>          "[email protected]"
>          <[email protected]>
> Cc: Lisa Ann Lentz-Liddell <[email protected]>,
>          David S Thompson <[email protected]>,
>          William R Elliott <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> Hi,
>
> Please see questions inline:
>
> On 20-Mar-15 1:25 AM, Shu Wang wrote:
>> We have a scenario when nodes lost contact for 10 seconds and rejoined, some 
>> service units ended up in Terminating state.
>>
>> For example, the following message was seen from /var/log/messages:
>> NO Lost contact with 'appbox'
>>
>> We saw some service units on the same box disabled. Then we performed lock 
>> and lock-in on the disabled service unit:
>> amf-adm lock safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>> amf-adm lock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>>
>> Then we tried the following commands:
>> amf-adm repaired safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>> amf-adm unlock-in safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>>
>> For either repaired or unlock-in, we got the following error:
>> error - command timed out (alarm)
>>
>> SU state stayed as:
>> safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
>>            saAmfSUAdminState=LOCKED-INSTANTIATION(3)
>>            saAmfSUOperState=ENABLED(1)
>>            saAmfSUPresenceState=TERMINATING(4)
>>            saAmfSUReadinessState=OUT-OF-SERVICE(1)
>>
>
> Which OpenSAF release are you using?
>
> <SHU> We use OpenSAF 4.4 RC2
The changeset corresponding to this is 5023:a4574e3da4f6.

>
> What is the recovery policy of the SU? Do you see any fault reported on
> any component of this SU by AMF in the syslog? (like SU failover?)
>
> <SHU> It is SU failover.
With su-failover as the recovery policy, an SU can get stuck in the 
TERMINATING state because of issue #1017. Issue #1017 is a duplicate of 
#1015, and both were reported on 4.5.FC (changeset 5596:b507ef74cc6e).
#1015 was fixed in 4.4 with the following changeset:

changeset: 5883:616211c61bab
branch: opensaf-4.4.x
parent: 5880:68d02af29aaa
user: [email protected]
date: Mon Sep 22 15:37:32 2014 +0530
summary: amfnd : do not send susi success response during su-failover [#1015]

So the #1015 patch is not present in 4.4 RC2.
If the issue is reproducible, please test it with the #1015 patch applied.
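
When re-testing, it may help to check mechanically whether any SU is still 
stuck in TERMINATING. Below is a minimal sketch, assuming the state dump has 
the same layout as the one quoted earlier in this thread (an SU DN line 
followed by indented NAME=STATE(n) lines). The function names 
parse_su_states and stuck_terminating are hypothetical helpers, not part of 
OpenSAF, and the script only parses text; it does not call any OpenSAF tool.

```python
import re

# Sample dump copied from the SU state quoted in this thread.
# Assumption: real dumps follow the same "DN line + indented states" layout.
SAMPLE = """\
safSu=amfSU2.1,safSg=amfSG2,safApp=myApp
        saAmfSUAdminState=LOCKED-INSTANTIATION(3)
        saAmfSUOperState=ENABLED(1)
        saAmfSUPresenceState=TERMINATING(4)
        saAmfSUReadinessState=OUT-OF-SERVICE(1)
"""

# Matches an indented line such as "saAmfSUPresenceState=TERMINATING(4)".
STATE_LINE = re.compile(r"^\s+(\w+)=([A-Z-]+)\((\d+)\)\s*$")


def parse_su_states(text):
    """Return {su_dn: {attribute_name: state_name}} parsed from a dump."""
    states = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = STATE_LINE.match(line)
        if match and current is not None:
            states[current][match.group(1)] = match.group(2)
        elif line.startswith("safSu="):
            # An unindented DN line starts a new SU record.
            current = line.strip()
            states[current] = {}
    return states


def stuck_terminating(states):
    """Return the sorted DNs whose presence state is TERMINATING."""
    return sorted(
        dn
        for dn, attrs in states.items()
        if attrs.get("saAmfSUPresenceState") == "TERMINATING"
    )


if __name__ == "__main__":
    for dn in stuck_terminating(parse_su_states(SAMPLE)):
        print(dn)
```

Running it against a dump captured before and after applying the patch 
would show whether the same SUs are still reported.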

Thanks,
Praveen
>
> Also note that link flaps are not supported.
> Assuming that this is a scenario where the link is brought down (such as
> interface down or cable unplugged, all leading to loss of the socket
> connection with that node), the ideal behaviour is that this node leaves
> the cluster and cannot rejoin without a restart of OpenSAF (including network
> connection establishment).
>
>
> Thanks,
> Praveen
>
>> Eventually we had to stop the node and restart the node to bring things back 
>> to normal.
>>
>> Why disabled service unit stuck at TERMINATING state?  What made a service 
>> unit stuck at TERMINATING state?
>> If a node is lost for a little while, what are the effects of the node lost 
>> contact in the cluster?
>> How to repair the damage caused by the node lost?
>>
>> Thanks!
>>
>> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917| www.NetCracker.com
>> Proven Partner to Communications Service Providers
>>
>>
>>
>>
>>
>
>
>
> ------------------------------
>
>
>
> End of Opensaf-users Digest, Vol 22, Issue 2
> ********************************************
>
>
>
