Hi Jim,

>>Can you clarify your suggestion

Admin operations on a node group are an extension feature on top of the AMF specification. The feature was implemented in the OpenSAF 4.6 release. Since you are using OpenSAF 5.2.0, it is available in your deployed systems. It was implemented to cater for scale-in/scale-out scenarios in the cloud, where it is desirable to shut down or start multiple nodes in one shot, based on fluctuations in resource demand.

Please download "OpenSAF_AMF_PR.odt" from https://sourceforge.net/p/opensaf/documentation/ci/default/tree/ and refer to Section 2.2.10 of the AMF programmer's reference, as pointed out by Gary.

How to perform admin operations on a node group?

Steps to create a node group and perform admin operations on it:

Step #1: Create a node group object (mygroup) containing the two nodes PL-3 and PL-4 with the following command (in your case there will be 4 nodes: PL-3, PL-4, PL-5 and PL-6):

    immcfg -c SaAmfNodeGroup -a saAmfNGNodeList="safAmfNode=PL-3,safAmfCluster=myAmfCluster" -a saAmfNGNodeList="safAmfNode=PL-4,safAmfCluster=myAmfCluster" safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster

The creation of the node group object and its contents can be validated with the following command:

    immlist safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster

    Name                      Type         Value(s)
    ========================================================================
    safAmfNodeGroup           SA_STRING_T  safAmfNodeGroup=mygroup
    saAmfNGNodeList           SA_NAME_T    safAmfNode=PL-3,safAmfCluster=myAmfCluster (42) safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)
    saAmfNGAdminState         SA_UINT32_T  1 (0x1)
    SaImmAttrImplementerName  SA_STRING_T  safAmfService
    SaImmAttrClassName        SA_STRING_T  SaAmfNodeGroup
    SaImmAttrAdminOwnerName   SA_STRING_T  <Empty>

Step #2: Perform admin operation.
Perform the lock operation with the following command:

    amf-adm lock safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster

The success of the operation can be validated by checking saAmfNGAdminState, which should show the locked (2) state:

    immlist safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster

    Name                      Type         Value(s)
    ========================================================================
    safAmfNodeGroup           SA_STRING_T  safAmfNodeGroup=mygroup
    saAmfNGNodeList           SA_NAME_T    safAmfNode=PL-3,safAmfCluster=myAmfCluster (42) safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)
    saAmfNGAdminState         SA_UINT32_T  2 (0x2)
    SaImmAttrImplementerName  SA_STRING_T  safAmfService
    SaImmAttrClassName        SA_STRING_T  SaAmfNodeGroup
    SaImmAttrAdminOwnerName   SA_STRING_T  <Empty>

Lock-in and the other operations can then be performed as below:

    amf-adm lock-in safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster
    amf-adm unlock-in safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster
    amf-adm unlock safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster

Step #3: Delete the node group once the admin operations are complete (or keep it for further admin operations):

    immcfg -d safAmfNodeGroup=mygroup,safAmfCluster=myAmfCluster

Note: The admin state of the nodes in the node group is the same as that of the node group, i.e. after the lock operation in Step #2 both the node group and its nodes are in the locked state, and after the lock-in operation both are in the locked-in state. Admin operations can also be performed on an individual node to change the admin state of that node, as below:

    amf-adm unlock-in safAmfNode=PL-3,safAmfCluster=myAmfCluster
    amf-adm unlock safAmfNode=PL-3,safAmfCluster=myAmfCluster
    amf-adm unlock-in safAmfNode=PL-4,safAmfCluster=myAmfCluster
    amf-adm unlock safAmfNode=PL-4,safAmfCluster=myAmfCluster

>>Can you clarify the OpenSAF behavior for the following scenario

Based on the information provided, I think each of the nodes B, C, D and E hosts at least one SU of a common SG.

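If the shutdowns have to stay as separate per-node commands, the TRY_AGAIN handling suggested earlier in this thread could be scripted roughly as below. This is only a minimal sketch under assumptions: it assumes amf-adm exits non-zero and prints text containing "TRY_AGAIN" when AMF rejects the request, and the function names, the AMF_ADM override and the retry defaults are hypothetical, not taken from OpenSAF.

```shell
#!/bin/sh
# Sketch: retry an AMF admin operation while it keeps returning TRY_AGAIN.
# Assumption (not confirmed in this thread): the admin command exits
# non-zero and prints text containing "TRY_AGAIN" when AMF rejects it.

AMF_ADM="${AMF_ADM:-amf-adm}"        # admin CLI; can be overridden for testing
RETRY_DELAY="${RETRY_DELAY:-1}"      # seconds between attempts (hypothetical default)
MAX_ATTEMPTS="${MAX_ATTEMPTS:-30}"   # give up after this many tries (hypothetical default)

# retry_admin_op OPERATION DN
# Returns 0 on success, 1 on any other error or when the attempts run out.
retry_admin_op() {
    op=$1
    dn=$2
    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        # Capture stdout and stderr so the error text can be inspected.
        if output=$("$AMF_ADM" "$op" "$dn" 2>&1); then
            return 0
        fi
        case $output in
            *TRY_AGAIN*)
                # AMF is still busy with an earlier operation; wait and retry.
                sleep "$RETRY_DELAY"
                ;;
            *)
                # Any other failure is reported and not retried.
                printf '%s\n' "$output" >&2
                return 1
                ;;
        esac
        attempt=$((attempt + 1))
    done
    echo "giving up on '$op $dn' after $MAX_ATTEMPTS attempts" >&2
    return 1
}

# shutdown_nodes NODE...
# Shuts the given nodes down one after another, stopping at the first failure.
shutdown_nodes() {
    for node in "$@"; do
        retry_admin_op shutdown "safAmfNode=$node,safAmfCluster=myAmfCluster" || return 1
    done
}
```

With these functions loaded, the four payload nodes could be shut down with:

    shutdown_nodes PL-3 PL-4 PL-5 PL-6

The nodes are handled one at a time, so each shutdown only starts after the previous one has been accepted.
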
I have tried to explain the 'TRY_AGAIN return scenario' with an animation in the attached ppt. Hope this helps. Let me know if you have any follow-up questions.

Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
https://www.linkedin.com/company/hasolutions/
High Availability Solutions Pvt. Ltd.
- We provide OpenSAF support and services

--------- Original Message ---------
Subject: RE: EXTERNAL: RE: [users] IMM "Try Again" for Admin Commands - help - clarification
From: "Carroll, James R" <[email protected]>
Date: 7/7/18 12:31 am
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>

Hi Nagendra,

Thank you so much for your informative response. I do have some follow up questions, however.

Can you clarify your suggestion: “You can also think about performing node group operation”. According to the AMF specification B.04.01, Section 8.7, “No administrative operations are defined for a node group”. I am not sure how this can be used to resolve sending admin commands to the individual nodes.

Can you clarify the OpenSAF behavior for the following scenario:

Controller Node A needs to send the Admin Command “Shutdown” to the Payload Nodes, in the following order:

    i.   Payload Node B - Admin Shutdown
    ii.  Payload Node C - Admin Shutdown
    iii. Payload Node D - Admin Shutdown
    iv.  Payload Node E - Admin Shutdown

Our EXPECTATION of the above scenario: The sending of an ADMIN command to Node B is independent of Nodes C, D, or E. Therefore, all 4 commands should be issued by OpenSAF in parallel, with no dependency between nodes. Therefore, none of the Shutdown commands should be responded to with TRY_AGAIN.

Our OBSERVATION appears to be showing: OpenSAF issues the Admin Command to Node B. Then, the commands to Nodes C, D, and E will not execute until Node B has completed. In other words, it appears to be sequentially dependent.
Nodes C, D, and E are getting TRY_AGAIN. Once Node B is done, then Node C begins shutting down, and Nodes D and E get TRY_AGAIN. And so on for the remaining nodes.

Thanks.
Jim

From: [email protected] <[email protected]>
Sent: Friday, July 06, 2018 1:25 AM
To: Carroll, James R (US) <[email protected]>; [email protected]
Subject: EXTERNAL: RE: [users] IMM "Try Again" for Admin Commands - help

Hi Jim,

The following are the most probable reasons for getting TRY_AGAIN for node admin operations (node lock/shutdown). I assume the components/applications are SA-Aware (if not, the equivalent actions from Amf can be correlated).

For node lock/shutdown admin operations:
- The components/applications receiving quiescing/quiesced/removed callbacks are taking time to respond to Amf.
- The components/applications receiving Active callbacks on another node (because a lock was issued on the current node and there was a Standby Service Unit on another node) are taking time to respond to Amf.

Until the components/applications respond to the Amf callbacks, Amf will return TRY_AGAIN for further admin operations on the node. This is expected behavior: until one admin operation has completed on the entities, another admin operation cannot be accepted (so the admin operations get TRY_AGAIN).

Suggestion:
Step 1: Debug the application response times.
Step 2: If the application is taking time to respond for genuine reasons, you can have a script perform the admin operations; the script should handle TRY_AGAIN.
Step 3: You can also think about performing node group operation.

Please find some point-by-point responses:

>>need to understand why the IMM is busy.

In my understanding, Imm is not busy; rather, Amf is not getting callback responses from the applications and is returning TRY_AGAIN to Imm, which in turn returns TRY_AGAIN to the applications issuing the admin operation.

>>how long to wait until the operations can be performed.
Until all the callbacks have been responded to, the admin operations will return TRY_AGAIN.

>>Is this a known and documented issue?

The specifications define that a Service returns TRY_AGAIN if the operation can't be accepted at that time.

>>Is it possible that this issue has been addressed in a later release that we
>>can capture?

This behavior of returning TRY_AGAIN is the same in all the releases.

>>Are there any accepted practices or guidelines on how to deal with this
>>condition?

As suggested in Step 2, you can sleep for some milliseconds/microseconds when you get TRY_AGAIN and then invoke the admin operation again from your script or from the application issuing the admin operations.

Hope that helps.

Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
High Availability Solutions Pvt. Ltd.
- High Availability Solutions Provider.

--------- Original Message ---------
Subject: [users] IMM "Try Again" for Admin Commands - help
From: "Carroll, James R" <[email protected]>
Date: 7/5/18 9:40 pm
To: "[email protected]" <[email protected]>

All,

We are using OpenSAF 5.2.0, and are experiencing issues with Admin commands to perform NODE operations. We are getting multiple responses of TRY_AGAIN, and need to understand why the IMM is busy, and how long to wait until the operations can be performed.

Some background for the Admin commands being performed. We have a single controller node, and 4 payload nodes. In our current configuration, the OpenSAF controller node is only housing OpenSAF daemons; there are no user developed applications running on the controller node. In addition, we have all 4 payload nodes up and running essentially idle, with minimal load.

We issue an ADMIN command to shut down each of the Payload nodes (the controller node is unaffected). Each of the admin commands responds with TRY_AGAIN. And then we have to wait arbitrary times, then try again, until the IMM accepts the command, for each node.
In our view of this scenario, these are near-perfect conditions for OpenSAF: the controller has its own node, and the system is fully idle. Yet we continue to re-issue the ADMIN command, and we get a response of busy, try again. Eventually, each command is accepted (one for each payload node), and then we can issue the Lock Instantiation.

Note - we have also tried scenarios using the LOCK and LOCK_Instantiate sequence, instead of SHUTDOWN, and see similar behavior.

Is this a known and documented issue?
Is it possible that this issue has been addressed in a later release that we can capture?
Are there any accepted practices or guidelines on how to deal with this condition?

Thank you.
Jim

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
