Re: [users] EXTERNAL: RE: IMM "Try Again" for Admin Commands - help - clarification

Carroll, James R Fri, 06 Jul 2018 12:02:10 -0700

Hi Nagendra,

Thank you so much for your informative response.  I do have some follow up 
questions, however.



  1.  Can you clarify your suggestion: “You can also think about performing 
node group operation”?  According to the AMF specification B.04.01, Section 
8.7, “No administrative operations are defined for a node group”.  I am not 
sure how this can be used to resolve sending admin commands to the individual 
nodes.
  2.  Can you clarify the OpenSAF behavior for the following scenario:
     *   Controller Node A needs to send the Admin Command “Shutdown” to the 
Payload Nodes, in the following order:
           i.   Payload Node B - Admin Shutdown
          ii.   Payload Node C - Admin Shutdown
         iii.   Payload Node D - Admin Shutdown
          iv.   Payload Node E - Admin Shutdown
     *   Our EXPECTATION of the above scenario: The sending of an ADMIN command 
to Node B is independent of Nodes C, D, and E.  Therefore, all 4 commands 
should be issued by OpenSAF in parallel, with no dependency between nodes, 
and none of the Shutdown commands should be answered with TRY_AGAIN.
     *   Our OBSERVATION appears to show: OpenSAF issues the Admin Command to 
Node B.  Then the commands to Nodes C, D, and E will not execute until Node 
B has completed.  In other words, the commands appear to be sequentially 
dependent.  Nodes C, D, and E get TRY_AGAIN.  Once Node B is done, Node C 
begins shutting down, and Nodes D and E get TRY_AGAIN.  And so on for the 
remaining nodes.
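
To make the observation concrete, here is a toy model in Python of what we 
appear to be seeing. It only restates the observation above and is not 
OpenSAF's actual logic: ToyAmf, admin_shutdown, complete and run_rounds are 
hypothetical names, and the "one operation in flight at a time" rule is an 
assumption drawn from our observation.

```python
class ToyAmf:
    """Toy model (assumption): accepts one node admin operation at a time."""

    def __init__(self):
        self.busy_node = None  # node whose shutdown is still in progress

    def admin_shutdown(self, node):
        # While one shutdown is in flight, every other request is refused.
        if self.busy_node is not None:
            return "TRY_AGAIN"
        self.busy_node = node
        return "OK"

    def complete(self):
        """Mark the in-flight operation as finished."""
        self.busy_node = None


def run_rounds(nodes):
    """Re-issue shutdown to all pending nodes until every one is accepted.

    Returns one {node: reply} dict per round of requests.
    """
    amf = ToyAmf()
    pending = list(nodes)
    rounds = []
    while pending:
        replies = {node: amf.admin_shutdown(node) for node in pending}
        rounds.append(replies)
        # Nodes that were refused must be retried in the next round.
        pending = [n for n in pending if replies[n] == "TRY_AGAIN"]
        amf.complete()  # the accepted node finishes before the next round
    return rounds
```

Calling run_rounds(["B", "C", "D", "E"]) takes four rounds in this model, with 
exactly one node accepted per round, which matches the sequence we observe.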

Thanks.

Jim



From: [email protected] <[email protected]>
Sent: Friday, July 06, 2018 1:25 AM
To: Carroll, James R (US) <[email protected]>; 
[email protected]
Subject: EXTERNAL: RE: [users] IMM "Try Again" for Admin Commands - help

Hi Jim,

The following are the most probable reasons for getting TRY_AGAIN for node 
admin operations (node lock/shutdown). I assume the components/applications are 
SA-Aware (if not, the equivalent actions from Amf can be correlated).

For node lock/shutdown admin operations:
- The components/applications receiving quiescing/quiesced/removed callbacks 
are taking time to respond to Amf.
- The components/applications receiving Active callbacks on another node 
(because a lock was issued on the current node and there was a Standby Service 
Unit on another node) are taking time to respond to Amf.

Until the components/applications respond to the Amf callbacks, Amf will 
return TRY_AGAIN for further admin operations on the node.

This is expected behavior: until one admin operation has succeeded on the 
entities, another admin operation can't be accepted, so for some more time 
the admin operations get TRY_AGAIN.

Suggestion:
Step 1: Debug the application response times.
Step 2: If the application is taking time to respond for genuine reasons, 
then you can have a script perform the admin operations; the script should 
handle TRY_AGAIN.
Step 3: You can also think about performing a node group operation.
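
A minimal sketch of the TRY_AGAIN handling mentioned in Step 2, written in 
Python. Everything named here is an assumption for illustration and not part 
of OpenSAF: invoke is a hypothetical callable that issues one admin operation 
(by whatever command or API call you already use) and returns a status string, 
the strings "TRY_AGAIN" and "OK" stand in for SA_AIS_ERR_TRY_AGAIN and 
SA_AIS_OK, and the attempt limit and delays are arbitrary values to tune.

```python
import time

TRY_AGAIN = "TRY_AGAIN"  # stand-in for SA_AIS_ERR_TRY_AGAIN (assumption)
OK = "OK"                # stand-in for SA_AIS_OK (assumption)


def retry_admin_op(invoke, max_attempts=20, initial_delay=0.1,
                   max_delay=2.0, sleep=time.sleep):
    """Call invoke() until it stops returning TRY_AGAIN or attempts run out.

    invoke is a hypothetical zero-argument callable that issues one admin
    operation and returns its status as a string.
    Returns (last_status, attempts_made).
    """
    delay = initial_delay
    status = TRY_AGAIN
    for attempt in range(1, max_attempts + 1):
        status = invoke()
        if status != TRY_AGAIN:
            # Accepted, or failed for some other reason: stop retrying.
            return status, attempt
        if attempt < max_attempts:
            sleep(delay)
            # Back off so a slow callback response is not hammered.
            delay = min(delay * 2, max_delay)
    return status, max_attempts
```

The sleep parameter is injectable only so that the loop can be exercised 
without real waiting. The cap on attempts matters: if a callback is never 
answered, the script gives up and reports TRY_AGAIN instead of looping forever.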

Please find some point-to-point responses:
>>need to understand why the IMM is busy.
In my understanding, Imm is not busy; rather, Amf is not getting callback 
responses from the applications, so Amf returns TRY_AGAIN to Imm, which in 
turn returns TRY_AGAIN to the applications issuing the admin operation.
>>how long to wait until the operations can be performed.
Until all the callbacks have been responded to, the admin operations will 
return TRY_AGAIN.
>>Is this a known and documented issue?
The Specifications define that a Service returns TRY_AGAIN if the operation 
can't be accepted at that time.
>>Is it possible that this issue has been addressed in a later release that we 
>>can capture?
This behavior of returning TRY_AGAIN is the same in all the releases.
>>Are there any accepted practices or guidelines on how to deal with this 
>>condition?
As suggested in Step 2, you can sleep for some milliseconds or microseconds 
when you get TRY_AGAIN and then call the admin operation again from your script 
or from the applications issuing admin operations.

Hope that helps.


Thanks,
Nagendra, 91-9866424860
www.hasolutions.in<http://www.hasolutions.in>
High Availability Solutions Pvt. Ltd.
 - High Availability Solutions Provider.





--------- Original Message ---------
Subject: [users] IMM "Try Again" for Admin Commands - help
From: "Carroll, James R" <[email protected]>
Date: 7/5/18 9:40 pm
To: "[email protected]" <[email protected]>

All,

We are using OpenSAF 5.2.0, and are experiencing issues with Admin commands to 
perform NODE operations. We are getting multiple responses of TRY_AGAIN, and 
need to understand why the IMM is busy, and how long to wait until the 
operations can be performed.

Some background for the Admin commands being performed. We have a single 
controller node, and 4 payload nodes. In our current configuration, the OpenSAF 
controller node houses only OpenSAF daemons; there are no user-developed 
applications running on the controller node. In addition, we have all 4 payload 
nodes up and running essentially idle, with minimal load. We issue an ADMIN 
command to shut down each of the Payload nodes (the controller node is 
unaffected). Each of the admin commands responds with TRY_AGAIN. And then we 
have to wait arbitrary times, then try again, until the IMM accepts the 
command, for each node. In our view of this scenario, these are near-perfect 
conditions for OpenSAF: the controller has its own node, and the system is 
fully idle. Yet we continue to re-issue the ADMIN command, and we get a 
response of busy, try again. Eventually, each command is accepted (one for each 
payload node), and then we can issue the Lock Instantiation. Note - we have 
also tried scenarios using the LOCK, and LOCK_Instantiate sequence, instead of 
SHUTDOWN, and see similar behavior.

Is this a known and documented issue? Is it possible that this issue has been 
addressed in a later release that we can capture?
Are there any accepted practices or guidelines on how to deal with this 
condition?

Thank you.

Jim

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-users mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/opensaf-users