Hi Nagendra,
Thank you so much for your informative response. I do have some follow-up
questions, however.
1. Can you clarify your suggestion: “You can also think about performing
node group operation”. According to the AMF specification B.04.01, Section
8.7, “No administrative operations are defined for a node group”. I am not
sure how this can be used to resolve sending admin commands to the individual
nodes.
2. Can you clarify the OpenSAF behavior for the following scenario:
* Controller Node A needs to send the Admin Command “Shutdown” to the Payload
Nodes, in the following order:
i. Payload Node B – Admin Shutdown
ii. Payload Node C – Admin Shutdown
iii. Payload Node D – Admin Shutdown
iv. Payload Node E – Admin Shutdown
* Our EXPECTATION for the above scenario: sending an ADMIN command to Node B
is independent of Nodes C, D, and E. Therefore, all 4 commands should be
issued by OpenSAF in parallel, with no dependency between nodes, and none of
the Shutdown commands should be answered with TRY_AGAIN.
* Our OBSERVATION appears to show: OpenSAF issues the Admin Command to Node B.
The commands to Nodes C, D, and E will not execute until Node B has
completed. In other words, the commands appear to be sequentially dependent.
Nodes C, D, and E get TRY_AGAIN. Once Node B is done, Node C begins shutting
down, and Nodes D and E get TRY_AGAIN, and so on for the remaining nodes.
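To make the observation concrete, here is a toy model of what we appear to be
seeing. It is only an illustration: the class ToyAdminGate and its methods are
hypothetical, not OpenSAF code or APIs, and the model simply assumes that a
single admin operation can be in progress at any time.

```python
from typing import List, Optional, Tuple

OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


class ToyAdminGate:
    """Toy model (assumption): only one admin operation is in progress at a time."""

    def __init__(self) -> None:
        self.in_progress: Optional[str] = None

    def shutdown(self, node: str) -> str:
        # Any request that arrives while another operation is running is rejected.
        if self.in_progress is not None:
            return TRY_AGAIN
        self.in_progress = node
        return OK

    def complete(self, node: str) -> None:
        # Models the point where the node's shutdown has finished.
        if self.in_progress == node:
            self.in_progress = None


def demo() -> List[Tuple[str, str]]:
    gate = ToyAdminGate()
    log = []
    # First pass: all four commands are sent back to back.
    for node in ["B", "C", "D", "E"]:
        log.append((node, gate.shutdown(node)))
    # Node B finishes, then the remaining commands are re-issued.
    gate.complete("B")
    for node in ["C", "D", "E"]:
        log.append((node, gate.shutdown(node)))
    return log
```

In this model only Node B is accepted on the first pass, and after B completes
only Node C is accepted, which matches the pattern described above.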
Thanks.
Jim
From: [email protected] <[email protected]>
Sent: Friday, July 06, 2018 1:25 AM
To: Carroll, James R (US) <[email protected]>;
[email protected]
Subject: EXTERNAL: RE: [users] IMM "Try Again" for Admin Commands - help
Hi Jim,
The following are the most probable reasons for getting TRY_AGAIN for node
admin operations (node lock/shutdown). I assume the components/applications are
SA-Aware (if not, the equivalent actions from Amf can be correlated).
For node lock/shutdown admin operations:
- The components/applications receiving quiescing/quiesced/removed callbacks
are taking time to respond to Amf.
- The components/applications receiving Active callbacks on another node
(because a lock was issued on the current node and there was a Standby Service
Unit on another node) are taking time to respond to Amf.
Until the components/applications respond to the Amf callbacks, Amf will
return TRY_AGAIN for further admin operations on the node.
This is expected behavior: until one admin operation on the entities has
completed, another admin operation can't be accepted (so the admin operations
get TRY_AGAIN).
Suggestion:
Step 1: Debug the application response times.
Step 2: If the application is taking time to respond for genuine reasons, then
you can have a script perform the admin operations; the script should handle
TRY_AGAIN.
Step 3: You can also think about performing node group operation.
Please find some point-to-point responses:
>>need to understand why the IMM is busy.
In my understanding, Imm is not busy; rather, Amf is not getting callback
responses from the applications and is returning TRY_AGAIN to Imm, which in
turn returns TRY_AGAIN to the applications issuing the admin operation.
>>how long to wait until the operations can be performed.
Until all the callbacks have been responded to, the admin operations will
return TRY_AGAIN.
>>Is this a known and documented issue?
The Specifications define that a Service returns TRY_AGAIN if the operation
can't be accepted at that time.
>>Is it possible that this issue has been addressed in a later release that we
>>can capture?
This behavior of returning TRY_AGAIN is the same in all the releases.
>>Are there any accepted practices or guidelines on how to deal with this
>>condition?
As suggested in Step 2, you can sleep for a few milli/microseconds when you
get TRY_AGAIN and then call the admin operation again in your script or in the
applications issuing admin operations.
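A rough sketch of such a retry loop in Python, under stated assumptions:
retry_admin_op and its parameters are hypothetical names, op stands for
whatever call or script step issues the admin operation and returns an
SaAisErrorT value, and the delays and timeout are placeholders to tune.

```python
import time
from typing import Callable

# Numeric values as defined for SaAisErrorT in the SA Forum specifications.
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6


def retry_admin_op(
    op: Callable[[], int],
    delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call op() until it returns something other than TRY_AGAIN or time runs out.

    op is a hypothetical wrapper around the real admin operation; it is not an
    OpenSAF API. The last return code is handed back to the caller either way.
    """
    deadline = clock() + timeout_s
    while True:
        rc = op()
        if rc != SA_AIS_ERR_TRY_AGAIN:
            return rc  # accepted, or failed for a reason retrying won't fix
        if clock() >= deadline:
            return rc  # give up, still TRY_AGAIN
        sleep(delay_s)
        # Back off so a slow application is not hammered with requests.
        delay_s = min(delay_s * 2, max_delay_s)
```

The backoff and the overall timeout are design choices of this sketch, not
something required by Amf; a fixed short sleep would also work.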
Hope that helps.
Thanks,
Nagendra, 91-9866424860
www.hasolutions.in
High Availability Solutions Pvt. Ltd.
- High Availability Solutions Provider.
--------- Original Message ---------
Subject: [users] IMM "Try Again" for Admin Commands - help
From: "Carroll, James R" <[email protected]>
Date: 7/5/18 9:40 pm
To: "[email protected]" <[email protected]>
All,
We are using OpenSAF 5.2.0, and are experiencing issues with Admin commands to
perform NODE operations. We are getting multiple responses of TRY_AGAIN, and
need to understand why the IMM is busy, and how long to wait until the
operations can be performed.
Some background for the Admin commands being performed. We have a single
controller node, and 4 payload nodes. In our current configuration, OpenSAF
controller node is only housing OpenSAF daemons, there are no user developed
applications running on the controller node. In addition, we have all 4 payload
nodes up and running essentially idle, with minimal load. We issue an ADMIN
command to shut down each of the Payload nodes (the controller node is
unaffected). Each of the admin commands responds with TRY_AGAIN. And then we
have to wait arbitrary times, then try again, until the IMM accepts the
command, for each node. In our view of this scenario, these are near-perfect
conditions for OpenSAF: the controller has its own node, and the system is
fully idle. Yet we continue to re-issue the ADMIN command, and we get a
response of busy, try again. Eventually, each command is accepted (one for each
payload node), and then we can issue the Lock Instantiation. Note - we have
also tried scenarios using the LOCK, and LOCK_Instantiate sequence, instead of
SHUTDOWN, and see similar behavior.
Is this a known and documented issue? Is it possible that this issue has been
addressed in a later release that we can capture?
Are there any accepted practices or guidelines on how to deal with this
condition?
Thank you.
Jim
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users