Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

praveen malviya Thu, 21 Apr 2016 00:28:24 -0700

Hi Minh,

Return code ERR_UNAVAILABLE is not an indication to a client that the
node has lost CLM membership, because the same return code is given to a
stale client when the node becomes a member again.
Also, when the node loses membership and a client gets ERR_UNAVAILABLE,
it will finalize all its handles. After finalizing, the client again
needs an indication that the node has rejoined the membership (it cannot
retry sa<*>Initialize() in a while loop on ERR_UNAVAILABLE). The client
can get such an indication only if it is also a CLM client using the
tracker interface. The tracker interface APIs also work on non-member
nodes, but there they give only local node information, and that is all
such a client needs. So upon receiving the CLM callback for the local
node joining, this client will call sa<*>Initialize(), and this call
will succeed. In this way, I think, for a normal cluster it is the
responsibility of the client process to detect the node membership
status by also becoming a CLM client.
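
A minimal sketch of the client-side flow described above, written as a
small state machine. The names used here (ntf_client_t, on_ntf_rc,
on_clm_local_node, RC_*) are hypothetical stand-ins, not SAF types or
calls; the comments mark where sa<*>Finalize() and sa<*>Initialize()
would go. It assumes the client learns about the local node joining
through a CLM track callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not SAF types. */
typedef enum { RC_OK, RC_UNAVAILABLE } rc_t;
typedef enum { NTF_ACTIVE, NTF_FINALIZED_WAITING } client_state_t;

typedef struct {
    client_state_t state;
    int init_calls;      /* how many times the simulated initialize ran */
    int finalize_calls;  /* how many times the simulated finalize ran */
} ntf_client_t;

/* Feed in the return code of any NTF API call made on the handle. */
void on_ntf_rc(ntf_client_t *c, rc_t rc)
{
    assert(c != NULL);
    if (rc == RC_UNAVAILABLE && c->state == NTF_ACTIVE) {
        c->finalize_calls++;              /* stands in for sa<*>Finalize() */
        c->state = NTF_FINALIZED_WAITING; /* no retry loop: wait for CLM */
    }
}

/* Feed in the local node's membership as reported by the CLM callback. */
void on_clm_local_node(ntf_client_t *c, int is_member)
{
    assert(c != NULL);
    if (is_member && c->state == NTF_FINALIZED_WAITING) {
        c->init_calls++;                  /* stands in for sa<*>Initialize() */
        c->state = NTF_ACTIVE;
    }
}
```

The point of the two-state design is that the client never polls
sa<*>Initialize(); it re-initializes only when CLM reports the local
node as a member.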

In headless state, the CLM membership status of client nodes is not
remembered by the directors, and they will have to rely on new CLM
callbacks after the first controller comes up. At the same time, a CLM
client will also get BAD_HANDLE after the first controller comes up.
Considering this situation, the attached patch 1744_addon.patch gives a
dummy event to the client so that it can call saNtfDispatch(). It will
work for both headless and non-headless clusters. But this topic can be
revisited a) in 5.1, when all services rely only on the CLM indication,
or b) when we have more clarity on the CLM status of nodes during
headless.
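
To illustrate why a dummy event helps, here is a minimal sketch using a
plain pipe as the mailbox. It is not the ntfa mailbox implementation;
mailbox_t and the mailbox_* functions are hypothetical names, and only
pipe(), write() and poll() are real POSIX calls. The read end plays the
role of the selection object the application polls on.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical mailbox: a pipe standing in for the agent's mailbox. */
typedef struct {
    int rd_fd;  /* polled by the application */
    int wr_fd;  /* written by the agent */
} mailbox_t;

/* Create the pipe; returns 0 on success, -1 on failure. */
int mailbox_open(mailbox_t *mb)
{
    int fds[2];
    assert(mb != NULL);
    if (pipe(fds) != 0)
        return -1;
    mb->rd_fd = fds[0];
    mb->wr_fd = fds[1];
    return 0;
}

/* Agent side: post a one-byte dummy event so a blocked poll() returns. */
int mailbox_post_dummy(const mailbox_t *mb)
{
    const char dummy = 'x';
    return write(mb->wr_fd, &dummy, 1) == 1 ? 0 : -1;
}

/* Application side: 1 if an event is pending, 0 on timeout, -1 on error. */
int mailbox_wait(const mailbox_t *mb, int timeout_ms)
{
    struct pollfd pfd = { .fd = mb->rd_fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

Without the dummy byte, a subscriber whose filter matches nothing stays
in poll() and never reaches the dispatch call that would report
ERR_UNAVAILABLE.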

The attached patch is on top of the #1744 patch, and it also fixes
ntfsubscribe to call saNtfFinalize() and exit on receiving
ERR_UNAVAILABLE. I would like to push 1744 before RC2.


Thanks,
Praveen




On 21-Apr-16 5:38 AM, minh chau wrote:

Hi Praveen,

Would you consider a quick patch that posts a dummy callback to the
client's mailbox after the agent detects that it is a non-member, so
that the NTF client can finalize its handle right after that? Otherwise,
per your explanation below, there will be an implicit dependency of the
NTF user on AMF or CLM in this case, and that should be documented.


Thanks,
Minh

On 14/04/16 07:01, minh chau wrote:



On 13/04/16 15:43, praveen malviya wrote:



On 12-Apr-16 10:24 PM, minh chau wrote:



On 12/04/16 21:49, praveen malviya wrote:



On 12-Apr-16 3:56 PM, minh chau wrote:

Hi Praveen

The NTF server also accepts an initialize request (here it comes from
reinitializeClient() after headless) if the NTF server has not
initialized with CLM.
So after headless, this situation will most likely happen. The recovery
would succeed, but after that, what if the NTF server notifies the agent
that it is no longer a member? Could a subscriber be waiting for
notifications while the agent is not a member anymore?

There is only one event that can lead to this, and that is OpenSAF stop
on the node, as admin operations are not available in headless state.
But this is a limitation of the whole headless solution in every
service, as there is no recovery of the CLM status of the client node at
each director, and recovery of clients is also done very early, at the
MDS up event of the service.

[Minh] Actually, this situation also happens in non-headless. While a
client is subscribed for notifications, lock a CLM node. This client
will not be informed of the error code SA_AIS_ERR_UNAVAILABLE if its
filter does not match any notification. It has to wait until the CLM
node is unlocked and a notification arrives, at which point
saNtfDispatch will return SA_AIS_ERR_UNAVAILABLE. But if the filter does
not match, this client will keep waiting and cannot finalize its handle.
If this situation is solved in non-headless, the headless problem stated
above should also be solved by the same solution.

[Praveen] Not only in NTFSv: the same logic of waiting for an event to
get unblocked from poll() applies to applications of all the other
services as well, since all SAF services are integrated with CLMSv. I do
not know whether one should poll indefinitely or not, and, in the case
of a finite poll time, what an application must do after the poll times
out.

But I think, from the SAF perspective, this still cannot be classified
as a problem. The reason is that any such application's life cycle is
monitored by AMF, and AMF terminates such a process as part of CLM node
eviction. Also, CLM provides the tracker interface for exactly this
purpose.
At the same time, I have observed that for ERR_UNAVAILABLE the AMF spec
is particularly clear, as it states in section 7.2.1 on page 243:
================================
However, there are a few special situations in which processes may
call Availability Management Framework API functions.
• An Availability Management Framework API function is called by a
process nearly at the same time when the node exits the cluster and
the Availability Management Framework area server on the node has not
yet terminated the process.
..........
=================================
And for the above-mentioned cases AMF will return ERR_UNAVAILABLE. So it
seems ERR_UNAVAILABLE is meant for such special cases. So any
application must rely on its own subscription to CLMSv, or the admin
will have to take care of this.
I will check other SAF documents, like the C programming doc and the
overview doc, to see if something in this context is mentioned.

[Minh] I think an application can be purely an NTF client, which does
not have to initialize with AMF, or maybe I don't understand your idea.
Let's look at this example: run a subscriber with filter "ABC", lock the
CLM node, then unlock the CLM node again. Then some application in the
cluster raises notification ABC.
With the current implementation, this subscriber gets ERR_UNAVAILABLE
when notification ABC arrives in its mailbox, so it ends up losing
notification ABC.
But if NTF reported ERR_UNAVAILABLE right after the CLM node was locked,
this subscriber could finalize its NTF handle earlier. It could then
wait somehow until the CLM node is unlocked again, or it could
initialize with CLMSv to learn when the node becomes a member again.
After the CLM unlock in the example above, this subscriber is ready to
receive notifications, and when notification ABC comes, the subscriber
can receive it. I guess this is the idea mentioned in the NTF spec:

"If the cluster node rejoins the cluster membership, processes
executing on the cluster node will be able to reinitialize new library
handles and use the entire set of Notification Service APIs that
operate on these new handles; however, invocation of APIs that operate
on handles acquired by any process before the cluster node left the
membership will continue to fail with SA_AIS_ERR_UNAVAILABLE with the
exception of saNtfFinalize(), which is used to free the library
handles and all resources associated with these handles. Hence, it is
recommended for the processes to finalize the library handles as soon
as the processes detect that the cluster node left the membership."

Thanks,
Minh

Another issue, not related to this ticket, is that ntftool does not
handle SA_AIS_ERR_UNAVAILABLE. ntfsubscribe goes into an indefinite loop
calling saNtfDispatch() when it receives SA_AIS_ERR_UNAVAILABLE.

[Praveen] I will fix this as a part of #1745.


Thanks,
Praveen

Thanks,
Minh


Thanks,
Praveen

Thanks,
Minh

On 11/04/16 15:46, praveen.malv...@oracle.com wrote:

 osaf/libs/agents/saf/ntfa/ntfa_api.c |  28 ++++++++++++++++++----------
  1 files changed, 18 insertions(+), 10 deletions(-)


During headless state, OpenSAF may get stopped on payload with NTF app
running.
Since OpenSAF is not running on the payload, any A.01.02 NTF client
should not be served on this node and this client should not be
recovered. After first controller comes up, A.01.02 client will not be
recovered and application will get SA_AIS_ERR_UNAVAILABLE upon which an
app can call saNtfFinalize() for freeing the resources.
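
The return-code handling that the diff below introduces can be
summarized in a small sketch. The enum and function names here are
hypothetical and exist only for illustration; the real code works on
SaAisErrorT inside the recovery and API functions. Per the hunks below,
a failed recovery keeps ERR_UNAVAILABLE as is, maps every other failure
to BAD_HANDLE, and only BAD_HANDLE makes the agent delete the handle
record itself.

```c
#include <assert.h>

/* Hypothetical codes for illustration; real values come from SaAisErrorT. */
typedef enum {
    EX_OK,
    EX_ERR_BAD_HANDLE,
    EX_ERR_UNAVAILABLE,
    EX_ERR_OTHER   /* any other failure reported by the server */
} ex_rc_t;

/* Mirrors the recovery hunks: keep OK and UNAVAILABLE, map the rest. */
ex_rc_t map_recovery_rc(ex_rc_t server_rc)
{
    if (server_rc == EX_OK || server_rc == EX_ERR_UNAVAILABLE)
        return server_rc;
    return EX_ERR_BAD_HANDLE;
}

/* Mirrors the API hunks: 1 if the agent force-deletes the handle record,
 * 0 if the handle is kept so the application can finalize it. */
int agent_deletes_handle(ex_rc_t recovery_rc)
{
    return recovery_rc == EX_ERR_BAD_HANDLE;
}
```

With this split, a client on a non-member node still holds a handle it
can pass to saNtfFinalize(), which matches the intent stated in the
commit message.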

diff --git a/osaf/libs/agents/saf/ntfa/ntfa_api.c b/osaf/libs/agents/saf/ntfa/ntfa_api.c
--- a/osaf/libs/agents/saf/ntfa/ntfa_api.c
+++ b/osaf/libs/agents/saf/ntfa/ntfa_api.c
@@ -966,7 +966,8 @@ SaAisErrorT reinitializeClient(ntfa_clie
      }
      if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
          TRACE("info.api_resp_info.rc:%u", o_msg->info.api_resp_info.rc);
-        rc = SA_AIS_ERR_BAD_HANDLE;
+        if (rc != SA_AIS_ERR_UNAVAILABLE)
+            rc = SA_AIS_ERR_BAD_HANDLE;
          goto done;
      }
@@ -1033,7 +1034,8 @@ SaAisErrorT recoverReader(ntfa_client_hd
      osafassert(o_msg != NULL);
      if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
          TRACE("o_msg->info.api_resp_info.rc:%u", o_msg->info.api_resp_info.rc);
-        rc = SA_AIS_ERR_BAD_HANDLE;
+        if (rc != SA_AIS_ERR_UNAVAILABLE)
+            rc = SA_AIS_ERR_BAD_HANDLE;
          goto done;
      }
@@ -1108,7 +1110,8 @@ SaAisErrorT recoverSubscriber(ntfa_clien
      if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
          TRACE("o_msg->info.api_resp_info.rc:%u", o_msg->info.api_resp_info.rc);
-        rc = SA_AIS_ERR_BAD_HANDLE;
+        if (rc != SA_AIS_ERR_UNAVAILABLE)
+            rc = SA_AIS_ERR_BAD_HANDLE;
          goto done;
      }
@@ -1437,7 +1440,7 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
      if (!hdl_rec->valid) {
          /* recovery */
          if ((rc = recoverClient(hdl_rec)) != SA_AIS_OK) {
-            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE)) {
+            if (rc == SA_AIS_ERR_BAD_HANDLE) {
                  ncshm_give_hdl(ntfHandle);
                  osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
                  ntfa_hdl_rec_force_del(&ntfa_cb.client_list, hdl_rec);
@@ -1445,6 +1448,11 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
                  ntfa_shutdown(false);
                  goto done;
              }
+            if (rc == SA_AIS_ERR_UNAVAILABLE) {
+                TRACE("Node not CLM member or stale client");
+                ncshm_give_hdl(ntfHandle);
+                goto done;
+            }
          }
      }
@@ -1807,7 +1815,7 @@ SaAisErrorT saNtfNotificationSend(SaNtfN
          if ((rc = recoverClient(client_rec)) != SA_AIS_OK) {
              ncshm_give_hdl(client_handle);
              ncshm_give_hdl(notificationHandle);
-            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE)) {
+            if (rc == SA_AIS_ERR_BAD_HANDLE) {
                  osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
                  ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_rec);
                  osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
@@ -2153,7 +2161,7 @@ SaAisErrorT saNtfNotificationSubscribe(c
          if (notificationFilterHandles->alarmFilterHandle)
              ncshm_give_hdl(notificationFilterHandles->alarmFilterHandle);
      }
-    if (recovery_failed && ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE))) {
+    if (recovery_failed && (rc == SA_AIS_ERR_BAD_HANDLE)) {
          osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
          ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
          osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
@@ -3355,7 +3363,7 @@ SaAisErrorT saNtfNotificationUnsubscribe
      if (!client_hdl_rec->valid && getServerState() == NTFA_NTFSV_UP) {
          if ((rc = recoverClient(client_hdl_rec)) != SA_AIS_OK) {
-            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE)) {
+            if (rc == SA_AIS_ERR_BAD_HANDLE) {
                  ncshm_give_hdl(ntfHandle);
                  osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
                  ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
@@ -3517,7 +3525,7 @@ done_give_client_hdl:
      }
ncshm_give_hdl(notificationFilterHandles->alarmFilterHandle);
-    if (recovery_failed && ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE))) {
+    if (recovery_failed && (rc == SA_AIS_ERR_BAD_HANDLE)) {
          osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
          ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
          osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
@@ -3621,7 +3629,7 @@ SaAisErrorT saNtfNotificationReadFinaliz
      if (!client_hdl_rec->valid && getServerState() == NTFA_NTFSV_UP) {
          if ((rc = recoverClient(client_hdl_rec)) != SA_AIS_OK) {
-            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE)) {
+            if (rc == SA_AIS_ERR_BAD_HANDLE) {
                  ncshm_give_hdl(client_hdl_rec->local_hdl);
                  ncshm_give_hdl(readhandle);
                  osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
@@ -3699,7 +3707,7 @@ SaAisErrorT saNtfNotificationReadNext(Sa
          if ((rc = recoverClient(client_hdl_rec)) != SA_AIS_OK) {
              ncshm_give_hdl(client_hdl_rec->local_hdl);
              ncshm_give_hdl(readHandle);
-            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == SA_AIS_ERR_UNAVAILABLE)) {
+            if (rc == SA_AIS_ERR_BAD_HANDLE) {
                  osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
                  ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
                  osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);

diff --git a/osaf/libs/agents/saf/ntfa/ntfa.h b/osaf/libs/agents/saf/ntfa/ntfa.h
--- a/osaf/libs/agents/saf/ntfa/ntfa.h
+++ b/osaf/libs/agents/saf/ntfa/ntfa.h
@@ -196,4 +196,5 @@ extern void ntfa_update_ntfsv_state(ntfa
 extern SaAisErrorT ntfa_copy_ntf_filter_ptrs(ntfsv_filter_ptrs_t* pDes,
                                                                const ntfsv_filter_ptrs_t* pSrc);
 extern SaAisErrorT ntfa_del_ntf_filter_ptrs(ntfsv_filter_ptrs_t* filter_ptrs);
+extern void ntfa_notify_handle_invalid(); 
 #endif   /* !NTFA_H */
diff --git a/osaf/libs/agents/saf/ntfa/ntfa_mds.c b/osaf/libs/agents/saf/ntfa/ntfa_mds.c
--- a/osaf/libs/agents/saf/ntfa/ntfa_mds.c
+++ b/osaf/libs/agents/saf/ntfa/ntfa_mds.c
@@ -292,8 +292,10 @@ uint32_t ntfa_ntfs_msg_proc(ntfa_cb_t *c
                                         return NCSCC_RC_FAILURE;
                                 }
                                //A client becomes stale if Node loses CLM Membership.
-                               if (cb->clm_node_state != SA_CLM_NODE_JOINED)
+                               if (cb->clm_node_state != SA_CLM_NODE_JOINED) {
                                        ntfa_hdl_rec->is_stale_client = true;
+                                       ntfa_notify_handle_invalid();
+                               }
                                ntfa_msg_destroy(ntfsv_msg);
                        }
                        break;
diff --git a/osaf/tools/safntf/ntfsubscribe/ntfsubscribe.c b/osaf/tools/safntf/ntfsubscribe/ntfsubscribe.c
--- a/osaf/tools/safntf/ntfsubscribe/ntfsubscribe.c
+++ b/osaf/tools/safntf/ntfsubscribe/ntfsubscribe.c
@@ -152,6 +152,15 @@ static SaAisErrorT waitForNotifications(
 
                        if (error != SA_AIS_OK)
                                fprintf(stderr, "ntftool_saNtfDispatch Error %d\n", error);
+                       if (error == SA_AIS_ERR_UNAVAILABLE) {
+                               fprintf(stderr, "Node lost CLM membership, finalizing ntfHandle.\n");
+                               error = saNtfFinalize(ntfHandle);
+                               if (error != SA_AIS_OK) {
+                                       fprintf(stderr, "saNtfFinalize failed - %d\n", error);
+                                       exit(EXIT_FAILURE);
+                               }
+                               _Exit(0);
+                       }
                }
                if ((fds[FD_TERM].revents & POLLIN) || (fds[FD_INT].revents & POLLIN)) {


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
