Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

minh chau Thu, 21 Apr 2016 01:41:16 -0700

Hi,

The addon patch at least lets an existing NTF subscriber finalize quickly,
as soon as the node becomes a non-member, instead of being informed late,
only after the node rejoins the cluster, as with the current #1744. The
next step, how to detect that the NTF service is available again, depends
on whether the client is a pure NTF application (like ntftool) or a SAF
application.
So it's an ack from me for #1744 + addon patch.


Thanks,
Minh
On 21/04/16 17:27, praveen malviya wrote:
> Hi Minh,
>
> Return code ERR_UNAVAILABLE does not tell a client that the node has
> lost CLM membership, because the same return code is given to a stale
> client when the node becomes a member again.
> Also, when the node loses membership and a client gets ERR_UNAVAILABLE,
> it will finalize all handles. After finalizing, the client again needs
> an indication that the node has rejoined the membership (it cannot keep
> retrying sa<*>Initialize() in a while loop on ERR_UNAVAILABLE). The
> client can get such an indication only when it is also a CLM client
> using the tracker interface. The tracker interface APIs also work on
> non-member nodes, but there they give only local node information, and
> such a client needs only that much. So upon receiving the CLM callback
> for the local node joining, this client will call sa<*>Initialize()
> and this call will succeed. In this way, I think, in a normal cluster
> it is the responsibility of the client process to detect node
> membership status by also becoming a CLM client.
>
> In headless state, the CLM membership status of client nodes is not
> remembered by directors, and they will have to rely on new CLM callbacks
> after the first controller comes up. At the same time, a CLM client will
> also get BAD_HANDLE after the first controller comes up. Considering
> this situation, the attached patch 1744_addon.patch gives a dummy event
> to the client so that it can call saNtfDispatch(). It will work for both
> headless and non-headless clusters. But this topic can be revisited a)
> in 5.1, when all services rely only on CLM indication, or b) when we
> have more clarity on the CLM status of nodes during headless.
>
> The attached patch is on top of the #1744 patch, and it also fixes
> ntfsubscribe to call saNtfFinalize() and exit on receiving
> ERR_UNAVAILABLE. I would like to push 1744 before RC2.
>
> Thanks,
> Praveen
>
>
>
>
> On 21-Apr-16 5:38 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Would you consider a quick patch that posts a dummy callback to the
>> client's mailbox after the agent detects it is a non-member, so that the
>> NTF client can finalize its handle right after that? Otherwise, as in
>> your explanation below, there will be an implicit dependency of the NTF
>> user on AMF or CLM in this case, and that should be documented.
>
>>
>> Thanks,
>> Minh
>>
>> On 14/04/16 07:01, minh chau wrote:
>>>
>>>
>>> On 13/04/16 15:43, praveen malviya wrote:
>>>>
>>>>
>>>> On 12-Apr-16 10:24 PM, minh chau wrote:
>>>>>
>>>>>
>>>>> On 12/04/16 21:49, praveen malviya wrote:
>>>>>>
>>>>>>
>>>>>> On 12-Apr-16 3:56 PM, minh chau wrote:
>>>>>>> Hi Praveen
>>>>>>>
>>>>>>> The NTF server also accepts an initialize request (here it comes
>>>>>>> from reinitializeClient() after headless) if the NTF server has not
>>>>>>> initialized with CLM.
>>>>>>> So after headless, this situation will most likely happen. The
>>>>>>> recovery would succeed, but what if the NTF server then notifies
>>>>>>> the agent that it is no longer a member? Could a subscriber be
>>>>>>> waiting for notifications while the agent is not a member anymore?
>>>>>>>
>>>>>> There is only one event that can lead to this, and that is OpenSAF
>>>>>> stop on the node, as admin operations are not available in headless
>>>>>> state. But this is a limitation of the whole headless solution in
>>>>>> every service, as there is no recovery of the CLM status of the
>>>>>> client node at each director, and recovery of clients is done very
>>>>>> early, at the MDS up event of the service.
>>>>>>
>>>>> [Minh] Actually, this situation also happens in non-headless. While a
>>>>> client is subscribed for notifications, lock a CLM node. This client
>>>>> will not be informed of the error code SA_AIS_ERR_UNAVAILABLE if its
>>>>> filter does not match any notification. It has to wait until the CLM
>>>>> node is unlocked and a notification arrives, so that saNtfDispatch()
>>>>> returns SA_AIS_ERR_UNAVAILABLE. But if the filter does not match, this
>>>>> client will keep waiting and can't finalize its handle.
>>>>> If this situation is solved in non-headless, the problem stated above
>>>>> in headless should also be solved by the same solution.
>>>>>
>>>> [Praveen] Not only in NTFSv: the same logic of waiting for an event to
>>>> get unblocked from poll() applies to applications of all the other
>>>> services too, as all SAF services are integrated with CLMSv. I do not
>>>> know whether one should poll indefinitely or not, and, with a finite
>>>> poll time, what an application must do after the poll times out.
>>>>
>>>> But I think that, from the SAF perspective, this still cannot be
>>>> classified as a problem. The reason is that the life cycle of any such
>>>> application is monitored by AMF, and AMF terminates such a process as
>>>> part of CLM node eviction. Also, CLM provides the tracker interface for
>>>> exactly this purpose.
>>>> At the same time, I have observed that the AMF spec is particularly
>>>> clear about ERR_UNAVAILABLE, as it states in section 7.2.1 on page 243:
>>>> ================================
>>>> However, there are a few special situations in which processes may
>>>> call Availability Management Framework API functions.
>>>> • An Availability Management Framework API function is called by a
>>>> process nearly at the same time when the node exits the cluster and
>>>> the Availability Management Framework area server on the node has not
>>>> yet terminated the process.
>>>> ..........
>>>> =================================
>>>> And for the above-mentioned cases AMF will return ERR_UNAVAILABLE. So
>>>> it seems ERR_UNAVAILABLE is meant for such special cases. So any
>>>> application must rely on its own subscription to CLMSv, or the admin
>>>> will have to take care of this.
>>>> I will check whether other SAF documents, like the C programming doc
>>>> and the overview doc, mention anything in this context.
>>> [Minh] I think an application can be purely an NTF client that does
>>> not have to initialize with AMF, or maybe I don't understand your idea.
>>> Let's look at this example: run a subscriber with filter "ABC", lock
>>> the CLM node, then unlock the CLM node again. Then some application in
>>> the cluster raises notification ABC.
>>> With the current implementation, this subscriber gets
>>> ERR_UNAVAILABLE when notification ABC arrives in its mailbox, and thus
>>> it eventually loses notification ABC.
>>> But if NTF reported ERR_UNAVAILABLE right after the CLM node is locked,
>>> this subscriber could finalize its NTF handle earlier. It can somehow
>>> wait until the CLM node is unlocked again, or it can initialize with
>>> CLMSv to know when the node becomes a member again. After the CLM
>>> unlock in the example above, this subscriber is ready to receive
>>> notifications, and when notification ABC comes, the subscriber can
>>> receive it. And I guess this is the idea mentioned in the NTF spec:
>>>
>>> "If the cluster node rejoins the cluster membership, processes
>>> executing on the cluster node will be able to reinitialize new library
>>> handles and use the entire set of Notification Service APIs that
>>> operate on these new handles; however, invocation of APIs that operate
>>> on handles acquired by any process before the cluster node left the
>>> membership will continue to fail with SA_AIS_ERR_UNAVAILABLE with the
>>> exception of saNtfFinalize(), which is used to free the library
>>> handles and all resources associated with these handles. Hence, it is
>>> recommended for the processes to finalize the library handles as soon
>>> as the processes detect that the cluster node left the membership."
>>>
>>> Thanks,
>>> Minh
>>>>
>>>>
>>>>> Another issue, not related to this ticket: ntftool does not handle
>>>>> SA_AIS_ERR_UNAVAILABLE. ntfsubscriber gets into an indefinite loop
>>>>> calling saNtfDispatch() when it receives SA_AIS_ERR_UNAVAILABLE.
>>>>>
>>>> [Praveen] I will fix this as a part of #1745.
>>>>
>>>>
>>>> Thanks,
>>>> Praveen
>>>>> Thanks,
>>>>> Minh
>>>>>>
>>>>>> Thanks,
>>>>>> Praveen
>>>>>>> Thanks,
>>>>>>> Minh
>>>>>>>
>>>>>>> On 11/04/16 15:46, praveen.malv...@oracle.com wrote:
>>>>>>>> osaf/libs/agents/saf/ntfa/ntfa_api.c |  28
>>>>>>>> ++++++++++++++++++----------
>>>>>>>>   1 files changed, 18 insertions(+), 10 deletions(-)
>>>>>>>>
>>>>>>>>
>>>>>>>> During headless state, OpenSAF may get stopped on a payload with an
>>>>>>>> NTF app running.
>>>>>>>> Since OpenSAF is not running on the payload, any A.01.02 NTF client
>>>>>>>> should not be served on this node and this client should not be
>>>>>>>> recovered. After the first controller comes up, the A.01.02 client
>>>>>>>> will not be recovered and the application will get
>>>>>>>> SA_AIS_ERR_UNAVAILABLE, upon which an app can call saNtfFinalize()
>>>>>>>> to free the resources.
>>>>>>>>
>>>>>>>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>>>>>>> b/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>>>>>>> --- a/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>>>>>>> +++ b/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>>>>>>> @@ -966,7 +966,8 @@ SaAisErrorT reinitializeClient(ntfa_clie
>>>>>>>>       }
>>>>>>>>       if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>>>>>>>>           TRACE("info.api_resp_info.rc:%u",
>>>>>>>> o_msg->info.api_resp_info.rc);
>>>>>>>> -        rc = SA_AIS_ERR_BAD_HANDLE;
>>>>>>>> +        if (rc != SA_AIS_ERR_UNAVAILABLE)
>>>>>>>> +            rc = SA_AIS_ERR_BAD_HANDLE;
>>>>>>>>           goto done;
>>>>>>>>       }
>>>>>>>> @@ -1033,7 +1034,8 @@ SaAisErrorT recoverReader(ntfa_client_hd
>>>>>>>>       osafassert(o_msg != NULL);
>>>>>>>>       if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>>>>>>>> TRACE("o_msg->info.api_resp_info.rc:%u",
>>>>>>>> o_msg->info.api_resp_info.rc);
>>>>>>>> -        rc = SA_AIS_ERR_BAD_HANDLE;
>>>>>>>> +        if (rc != SA_AIS_ERR_UNAVAILABLE)
>>>>>>>> +            rc = SA_AIS_ERR_BAD_HANDLE;
>>>>>>>>           goto done;
>>>>>>>>       }
>>>>>>>> @@ -1108,7 +1110,8 @@ SaAisErrorT recoverSubscriber(ntfa_clien
>>>>>>>>       if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>>>>>>>> TRACE("o_msg->info.api_resp_info.rc:%u",
>>>>>>>> o_msg->info.api_resp_info.rc);
>>>>>>>> -        rc = SA_AIS_ERR_BAD_HANDLE;
>>>>>>>> +        if (rc != SA_AIS_ERR_UNAVAILABLE)
>>>>>>>> +            rc = SA_AIS_ERR_BAD_HANDLE;
>>>>>>>>           goto done;
>>>>>>>>       }
>>>>>>>> @@ -1437,7 +1440,7 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
>>>>>>>>       if (!hdl_rec->valid) {
>>>>>>>>           /* recovery */
>>>>>>>>           if ((rc = recoverClient(hdl_rec)) != SA_AIS_OK) {
>>>>>>>> -            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE)) {
>>>>>>>> +            if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>>>>                   ncshm_give_hdl(ntfHandle);
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> ntfa_hdl_rec_force_del(&ntfa_cb.client_list, hdl_rec);
>>>>>>>> @@ -1445,6 +1448,11 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
>>>>>>>>                   ntfa_shutdown(false);
>>>>>>>>                   goto done;
>>>>>>>>               }
>>>>>>>> +            if (rc == SA_AIS_ERR_UNAVAILABLE) {
>>>>>>>> +                TRACE("Node not CLM member or stale client");
>>>>>>>> +                ncshm_give_hdl(ntfHandle);
>>>>>>>> +                goto done;
>>>>>>>> +            }
>>>>>>>>           }
>>>>>>>>       }
>>>>>>>> @@ -1807,7 +1815,7 @@ SaAisErrorT saNtfNotificationSend(SaNtfN
>>>>>>>>           if ((rc = recoverClient(client_rec)) != SA_AIS_OK) {
>>>>>>>>               ncshm_give_hdl(client_handle);
>>>>>>>>               ncshm_give_hdl(notificationHandle);
>>>>>>>> -            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE)) {
>>>>>>>> +            if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> ntfa_hdl_rec_force_del(&ntfa_cb.client_list,
>>>>>>>> client_rec);
>>>>>>>> osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) ==
>>>>>>>> 0);
>>>>>>>> @@ -2153,7 +2161,7 @@ SaAisErrorT saNtfNotificationSubscribe(c
>>>>>>>>           if (notificationFilterHandles->alarmFilterHandle)
>>>>>>>>
>>>>>>>> ncshm_give_hdl(notificationFilterHandles->alarmFilterHandle);
>>>>>>>>       }
>>>>>>>> -    if (recovery_failed && ((rc == SA_AIS_ERR_BAD_HANDLE) || 
>>>>>>>> (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE))) {
>>>>>>>> +    if (recovery_failed && (rc == SA_AIS_ERR_BAD_HANDLE)) {
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
>>>>>>>> osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> @@ -3355,7 +3363,7 @@ SaAisErrorT saNtfNotificationUnsubscribe
>>>>>>>>       if (!client_hdl_rec->valid && getServerState() ==
>>>>>>>> NTFA_NTFSV_UP) {
>>>>>>>>           if ((rc = recoverClient(client_hdl_rec)) != SA_AIS_OK) {
>>>>>>>> -            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE)) {
>>>>>>>> +            if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>>>>                   ncshm_give_hdl(ntfHandle);
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> ntfa_hdl_rec_force_del(&ntfa_cb.client_list,
>>>>>>>> client_hdl_rec);
>>>>>>>> @@ -3517,7 +3525,7 @@ done_give_client_hdl:
>>>>>>>>       }
>>>>>>>> ncshm_give_hdl(notificationFilterHandles->alarmFilterHandle);
>>>>>>>> -    if (recovery_failed && ((rc == SA_AIS_ERR_BAD_HANDLE) || 
>>>>>>>> (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE))) {
>>>>>>>> +    if (recovery_failed && (rc == SA_AIS_ERR_BAD_HANDLE)) {
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
>>>>>>>> osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> @@ -3621,7 +3629,7 @@ SaAisErrorT saNtfNotificationReadFinaliz
>>>>>>>>       if (!client_hdl_rec->valid && getServerState() ==
>>>>>>>> NTFA_NTFSV_UP) {
>>>>>>>>           if ((rc = recoverClient(client_hdl_rec)) != SA_AIS_OK) {
>>>>>>>> -            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE)) {
>>>>>>>> +            if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>>>> ncshm_give_hdl(client_hdl_rec->local_hdl);
>>>>>>>>                   ncshm_give_hdl(readhandle);
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> @@ -3699,7 +3707,7 @@ SaAisErrorT saNtfNotificationReadNext(Sa
>>>>>>>>           if ((rc = recoverClient(client_hdl_rec)) != SA_AIS_OK) {
>>>>>>>> ncshm_give_hdl(client_hdl_rec->local_hdl);
>>>>>>>>               ncshm_give_hdl(readHandle);
>>>>>>>> -            if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc ==
>>>>>>>> SA_AIS_ERR_UNAVAILABLE)) {
>>>>>>>> +            if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>>>>>>>> ntfa_hdl_rec_force_del(&ntfa_cb.client_list,
>>>>>>>> client_hdl_rec);
>>>>>>>> osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) ==
>>>>>>>> 0);
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
