Responses inline, marked [ll].
Thanks.

-----Original Message-----
From: Nguyen Minh Vu [mailto:vu.m.ngu...@dektech.com.au]
Sent: Tuesday, September 10, 2019 12:53 AM
To: William R Elliott <william.elli...@netcracker.com>; 
opensaf-users@lists.sourceforge.net; Lisa Ann Lentz-Liddell 
<lisa.a.lentz-lidd...@netcracker.com>; David S Thompson 
<david.thomp...@netcracker.com>
Subject: Re: [users] Issues concerning opensaf with TCP

Hi,

Please see my responses to questions #3 and #4.

Regards, Vu

On 9/7/19 4:07 AM, William R Elliott wrote:
> Hello,
>
> We are using OpenSAF version 5.1.0.  We have a cluster using TCP as the 
> transport mechanism, with the OpenSAF multicast feature enabled.
> We would appreciate answers to the following questions:
>
> 1.       Please provide a link or any document that gives details on how the 
> OpenSAF MDS layer works.
>
> 2.       osafntfd ER ntfs_mds_msg_send FAILED.  A trace of the problem follows:
>
> Sep 3 19:57:26.107676 osafntfd [11558:NtfClient.cc:0202] << notificationReceived
> Sep 3 19:57:26.107679 osafntfd [11558:NtfClient.cc:0147] >> notificationReceived: 108 2
> Sep 3 19:57:26.107685 osafntfd [11558:NtfFilter.cc:0464] >> checkFilter
> Sep 3 19:57:26.107711 osafntfd [11558:ntfsv_mem.c:0769] >> ntfsv_get_ntf_header
> Sep 3 19:57:26.107721 osafntfd [11558:ntfsv_mem.c:0790] << ntfsv_get_ntf_header
> Sep 3 19:57:26.107726 osafntfd [11558:NtfFilter.cc:0071] T8 numNotificationClassIds: 0
> Sep 3 19:57:26.107729 osafntfd [11558:NtfFilter.cc:0056] T8 num EventTypes: 1
> Sep 3 19:57:26.107732 osafntfd [11558:NtfFilter.cc:0060] T2 EventTypes matches
> Sep 3 19:57:26.107735 osafntfd [11558:NtfFilter.cc:0187] T8 num notificationObjects: 0
> Sep 3 19:57:26.107738 osafntfd [11558:NtfFilter.cc:0202] T8 num NotifyingObjects: 0
> Sep 3 19:57:26.107741 osafntfd [11558:NtfFilter.cc:0223] T2 hdfilter matches
> Sep 3 19:57:26.107745 osafntfd [11558:NtfFilter.cc:0087] T8 numSi: 0
> Sep 3 19:57:26.107748 osafntfd [11558:NtfFilter.cc:0471] << checkFilter
> Sep 3 19:57:26.107751 osafntfd [11558:NtfClient.cc:0184] T2 NtfClient::notificationReceived notification 723 matches subscription 0, client 108
> Sep 3 19:57:26.107756 osafntfd [11558:NtfNotification.cc:0105] T1 Subscription 0 added to list in notification 723 client 108, subscriptionList size is 1
> Sep 3 19:57:26.107761 osafntfd [11558:NtfSubscription.cc:0211] >> sendNotification
> Sep 3 19:57:26.107764 osafntfd [11558:NtfSubscription.cc:0222] T3 send_notification_lib called, client 108, notification 723
> Sep 3 19:57:26.107768 osafntfd [11558:ntfs_com.c:0284] >> send_notification_lib
> Sep 3 19:57:26.107771 osafntfd [11558:ntfsv_mem.c:0769] >> ntfsv_get_ntf_header
> Sep 3 19:57:26.107774 osafntfd [11558:ntfsv_mem.c:0790] << ntfsv_get_ntf_header
> Sep 3 19:57:26.107777 osafntfd [11558:ntfs_com.c:0286] T3 client id: 108, not_id: 723
> Sep 3 19:57:26.107781 osafntfd [11558:mds_c_sndrcv.c:0396] >> mds_send
> Sep 3 19:57:26.107785 osafntfd [11558:mds_c_sndrcv.c:0403] << mds_send
> Sep 3 19:57:26.107788 osafntfd [11558:mds_c_sndrcv.c:0681] >> mds_mcm_send
> Sep 3 19:57:26.107791 osafntfd [11558:mds_c_sndrcv.c:0916] >> mcm_pvt_normal_svc_snd
> Sep 3 19:57:26.107794 osafntfd [11558:mds_c_sndrcv.c:0956] >> mcm_pvt_normal_snd_process_common
> Sep 3 19:57:26.107800 osafntfd [11558:mds_c_sndrcv.c:1699] >> mds_mcm_process_disc_queue_checks
> Sep 3 19:57:26.107804 osafntfd [11558:mds_c_sndrcv.c:1740] TR in else if sub_info->tmr_flag !- true
> Sep 3 19:57:26.107813 osafntfd [11558:mds_c_sndrcv.c:1747] TR MDS_SND_RCV:Subscription exists but no timer running
> Sep 3 19:57:26.107816 osafntfd [11558:mds_c_sndrcv.c:1749] TR MDS_SND_RCV :L mds_mcm_process_disc_queue_checks
> Sep 3 19:57:26.107819 osafntfd [11558:mds_c_sndrcv.c:1750] << mds_mcm_process_disc_queue_checks
> Sep 3 19:57:26.107900 osafntfd [11558:mds_c_sndrcv.c:1048] << mcm_pvt_normal_snd_process_common
> Sep 3 19:57:26.107937 osafntfd [11558:mds_c_sndrcv.c:0937] >> mcm_pvt_normal_svc_snd
> Sep 3 19:57:26.107941 osafntfd [11558:mds_c_sndrcv.c:0846] << mds_mcm_send
> Sep 3 19:57:26.108141 osafntfd [11558:ntfs_mds.c:1290] ER ntfs_mds_msg_send FAILED
> Sep 3 19:57:26.108160 osafntfd [11558:ntfs_com.c:0308] ER ntfs_mds_msg_send to ntfa failed rc: 2
> Sep 3 19:57:26.108165 osafntfd [11558:NtfNotification.cc:0142] T1 Removing subscription 0 client 108 from notification 723, subscriptionList size is 0
> Sep 3 19:57:26.108169 osafntfd [11558:ntfs_com.c:0503] >> sendNotConfirmUpdate: client: 108, subId: 0, notId: 723
>
> a.      The traces show that a notification is received for client id 108, and 
> then an mds_send is attempted for the same client id, but it fails because there 
> is no timer running.
> b.      What does a client represent?  An OpenSAF process?  An SU?  A component?
> c.      What is the purpose of sending a message back after receipt of the 
> notification?  Since it is not sent, it is discarded and does not seem to have 
> any impact on the cluster.
> d.      Hardcoded timers are defined in mds_main.c:
>
> uint32_t MDS_QUIESCED_TMR_VAL = 80;
> uint32_t MDS_AWAIT_ACTIVE_TMR_VAL = 18000;
> uint32_t MDS_SUBSCRIPTION_TMR_VAL = 500;
> uint32_t MDTM_REASSEMBLE_TMR_VAL = 500;
> uint32_t MDTM_CACHED_EVENTS_TMR_VAL = 24000;
>
> Could each one of these be explained? Can any of these be increased? If yes, 
> what effect would that have?
>
>
> 3.       Are there limitations on the size of a cluster, the number of SGs, the 
> number of SUs, or the number of components per SU?  Testing shows that as the 
> number of SUs/components increases, errors when starting the cluster and nodes 
> appear and increase.  Some errors seen from different starts:
>
> Sep  6 16:16:09 host--s1-h2 osafimmnd[3119]: NO ERR_BAD_HANDLE: Admin owner 19 does not exist
>
> Sep  6 17:57:33 host--s1-h1 osafimmnd[2238]: WA MDS Send Failed to service:IMMND rc:2
> Sep  6 17:57:33 host--s1-h1 osafimmnd[2238]: ER Problem in sending to peer IMMND over MDS. Discarding admin op reply.
> Sep  6 17:57:33 host--s1-h1 osafimmnd[2238]: WA Error code 2 returned for message type 21 - ignoring
>
> a.      Some of our SUs have more than 30 components.
> b.      All of the components in the cluster perform an IMM search on specific 
> SUs to learn the state of certain 2N SGs, so that they can work with the active 
> SU of those SGs.
> c.      All components register for notifications so that they can react to 
> the 2N SG state changes.
[Vu] No, there are no limitations on such numbers.

Regarding the IMM syslog above, it indicates that the peer IMMND was restarted or 
that there was an issue with network connectivity. You should check mds.log for 
more information.

[ll]  As the number of components/SUs increases, the instability of the cluster increases.
The mds.log contains a little more detail (the node id), but other than that it 
is not helpful:

Aug 20 20:01:49.457513 osafntfd[73828] ERR  |MDS_SND_RCV:No Route Found from svc_id = NTFS(28), to svc_id = NTFA(29) on Adest = <0x0002140f, 60207>
Aug 20 20:01:49.457590 osafntfd[73828] ERR  |MDS_SND_RCV: Normal send Message sent Failed from svc_id = NTFS(28), to svc_id = NTFA(29)
Aug 20 20:01:49.457612 osafntfd[73828] ERR  |MDS_SND_RCV: Adest=<0x0002140f,60207>

After turning on debug, we could see that the code took the path where the timer 
for the svc id had expired, and that is why the message was not sent.  How would 
one understand that from the above output?  The log messages contain too little 
detail to truly understand what the problem is, and debug cannot be left on in a 
production environment, so determining the root cause is very difficult when the 
messages contain no detail about the actual root of the problem.  There is a 
single failure value, which could mean many different paths in the code.  Could 
that be expanded to actually contain the real reason for the failure (e.g., for 
the above, the message should state that the timer had expired)?
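
Purely to illustrate the kind of detail we mean, a reason code carried next to 
the single failure value would make the syslog actionable.  The names below are 
hypothetical and not part of OpenSAF; this is just a sketch:

/* Hypothetical sketch, not OpenSAF code: carry a reason alongside the
 * failure value so the syslog line can state why the send failed. */
typedef enum {
        MDS_SND_FAIL_NO_ROUTE,           /* no route to the destination adest  */
        MDS_SND_FAIL_SUBSCR_TMR_EXPIRED, /* subscription timer already expired */
        MDS_SND_FAIL_DEST_DOWN           /* destination service/adest is down  */
} mds_snd_fail_reason_t;

static const char *mds_snd_fail_str(mds_snd_fail_reason_t r)
{
        switch (r) {
        case MDS_SND_FAIL_NO_ROUTE:           return "no route found";
        case MDS_SND_FAIL_SUBSCR_TMR_EXPIRED: return "subscription timer expired";
        case MDS_SND_FAIL_DEST_DOWN:          return "destination down";
        default:                              return "unknown reason";
        }
}

/* For example:
 * syslog(LOG_ERR, "MDS_SND_RCV: send from svc_id=%u to svc_id=%u failed: %s",
 *        from_svc, to_svc, mds_snd_fail_str(reason));
 */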
>
> 4.  A warning that an IMMND client went down is seen on all nodes of the cluster:
>
> Sep  6 16:16:09 host--s1-h1 osafimmnd[31600]: WA IMMND - Client went down so no response
> Sep  6 16:16:09 host--s1-h1 osafimmnd[31600]: NO ERR_BAD_HANDLE: Admin owner 19 does not exist
>
> a)      What went down?  What is the "client"?  The IMMND did not die or reboot.
[Vu] The IMM client that triggered the IMM requests went down. To see which 
client on which node went down, you can enable IMMND tracing by un-commenting 
the line below in immnd.conf:
#args="--tracemask=0xffffffff"
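
That is, after un-commenting, the line in immnd.conf (commonly 
/etc/opensaf/immnd.conf, depending on the installation) should read:

args="--tracemask=0xffffffff"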

[ll]  This is hard to do after the fact.  The error messages that are output 
contain no detail for one to truly understand what the problem is.  One should 
not have to turn on tracing to get that understanding, especially when these 
errors occur in a production environment.

The TRACE looks like:

<143>1 2019-09-10T11:37:15.496328+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1549"] 224:imm/immnd/immnd_evt.c:12316 T2 IMMA UP EVENT
<143>1 2019-09-10T11:37:15.496573+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1550"] 226:base/osaf_secutil.c:68 >> handle_new_connection ..
<143>1 2019-09-10T11:37:15.497708+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1560"] 224:imm/immnd/immnd_evt.c:980 T2 Added client with id: 11a0002010f <node:2010f, count:282> ...
<143>1 2019-09-10T11:38:30.891454+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1790"] 224:imm/immnd/immnd_evt.c:12175 T2 IMMA DOWN EVENT
<143>1 2019-09-10T11:38:30.891497+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1791"] 224:imm/immnd/immnd_proc.c:91 >> immnd_proc_imma_discard_connection
<143>1 2019-09-10T11:38:30.891521+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1792"] 224:imm/immnd/immnd_proc.c:96 T5 Attempting discard connection id:11a0002010f <n:2010f, c:282>
<143>1 2019-09-10T11:38:30.891784+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1801"] 224:imm/immnd/immnd_proc.c:310 T5 Removing client id:11a0002010f sv_id:26
<143>1 2019-09-10T11:38:30.891823+07:00 SC-1 osafimmnd 224 osafimmnd [meta sequenceId="1802"] 224:imm/immnd/immnd_proc.c:331 T5 Removed 1 IMMA clients

> b)      Can these messages be expanded to include more detail to truly 
> understand what occurred?
[Vu] At this point the detailed information about that client has already been 
removed from IMM; however, we can show the node id on which the client was 
running.
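
For what it is worth, the node id already appears to be recoverable from the 
client id printed in the trace: 11a0002010f decodes as count 0x11a (282) in the 
upper 32 bits and node id 0x2010f in the lower 32 bits, which matches the 
<node:2010f, count:282> shown above.  A minimal sketch (the bit layout is 
inferred from the trace output, not taken from the IMMND source):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: split an IMM client id of the form seen in the
 * trace above into the <node, count> pair it appears to encode. */
static void print_client_id(uint64_t client_id)
{
        uint32_t node_id = (uint32_t)(client_id & 0xffffffffULL);
        uint32_t count = (uint32_t)(client_id >> 32);
        printf("client id:%llx <node:%x, count:%u>\n",
               (unsigned long long)client_id, node_id, count);
}

int main(void)
{
        print_client_id(0x11a0002010fULL); /* -> <node:2010f, count:282> */
        return 0;
}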

[ll]   That would be helpful.  It is very difficult to debug these issues and to 
explain to others what the problems are.



________________________________
The information transmitted herein is intended only for the person or entity to 
which it is addressed and may contain confidential, proprietary and/or 
privileged material. Any review, retransmission, dissemination or other use of, 
or taking of any action in reliance upon, this information by persons or 
entities other than the intended recipient is prohibited. If you received this 
in error, please contact the sender and delete the material from any computer.

_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users
