Comments inline [ll]
Thanks in advance.
-----Original Message-----
From: [email protected]
[mailto:[email protected]]
Sent: Monday, September 9, 2019 8:19 AM
To: [email protected]
Subject: Opensaf-users Digest, Vol 72, Issue 2
Send Opensaf-users mailing list submissions to
[email protected]
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.sourceforge.net/lists/listinfo/opensaf-users
or, via email, send a message with subject or body 'help' to
[email protected]
You can reach the person managing the list at
[email protected]
When replying, please edit your Subject line so it is more specific than "Re:
Contents of Opensaf-users digest..."
Today's Topics:
1. Re: Opensaf-users Digest, Vol 72, Issue 1 (Minh Hon Chau)
----------------------------------------------------------------------
Message: 1
Date: Mon, 9 Sep 2019 20:48:56 +1000
From: Minh Hon Chau <[email protected]>
To: [email protected],
[email protected]
Subject: Re: [users] Opensaf-users Digest, Vol 72, Issue 1
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8; format=flowed
Hi,
You can find the MDS documentation here:
https://sourceforge.net/p/opensaf/internal-docs/ci/default/tree/programmers_reference/
From your osafntfd trace, notification 723 is sent by clientId=2. There is an
ntf subscriber with clientId=108, and notification 723 matches the subscription
criteria of clientId=108, so the notification is "forwarded" to clientId=108.
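To find every failed send in a longer trace, the ids can be pulled out mechanically. What follows is a minimal sketch, assuming the line wording seen in the osafntfd trace quoted further down ("client id: N, not_id: M" followed later by "ntfs_mds_msg_send to ntfa failed rc: R"). The function name failed_sends is made up for illustration, and other OpenSAF versions may word these lines differently.

```python
import re

# One pattern with two alternatives: a "send attempt" line or a "failure" line.
# Both wordings are copied from the trace in this thread (an assumption for
# any other OpenSAF version).
EVENT_RE = re.compile(
    r"client id: (?P<client>\d+), not_id: (?P<notif>\d+)"
    r"|ntfs_mds_msg_send to ntfa failed rc: (?P<rc>\d+)"
)


def failed_sends(trace_text):
    """Return a list of (client_id, notification_id, rc) for failed sends.

    The failure line carries no ids, so each failure is attributed to the
    most recent "client id: N, not_id: M" line seen before it.
    """
    # Collapse all whitespace first, since mail clients tend to re-wrap traces.
    text = " ".join(trace_text.split())
    failures = []
    last_send = None
    for m in EVENT_RE.finditer(text):
        if m.group("rc") is None:
            last_send = (int(m.group("client")), int(m.group("notif")))
        elif last_send is not None:
            failures.append(last_send + (int(m.group("rc")),))
            last_send = None  # do not reuse the ids for a later failure
    return failures


SAMPLE = """
Sep 3 19:57:26.107777 osafntfd [11558:ntfs_com.c:0286] T3 client id: 108, not_id: 723
Sep 3 19:57:26.108141 osafntfd [11558:ntfs_mds.c:1290] ER ntfs_mds_msg_send FAILED
Sep 3 19:57:26.108160 osafntfd [11558:ntfs_com.c:0308] ER ntfs_mds_msg_send to ntfa failed rc: 2
"""

print(failed_sends(SAMPLE))  # [(108, 723, 2)]
```

The sketch expects the raw trace file, without mail quote markers. Grouping the result by client id would show whether the failures cluster on a few clients.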
In ntf, a client can be a sender, a reader, or a subscriber. Each successful
saNtfInitialize() call creates a client, so you may have many ntf clients in
your process. I haven't tried calling saNtfInitialize() several times to create
multiple clients in one process, but I think it should work.
[ll] The components do not call saNtfInitialize() more than once.
As for the mds error "Subscription exists but no timer running", one possibility
is that the timer MDS_SUBSCRIPTION_TMR_VAL is a bit short, so it expired before
the dtm event MDTM_LIB_UP_TYPE could reach mds.
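The timer hypothesis above can be pictured with a toy model. This is a simplified sketch of the reasoning only, not the MDS implementation: the class, field, and function names are hypothetical, and the tick values are arbitrary numbers with no claim about the real timer units.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Destination:
    """Hypothetical sender-side view of one subscribed destination."""
    timer_armed_at: int          # tick when the subscription timer started
    timer_length: int            # plays the role of MDS_SUBSCRIPTION_TMR_VAL
    up_event_at: Optional[int]   # tick when the UP event arrives; None = never


def send_outcome(dest, now):
    """Classify a send attempted at tick `now` in this toy model."""
    if dest.up_event_at is not None and dest.up_event_at <= now:
        return "sent"      # the destination is already known to be up
    if now < dest.timer_armed_at + dest.timer_length:
        return "queued"    # timer still running: wait for the UP event
    return "failed"        # subscription exists but no timer running


# The UP event arrives at tick 650; the send is attempted at tick 600.
late_up = Destination(timer_armed_at=0, timer_length=500, up_event_at=650)
print(send_outcome(late_up, now=600))   # failed

# Same timing, but with a longer timer the send waits instead of failing.
longer = Destination(timer_armed_at=0, timer_length=1000, up_event_at=650)
print(send_outcome(longer, now=600))    # queued
```

In this model, a destination that never comes up (up_event_at=None) still fails once the timer has expired, so a longer timer would only help when the UP event is late rather than missing.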
[ll] These ntfs_mds_msg_send failures usually occur after an SU has died, a
node has died, etc. Could there be code that is not properly cleaning up
clients that have died?
If increasing the timer does not help, I think you can try turning on the dtm
trace and enabling the mds debug log (export MDS_LOG_LEVEL=5 in ntfd.conf), then
check whether the event MDTM_LIB_UP_TYPE is created at dtm and whether it
reaches mds.
[ll] Are there answers to the rest of the questions?
/Minh
On 7/9/19 10:21 pm, [email protected] wrote:
> Message: 1
> Date: Fri, 6 Sep 2019 21:07:22 +0000
> From: William R Elliott <[email protected]>
> To: "[email protected]"
> <[email protected]>, Lisa Ann Lentz-Liddell
> <[email protected]>, David S Thompson
> <[email protected]>
> Subject: [users] Issues concerning opensaf with TCP
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> Hello,
>
> We are using opensaf version 5.1.0. We have a cluster using tcp as a
> transport mechanism with opensaf multicast feature enabled.
> We would appreciate answers to the following questions:
>
> 1. Please provide a link or any document that gives details on how the
> opensaf mds layer works.
>
> 2. osafntfd ER ntfs_mds_msg_send FAILED - Trace of the problem.
>
> Sep 3 19:57:26.107676 osafntfd [11558:NtfClient.cc:0202] << notificationReceived
> Sep 3 19:57:26.107679 osafntfd [11558:NtfClient.cc:0147] >> notificationReceived: 108 2
> Sep 3 19:57:26.107685 osafntfd [11558:NtfFilter.cc:0464] >> checkFilter
> Sep 3 19:57:26.107711 osafntfd [11558:ntfsv_mem.c:0769] >> ntfsv_get_ntf_header
> Sep 3 19:57:26.107721 osafntfd [11558:ntfsv_mem.c:0790] << ntfsv_get_ntf_header
> Sep 3 19:57:26.107726 osafntfd [11558:NtfFilter.cc:0071] T8 numNotificationClassIds: 0
> Sep 3 19:57:26.107729 osafntfd [11558:NtfFilter.cc:0056] T8 num EventTypes: 1
> Sep 3 19:57:26.107732 osafntfd [11558:NtfFilter.cc:0060] T2 EventTypes matches
> Sep 3 19:57:26.107735 osafntfd [11558:NtfFilter.cc:0187] T8 num notificationObjects: 0
> Sep 3 19:57:26.107738 osafntfd [11558:NtfFilter.cc:0202] T8 num NotifyingObjects: 0
> Sep 3 19:57:26.107741 osafntfd [11558:NtfFilter.cc:0223] T2 hdfilter matches
> Sep 3 19:57:26.107745 osafntfd [11558:NtfFilter.cc:0087] T8 numSi: 0
> Sep 3 19:57:26.107748 osafntfd [11558:NtfFilter.cc:0471] << checkFilter
> Sep 3 19:57:26.107751 osafntfd [11558:NtfClient.cc:0184] T2 NtfClient::notificationReceived notification 723 matches subscription 0, client 108
> Sep 3 19:57:26.107756 osafntfd [11558:NtfNotification.cc:0105] T1 Subscription 0 added to list in notification 723 client 108, subscriptionList size is 1
> Sep 3 19:57:26.107761 osafntfd [11558:NtfSubscription.cc:0211] >> sendNotification
> Sep 3 19:57:26.107764 osafntfd [11558:NtfSubscription.cc:0222] T3 send_notification_lib called, client 108, notification 723
> Sep 3 19:57:26.107768 osafntfd [11558:ntfs_com.c:0284] >> send_notification_lib
> Sep 3 19:57:26.107771 osafntfd [11558:ntfsv_mem.c:0769] >> ntfsv_get_ntf_header
> Sep 3 19:57:26.107774 osafntfd [11558:ntfsv_mem.c:0790] << ntfsv_get_ntf_header
> Sep 3 19:57:26.107777 osafntfd [11558:ntfs_com.c:0286] T3 client id: 108, not_id: 723
> Sep 3 19:57:26.107781 osafntfd [11558:mds_c_sndrcv.c:0396] >> mds_send
> Sep 3 19:57:26.107785 osafntfd [11558:mds_c_sndrcv.c:0403] << mds_send
> Sep 3 19:57:26.107788 osafntfd [11558:mds_c_sndrcv.c:0681] >> mds_mcm_send
> Sep 3 19:57:26.107791 osafntfd [11558:mds_c_sndrcv.c:0916] >> mcm_pvt_normal_svc_snd
> Sep 3 19:57:26.107794 osafntfd [11558:mds_c_sndrcv.c:0956] >> mcm_pvt_normal_snd_process_common
> Sep 3 19:57:26.107800 osafntfd [11558:mds_c_sndrcv.c:1699] >> mds_mcm_process_disc_queue_checks
> Sep 3 19:57:26.107804 osafntfd [11558:mds_c_sndrcv.c:1740] TR in else if sub_info->tmr_flag !- true
> Sep 3 19:57:26.107813 osafntfd [11558:mds_c_sndrcv.c:1747] TR MDS_SND_RCV:Subscription exists but no timer running
> Sep 3 19:57:26.107816 osafntfd [11558:mds_c_sndrcv.c:1749] TR MDS_SND_RCV :L mds_mcm_process_disc_queue_checks
> Sep 3 19:57:26.107819 osafntfd [11558:mds_c_sndrcv.c:1750] << mds_mcm_process_disc_queue_checks
> Sep 3 19:57:26.107900 osafntfd [11558:mds_c_sndrcv.c:1048] << mcm_pvt_normal_snd_process_common
> Sep 3 19:57:26.107937 osafntfd [11558:mds_c_sndrcv.c:0937] >> mcm_pvt_normal_svc_snd
> Sep 3 19:57:26.107941 osafntfd [11558:mds_c_sndrcv.c:0846] << mds_mcm_send
> Sep 3 19:57:26.108141 osafntfd [11558:ntfs_mds.c:1290] ER ntfs_mds_msg_send FAILED
> Sep 3 19:57:26.108160 osafntfd [11558:ntfs_com.c:0308] ER ntfs_mds_msg_send to ntfa failed rc: 2
> Sep 3 19:57:26.108165 osafntfd [11558:NtfNotification.cc:0142] T1 Removing subscription 0 client 108 from notification 723, subscriptionList size is 0
> Sep 3 19:57:26.108169 osafntfd [11558:ntfs_com.c:0503] >> sendNotConfirmUpdate: client: 108, subId: 0, notId: 723
>
> a. The traces show that a notification is received for client id 108 and
> then an mds_send is tried for the same client id, but it fails because there
> is no timer running.
> b. What does a client represent? An opensaf process? An SU? A component?
> c. What is the purpose of sending a message back after receipt of the
> notification? Since it is not sent, it is discarded, and that does not seem
> to have any impact on the cluster.
> d. Hardcoded timers are defined in mds_main.c:
>
> uint32_t MDS_QUIESCED_TMR_VAL = 80;
> uint32_t MDS_AWAIT_ACTIVE_TMR_VAL = 18000;
> uint32_t MDS_SUBSCRIPTION_TMR_VAL = 500;
> uint32_t MDTM_REASSEMBLE_TMR_VAL = 500;
> uint32_t MDTM_CACHED_EVENTS_TMR_VAL = 24000;
>
> Could each one of these be explained? Can any of these be increased? If yes,
> what effect would that have?
>
>
> 3. Are there limits on the size of a cluster, the number of SGs, the
> number of SUs, or the number of components per SU? Testing shows that as the
> number of SUs/components increases, errors when starting the cluster and
> nodes appear and become more frequent. Some errors seen on different starts:
>
> Sep 6 16:16:09 host--s1-h2 osafimmnd[3119]: NO ERR_BAD_HANDLE: Admin owner 19 does not exist
>
> Sep 6 17:57:33 host--s1-h1 osafimmnd[2238]: WA MDS Send Failed to service:IMMND rc:2
> Sep 6 17:57:33 host--s1-h1 osafimmnd[2238]: ER Problem in sending to peer IMMND over MDS. Discarding admin op reply.
> Sep 6 17:57:33 host--s1-h1 osafimmnd[2238]: WA Error code 2 returned for message type 21 - ignoring
>
> a. Some of our SUs have more than 30 components.
> b. All of the components of the cluster perform an IMM search on
> specific SUs to understand the state of some 2N SG to work with the active SU
> of those SGs.
> c. All components register for notifications so that they can react to
> the 2N SG state changes.
>
> 4. Warning of IMMND Client went down is seen on all nodes of the cluster:
>
> Sep 6 16:16:09 host--s1-h1 osafimmnd[31600]: WA IMMND - Client went down so no response
> Sep 6 16:16:09 host--s1-h1 osafimmnd[31600]: NO ERR_BAD_HANDLE: Admin owner 19 does not exist
>
> a) What went down? What is "client"? The IMMND neither died nor rebooted.
> b) Can these messages be expanded to include more detail to truly
> understand what occurred?
>
>
>
>
> ________________________________
> The information transmitted herein is intended only for the person or entity
> to which it is addressed and may contain confidential, proprietary and/or
> privileged material. Any review, retransmission, dissemination or other use
> of, or taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer.
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Opensaf-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>
>
> ------------------------------
>
> End of Opensaf-users Digest, Vol 72, Issue 1
> ********************************************
>