Aha, I have seen this one before. It is a behavior difference between MDS/TCP 
and MDS/TIPC: with TIPC we get flow control for free because the send is 
blocking, while in the TCP case it obviously is not. Any kind of bursty send 
will trigger this, for example a burst of async LOG messages.
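
If I remember the pattern right, the TCP side does a nonblocking send and 
parks anything that does not fit on a bounded "unsent" queue, so a burst that 
outruns the socket hits the cap, which matches the "MDTM unsent message is 
more!=200" line in the syslog below. A minimal sketch of that pattern (the 
names and the 200-entry cap are my illustration, not the actual MDS code):

    /* Sketch of a nonblocking send with a bounded "unsent" queue; the
     * names and the 200-entry cap are illustrative, not real MDS/TCP code. */
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define UNSENT_MAX 200

    struct msg {
        const char *buf;
        size_t len;
        struct msg *next;
    };

    static struct msg *unsent_head, *unsent_tail;
    static size_t unsent_cnt;

    /* Returns false on queue overflow or a hard socket error. */
    static bool tcp_try_send(int sd, struct msg *m)
    {
        ssize_t n = send(sd, m->buf, m->len, MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n == (ssize_t)m->len)
            return true;                /* fully sent */
        if (n >= 0 || errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Socket buffer full (on a partial write real code would also
             * remember the sent offset); park it for the poll loop. */
            if (unsent_cnt >= UNSENT_MAX)
                return false;           /* burst outran the receiver */
            m->next = NULL;
            if (unsent_tail)
                unsent_tail->next = m;
            else
                unsent_head = m;
            unsent_tail = m;
            unsent_cnt++;
            return true;
        }
        return false;
    }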

Why can't the send be blocking in the MDS/TCP case, so that this queue can be removed?
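
For comparison, here is roughly what a blocking send would look like on the 
same connected SOCK_STREAM socket (a sketch, assuming we own the fd's flags): 
clear O_NONBLOCK and let TCP's own flow control put the sender to sleep until 
the kernel socket buffer drains.

    /* Sketch: with O_NONBLOCK cleared, send() blocks until the kernel
     * socket buffer has room, so TCP flow control throttles a bursty
     * sender much like TIPC's blocking send - no user-space queue. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    static int tcp_send_blocking(int sd, const void *buf, size_t len)
    {
        int flags = fcntl(sd, F_GETFL, 0);
        if (flags < 0 || fcntl(sd, F_SETFL, flags & ~O_NONBLOCK) < 0)
            return -1;

        const char *p = buf;
        while (len > 0) {
            ssize_t n = send(sd, p, len, MSG_NOSIGNAL);
            if (n < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted, just retry */
                return -1;              /* hard error, caller tears down */
            }
            p += n;
            len -= (size_t)n;
        }
        return 0;
    }

The obvious trade-off, and presumably why the queue exists, is that one slow 
or dead peer would then stall every send behind it, so maybe the queue was a 
deliberate choice; but it clearly breaks down under bursts like this one.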

I wonder if I already wrote a ticket on this...


---

**[tickets:#1157] MDS: IMMD coredumps in MDS BCAST send (TCP with MCAST_ADDR)**

**Status:** unassigned
**Milestone:** 4.5.0
**Created:** Tue Oct 07, 2014 12:57 AM UTC by Adrian Szwej
**Last Updated:** Tue Oct 07, 2014 09:21 AM UTC
**Owner:** nobody

Changeset: **4.6.M0 - 6009:b2ddaa23aae4**
When starting ~50 Linux containers, IMMD coredumps, resulting in a cluster reset.
Communication is over TCP.
The dtmd.conf configuration is:

    DTM_SOCK_SND_RCV_BUF_SIZE=65536
    DTM_CLUSTER_ID=1
    DTM_NODE_IP=172.17.1.42
    DTM_MCAST_ADDR=224.0.0.6

The sync batch size was reduced to 4096:

    opensafImm=opensafImm,safApp=safImmService
    Name                                               Type         Value(s)
    ========================================================================
    opensafImmSyncBatchSize                            SA_UINT32_T  4096 (0x1000)

When node PL-51 joins the cluster, the following messages are seen in the syslog:

    Oct  6 00:35:57 SC-1 osafdtmd[1028]: NO Established contact with 'PL-51'
    Oct  6 00:35:57 SC-1 osafimmd[1063]: NO Extended intro from node 2330f
    Oct  6 00:35:57 SC-1 osafimmd[1063]: NO Node 2330f request sync sync-pid:79 epoch:0
    Oct  6 00:35:58 SC-1 osafimmnd[1072]: NO Announce sync, epoch:292
    Oct  6 00:35:58 SC-1 osafimmnd[1072]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
    Oct  6 00:35:58 SC-1 osafimmnd[1072]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
    Oct  6 00:35:58 SC-1 osafimmd[1063]: NO Successfully announced sync. New ruling epoch:292
    Oct  6 00:35:58 SC-1 osafimmloadd: NO Sync starting
    Oct  6 00:36:00 SC-1 osafimmd[1063]:  MDTM unsent message is more!=200
    Oct  6 00:36:00 SC-1 osafimmnd[1072]: WA Director Service in NOACTIVE state - fevs replies pending:9 fevs highest processed:20037
    Oct  6 00:36:00 SC-1 osafamfnd[1143]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
    Oct  6 00:36:00 SC-1 osafamfnd[1143]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
    Oct  6 00:36:00 SC-1 osafamfnd[1143]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
    Oct  6 00:36:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
    Oct  6 00:36:00 SC-1 osafimmnd[1072]: NO No IMMD service => cluster restart, exiting

A coredump was generated:
core_1412555760.osafimmd.1063
