On Fri, Sep 19, 2008 at 2:28 PM, Roger Spellman <[EMAIL PROTECTED]> wrote:
> Sasha,
> I am running OFED 1.3.1.
>
> My SN Manager is opensmd. /var/log/opensm.log shows the following:
>
> Sep 19 14:21:19 480217 [43806960] 0x02 -> SUBNET UP
> Sep 19 14:21:19 818276 [41001960] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x04
> num:144 Producer:1 (Channel Adapter) from LID:0x0011
> TID:0x0000000000000000
> Sep 19 14:21:19 818330 [41001960] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:4 num:144 from LID:0x0011
> GID:0xfe80000000000000,0x0002c9020027d451
> Sep 19 14:21:19 823408 [43806960] 0x02 -> osm_ucast_mgr_process: minhop
> tables configured on all switches
> Sep 19 14:21:19 827220 [43806960] 0x02 -> SUBNET UP
> Sep 19 14:21:27 283873 [41802960] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR
> 1B12: __validate_more_comp_fields, __validate_port_caps, or JoinState =
> 0 failed from port 0x0002c9020026e4c1 ( HCA-1), sending
> IB_SA_MAD_STATUS_REQ_INVALID
> Sep 19 14:21:43 298367 [42804960] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR
> 1B12: __validate_more_comp_fields, __validate_port_caps, or JoinState =
> 0 failed from port 0x0002c9020026e4c1 ( HCA-1), sending
> IB_SA_MAD_STATUS_REQ_INVALID
> Sep 19 14:21:59 312765 [42003960] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR
> 1B12: __validate_more_comp_fields, __validate_port_caps, or JoinState =
> 0 failed from port 0x0002c9020026e4c1 ( HCA-1), sending
> IB_SA_MAD_STATUS_REQ_INVALID

It's likely a rate issue where the negotiated port rate is not the
broadcast group rate. What does ibstat or ibstatus show when the join
fails? Also, what about saquery -g?

> Rebooting the node that failed to join the group always seems to solve
> the problem.

Yes, that's consistent with the negotiated rate being a problem.

-- Hal

> Thanks for your help.
>
> -Roger
>
>> -----Original Message-----
>> From: Sasha Khapyorsky [mailto:[EMAIL PROTECTED]
>> Sent: Friday, September 19, 2008 1:06 PM
>> To: Roger Spellman
>> Cc: [email protected]
>> Subject: Re: [ofa-general] Intermittent: ib0: multicast join failed
>>
>> On 16:45 Thu 18 Sep , Roger Spellman wrote:
>> > I have many nodes, each with a Mellanox MT25204. When I reboot some
>> > nodes, they occasionally get the following error:
>> >
>> > ib0: multicast join failed
>>
>> What is the software stack? Which version?
>>
>> > Rebooting the system almost always solves this problem.
>> >
>> > What causes this?
>>
>> Which SM are you using? If it is OpenSM you can see in the log
>> (/var/log/opensm.log) why the join failed.
>>
>> > Is there a way to solve this without rebooting?
>>
>> Hard to say - the reason for failure is unknown. It could be the port's
>> low speed/width or something else, hard to say without any details.
>>
>> Sasha
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
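
The rate hypothesis in Hal's reply can be illustrated with a small sketch. It assumes that a port's rate is its active link width times the per-lane speed, and that the SM refuses a join from a port slower than the group rate (which would be consistent with the __validate_port_caps failure in the log, but is not confirmed by the thread). The helper names and the 10 Gb/s group rate are hypothetical and chosen only for illustration; the real values would come from ibstat/ibstatus and saquery -g.

```python
# Hypothetical helpers for illustration only; not part of OFED or OpenSM.

# Per-lane signalling rates in Gb/s for the InfiniBand link speeds.
LANE_SPEED_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def port_rate_gbps(width_lanes: int, speed: str) -> float:
    """Negotiated port rate = number of active lanes * per-lane speed."""
    return width_lanes * LANE_SPEED_GBPS[speed]


def join_allowed(port_rate: float, group_rate: float) -> bool:
    """Assumed rule: a port slower than the group rate is refused."""
    return port_rate >= group_rate


if __name__ == "__main__":
    group_rate = 10.0  # assumed group rate (4x SDR), not taken from the thread
    for width in (4, 1):
        rate = port_rate_gbps(width, "SDR")
        allowed = join_allowed(rate, group_rate)
        print(f"{width}x SDR -> {rate} Gb/s, join allowed: {allowed}")
```

Under these assumptions, a link that trains at reduced width or speed after a reboot would fail the join until it renegotiates, which fits the observation that rebooting again usually clears the problem.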
