On Mon, 2007-01-29 at 13:17, chas williams - CONTRACTOR wrote:
> recently our sm started throwing the following errors:
>
> Jan 29 18:10:49 706710 [42003940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
> Jan 29 18:10:49 706721 [42003940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> Jan 29 18:10:51 345113 [42804940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
> Jan 29 18:10:51 345132 [42804940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> Jan 29 18:10:51 514312 [41802940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
> Jan 29 18:10:51 514320 [41802940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> Jan 29 18:10:51 735732 [42804940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken

32 is too low for MLID space support IMO.

> we tracked this down to a problem with ipoib interaction
> with ipv6. ipv6 joins two multicast groups, instead of
> just one like ipv4.
>
> # netstat -A inet6 -g -n
> ...
> IPv6/IPv4 Group Memberships
> Interface       RefCnt Group
> --------------- ------ ---------------------
> lo              1      ff02::1
> ib0             1      ff02::1:ff00:77a2
> ib0             1      ff02::1
>
>
> # netstat -A inet6 -g -n
> ...
> IPv6/IPv4 Group Memberships
> Interface       RefCnt Group
> --------------- ------ ---------------------
> lo              1      224.0.0.1
> ib0             1      224.0.0.1
>
>
> # cat /sys/kernel/debug/ipoib/ib0_mcg
> GID: ff12:401b:ffff:0:0:0:0:1
>   created: 4298482097
>   queuelen: 0
>   complete: yes
>   send_only: no
>
> GID: ff12:401b:ffff:0:0:0:ffff:ffff
>   created: 4298482097
>   queuelen: 0
>   complete: yes
>   send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:1
>   created: 4298482097
>   queuelen: 0
>   complete: yes
>   send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff00:77a2
>   created: 4298482097
>   queuelen: 0
>   complete: yes
>   send_only: no
>
>
> the ff02::1:ff00:77a2 group is specific to the interface (link local),
> so each of our ib hosts running ipv6 registers its own unique multicast
> group. since our network is bigger than 32 hosts, it appears that we
> have exceeded the multicast tables in our local switches and this is
> making opensm generate the above error.
>
> besides not running ipv6, are there any thoughts about this?

This has been discussed on the list before. The last time was a thread
on "IPv6 and IPoIB scalability issue" from late November (11/30) to
early December (12/2). Some options were presented there. None have
been pursued, to the best of my knowledge.
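
For anyone who wants to check the arithmetic behind the report, here is a
rough Python sketch of how the groups above appear to map to IPoIB GIDs and
why one group per host runs past 32 MLIDs. The GID layout (ff, flags 1,
scope 2, signature 0x401b for IPv4 or 0x601b for IPv6, P_Key, then the low
bits of the IP group) is inferred from the ib0_mcg output quoted above and
my reading of RFC 4391, so treat it as an assumption. The function names
and the 40 host addresses are made up for illustration.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # seen in ff12:401b:... above
IPV6_SIGNATURE = 0x601B  # seen in ff12:601b:... above
MLID_LIMIT = 32          # "All available:32 mlids are taken"


def ipoib_mgid(group, pkey=0xFFFF, scope=0x2):
    """Map an IP multicast group to an IPoIB GID (assumed layout)."""
    addr = ipaddress.ip_address(group)
    if addr.version == 4:
        signature = IPV4_SIGNATURE
        group_id = int(addr) & 0x0FFFFFFF       # low 28 bits of the IPv4 group
    else:
        signature = IPV6_SIGNATURE
        group_id = int(addr) & ((1 << 80) - 1)  # low 80 bits of the IPv6 group
    prefix = 0xFF10 | scope                     # ff, flags=1, scope
    value = (prefix << 112) | (signature << 96) | (pkey << 80) | group_id
    return ipaddress.IPv6Address(value)


def solicited_node(unicast):
    """IPv6 solicited-node group: ff02::1:ff00:0 plus the low 24 address bits."""
    low24 = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def distinct_mgids(unicast_addrs):
    """Collect the GIDs a fabric would need for the given IPv6 hosts."""
    groups = {
        # IPoIB broadcast group, copied from the debug output rather than derived
        ipaddress.IPv6Address("ff12:401b:ffff:0:0:0:ffff:ffff"),
        ipoib_mgid("224.0.0.1"),  # IPv4 all-hosts
        ipoib_mgid("ff02::1"),    # IPv6 all-nodes
    }
    for addr in unicast_addrs:
        groups.add(ipoib_mgid(str(solicited_node(addr))))
    return groups


if __name__ == "__main__":
    # Hypothetical link-local addresses for 40 hosts.
    hosts = [f"fe80::{i + 1:x}" for i in range(40)]
    needed = len(distinct_mgids(hosts))
    print(f"{needed} multicast groups needed, {MLID_LIMIT} MLIDs available")
```

The three shared groups cost the same no matter how many hosts there are.
The solicited-node groups are what grow with the host count, as long as
the low 24 bits of the addresses differ.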

-- Hal

_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general