recently our sm (opensm) started logging the following errors:

Jan 29 18:10:49 706710 [42003940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
Jan 29 18:10:49 706721 [42003940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
Jan 29 18:10:51 345113 [42804940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
Jan 29 18:10:51 345132 [42804940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
Jan 29 18:10:51 514312 [41802940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
Jan 29 18:10:51 514320 [41802940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
Jan 29 18:10:51 735732 [42804940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken

we tracked this down to the way ipoib interacts with ipv6.  with ipv6
enabled, each interface joins two multicast groups (the all-nodes group
plus a per-host group), instead of just one as with ipv4.

        # netstat -A inet6 -g  -n
        ...
        IPv6/IPv4 Group Memberships
        Interface       RefCnt Group
        --------------- ------ ---------------------
        lo              1      ff02::1
        ib0             1      ff02::1:ff00:77a2
        ib0             1      ff02::1


        # netstat -A inet -g  -n
        ...
        IPv6/IPv4 Group Memberships
        Interface       RefCnt Group
        --------------- ------ ---------------------
        lo              1      224.0.0.1
        ib0             1      224.0.0.1


        # cat /sys/kernel/debug/ipoib/ib0_mcg
        GID: ff12:401b:ffff:0:0:0:0:1
          created: 4298482097
          queuelen:         0
          complete:       yes
          send_only:       no

        GID: ff12:401b:ffff:0:0:0:ffff:ffff
          created: 4298482097
          queuelen:         0
          complete:       yes
          send_only:       no

        GID: ff12:601b:ffff:0:0:0:0:1
          created: 4298482097
          queuelen:         0
          complete:       yes
          send_only:       no

        GID: ff12:601b:ffff:0:0:1:ff00:77a2
          created: 4298482097
          queuelen:         0
          complete:       yes
          send_only:       no


the ff02::1:ff00:77a2 group is the link-local solicited-node address,
which is derived from the low 24 bits of the interface's ipv6 address, so
each of our ib hosts running ipv6 registers its own, effectively unique,
multicast group.  since our network is bigger than 32 hosts, it appears
that we have exceeded the multicast table size of our local switches
(opensm reports only 32 mlids available), and this is making opensm
generate the above errors.
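
to make the mapping concrete, here is a rough python sketch (not taken from
the ipoib or opensm sources) of how a host address leads to its own
solicited-node group, and how an ip multicast group ends up as one of the
GIDs listed in ib0_mcg above.  the function names are made up for
illustration, the unicast address is hypothetical (only its low 24 bits
matter), and the defaults for pkey (0xffff) and scope (2) are assumptions
read off the debugfs output rather than anything authoritative.

```python
import ipaddress

# Signature fields seen in the GIDs above: 401b for ipv4, 601b for ipv6.
IPV4_SIGNATURE = 0x401B
IPV6_SIGNATURE = 0x601B


def solicited_node(unicast: str) -> ipaddress.IPv6Address:
    """Solicited-node group: ff02::1:ff00:0/104 plus the low 24 address bits."""
    low24 = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def ipoib_mgid(group: str, pkey: int = 0xFFFF, scope: int = 0x2) -> str:
    """Map an ip multicast group to an ipoib multicast GID string.

    pkey and scope defaults are assumptions matching the ff12:...:ffff
    prefix in the debugfs listing.
    """
    addr = ipaddress.ip_address(group)
    if addr.version == 6:
        signature = IPV6_SIGNATURE
        group_id = int(addr) & ((1 << 80) - 1)  # low 80 bits of the group
    else:
        signature = IPV4_SIGNATURE
        group_id = int(addr) & ((1 << 28) - 1)  # low 28 bits of the group
    mgid = (
        (0xFF << 120)         # multicast prefix
        | (0x1 << 116)        # flags
        | (scope << 112)      # scope
        | (signature << 96)   # ipv4/ipv6 signature
        | (pkey << 80)        # partition key
        | group_id
    )
    # Print eight uncompressed hex groups, the way ib0_mcg does.
    return ":".join(format((mgid >> s) & 0xFFFF, "x") for s in range(112, -1, -16))


if __name__ == "__main__":
    # Hypothetical link-local address ending in 00:77a2.
    group = solicited_node("fe80::202:c903:0:77a2")
    print(group)                           # ff02::1:ff00:77a2
    print(ipoib_mgid(str(group)))          # ff12:601b:ffff:0:0:1:ff00:77a2
    print(ipoib_mgid("ff02::1"))           # ff12:601b:ffff:0:0:0:0:1
    print(ipoib_mgid("224.0.0.1"))         # ff12:401b:ffff:0:0:0:0:1
```

the all-nodes groups map to the same GID on every host, but the
solicited-node GID differs per host, which is why the number of groups
(and so the number of mlids requested) grows with the number of ipv6 hosts.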

besides not running ipv6, are there any thoughts about this?

_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general
