On Mon, 2007-01-29 at 13:17, chas williams - CONTRACTOR wrote:
> recently our sm started throwing the following errors:
> 
> Jan 29 18:10:49 706710 [42003940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
> Jan 29 18:10:49 706721 [42003940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> Jan 29 18:10:51 345113 [42804940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
> Jan 29 18:10:51 345132 [42804940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> Jan 29 18:10:51 514312 [41802940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken
> Jan 29 18:10:51 514320 [41802940] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> Jan 29 18:10:51 735732 [42804940] -> __get_new_mlid: ERR 1B23: All available:32 mlids are taken

Support for only 32 MLIDs is too low, IMO.

> we tracked this down to a problem with ipoib interaction
> with ipv6.  ipv6 joins two multicast groups, instead of 
> just one like ipv4.
> 
>       # netstat -A inet6 -g  -n
>       ...
>       IPv6/IPv4 Group Memberships
>       Interface       RefCnt Group
>       --------------- ------ ---------------------
>       lo              1      ff02::1
>       ib0             1      ff02::1:ff00:77a2
>       ib0             1      ff02::1
> 
> 
>       # netstat -A inet -g  -n
>       ...
>       IPv6/IPv4 Group Memberships
>       Interface       RefCnt Group
>       --------------- ------ ---------------------
>       lo              1      224.0.0.1
>       ib0             1      224.0.0.1
> 
> 
>       # cat /sys/kernel/debug/ipoib/ib0_mcg
>       GID: ff12:401b:ffff:0:0:0:0:1
>         created: 4298482097
>         queuelen:         0
>         complete:       yes
>         send_only:       no
> 
>       GID: ff12:401b:ffff:0:0:0:ffff:ffff
>         created: 4298482097
>         queuelen:         0
>         complete:       yes
>         send_only:       no
> 
>       GID: ff12:601b:ffff:0:0:0:0:1
>         created: 4298482097
>         queuelen:         0
>         complete:       yes
>         send_only:       no
> 
>       GID: ff12:601b:ffff:0:0:1:ff00:77a2
>         created: 4298482097
>         queuelen:         0
>         complete:       yes
>         send_only:       no
> 
> 
> the ff02::1:ff00:77a2 group is specific to the interface (link local),
> so each of our ib hosts running ipv6 registers its own unique multicast
> group.  since our network is bigger than 32 hosts, it appears that we
> have exceeded the capacity of the multicast tables in our local
> switches, and this is making opensm generate the above errors.
> 
> besides not running ipv6, are there any thoughts about this?
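
For anyone who wants to check the arithmetic, here is a rough Python sketch of why the group count grows with the host count. It assumes the per-interface group is the IPv6 solicited-node address (ff02::1:ff plus the low 24 bits of the interface address) and that the MGID layout is the one visible in the ib0_mcg dump above (ff12, then 401b or 601b, then the P_Key, then the low bits of the IP group). The helper names, the host addresses and the fixed link-local scope are made up for illustration; this is not taken from the ipoib or opensm sources.

```python
import ipaddress

def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Build ff02::1:ffXX:XXXX from the low 24 bits of a unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

def ipoib_mgid(group: str, pkey: int = 0xFFFF, scope: int = 0x2) -> str:
    """Map an IP multicast group to an MGID laid out like the ib0_mcg dump.

    Assumption: link-local scope (ff12) and the default P_Key 0xffff,
    which is what the dump in this thread shows.
    """
    ip = ipaddress.ip_address(group)
    if ip.version == 4:
        signature, low = 0x401B, int(ip) & 0x0FFFFFFF      # low 28 bits
    else:
        signature, low = 0x601B, int(ip) & ((1 << 80) - 1)  # low 80 bits
    value = (0xFF10 | scope) << 112 | signature << 96 | pkey << 80 | low
    words = [(value >> shift) & 0xFFFF for shift in range(112, -16, -16)]
    return ":".join(f"{w:x}" for w in words)

# 40 hypothetical hosts, each with a different link-local address.
hosts = [f"fe80::{i:x}" for i in range(1, 41)]
groups = {"224.0.0.1", "ff02::1"} | {str(solicited_node(h)) for h in hosts}
mgids = {ipoib_mgid(g) for g in groups}
print(len(mgids))  # 42 distinct groups, each of which needs an MLID
```

With ipv4 only, the two shared groups stay constant no matter how many hosts join. With ipv6, every host adds one more group, so a fabric with more than about 30 such hosts runs past the 32 MLIDs reported in the log.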

This has been discussed on the list before. The last time was the thread
"IPv6 and IPoIB scalability issue", which ran from late November (11/30)
to early December (12/2). Some options were presented there, but none has
been pursued to the best of my knowledge.

-- Hal

> 
> _______________________________________________
> openib-general mailing list
> openib-general@openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 

