On Thu, 2008-06-12 at 14:08 +0300, Olga Shern (Voltaire) wrote:
> On 6/12/08, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> > Hi Olga,
> >
> > On Thu, 2008-06-12 at 09:46 +0300, Olga Shern wrote:
> > > Hi All,
> > >
> > > We have found something that seems like Infiniband Spec hole,
> >
> > What's the spec hole ?
>
> According to the Infiniband spec - partial member cannot "talk" with
> partial member only with full member.
> Therefore if partial member sending MC packet - all other partial
> members of this partition will generate BAD PKEY trap.
> It means that the behavior that we see is according to Infiniband
> Spec - but very problematic

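To make the rule quoted above concrete, here is a minimal sketch of the
P_Key check as I understand it: the top bit of the 16-bit P_Key is the
membership bit (1 = full, 0 = partial), the low 15 bits must be equal,
and at least one side must be a full member. The helper names, the node
table and the base value 0x1234 are made up for illustration; the
thread only gives the P_Key as XXX.

```python
FULL_MEMBER_BIT = 0x8000  # membership bit: 1 = full, 0 = partial (limited)
BASE_MASK = 0x7FFF        # low 15 bits identify the partition


def pkeys_match(sender_pkey, receiver_pkey):
    """True if a packet carrying sender_pkey passes the receiver's check."""
    same_partition = (sender_pkey & BASE_MASK) == (receiver_pkey & BASE_MASK)
    one_is_full = bool((sender_pkey | receiver_pkey) & FULL_MEMBER_BIT)
    return same_partition and one_is_full


def bad_pkey_receivers(sender, nodes):
    """Names of the other group members that would flag the sender's packet."""
    src = nodes[sender]
    return sorted(
        name
        for name, pkey in nodes.items()
        if name != sender and not pkeys_match(src, pkey)
    )


BASE = 0x1234  # hypothetical partition number, not taken from the thread
nodes = {
    "A": BASE | FULL_MEMBER_BIT,  # full member
    "B": BASE,                    # partial members
    "C": BASE,
    "D": BASE,
    "E": BASE,
    "F": BASE,
}

# A multicast packet (e.g. an ARP request) sent by partial member B
print(bad_pkey_receivers("B", nodes))
# The same packet sent by full member A
print(bad_pkey_receivers("A", nodes))
```

Under these assumptions every partial member other than the sender
fails the check when B sends, and none fail when A sends. The sketch
does not model the originator receiving its own packet back from the
switch, which is raised further down in the thread.
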
Originally, multicast groups were all full member only; more recently
this was extended to allow partial members, and this case was missed.
A comment should be filed against the spec on this.

> > > This issue is system issue that prevents from partial P_Key
> > > setup to go into production.
> >
> > Indeed :-(
> >
> > > Short Setup & test description:
> > > ------------------------------------------
> > > * Node A: P_Key XXX (full member)
> > > * Node B, C, D, E, F: P_Key XXx (partial member)
> > >
> > > 1. Send ping from B -> A : ping is OK
> > > 2. Send ping from C -> A : ping is OK
> > > 3. Send ping from B -> C : no ping also OK
> > >
> > > * Get traps Bad P_Key in SM - from all HCA in the fabric both for
> > >   test 1 & 2 (one time) and also for test 3 (all the time).

What does all the time mean ? Does this mean with one test 3 ping, the
traps are repeated ? If so, at what rate ?

> > > Probably the ARP request that is MC traffic generate the trap in
> > > HCA, for test 1 & 2 we have only one ARP but for test 3 we send
> > > ARP all the time because we do not get any ARP reply.
> > >
> > > * The trap number SM get is 257 (HCA trap) if we will do P_Key
> > >   switch enforcement we will probably get 259
> >
> > Is this with OpenSM or VSM ?
>
> We tested it with Voltaire SM but it should behave the same with
> OpenSM.

That's likely but I'm not sure yet.

> > -- Hal
> >
> > > * We get trap also from the originator of the MC traffic even
> > >   though that receive switch relay error counter is increased
> > >   (when out port==in port), the switch does not drop the packet ?

The implementation of that counter is broken and occurs "normally". The
increment of this counter is relatively meaningless :-(

> > > Additional questions/issues:
> > > * Do we have a way to suppress port traps from SMA ?? i.e.
> > > that the port will not generate traps that can "kill the SM" - as
> > > its look this is bug in the spec where we can't send any mc
> > > traffic (even ARP) when we have partial members and we do not
> > > have a way to suppress the traps.

All the SM can do is TrapRepress.

> > > * What will happen in the HCA when we get many traps (mc packets
> > >   from many nodes) and they need to keep all events until SM will
> > >   acknowledge? - Is there limitation in the number of on-going
> > >   traps (any HCA specific issues)?

Assuming you mean events from which traps are generated, I think this
is left as an implementation dependent detail in terms of the spec. An
implementation needs to take care not to lose certain events; others
like this aren't critical but that's left to the specific SMA
implementation.

-- Hal

> > > Best Regards
> > >
> > > Olga

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
