On Sat, 2005-09-24 at 16:43, Eitan Zahavi wrote:
> Well, if this is the case then OpenSM might stop responding due to the
> following features:
> 1. We had in the past cases where bad hardware continuously flooded
>    the SM with Traps.
>    To protect against this kind of DOS attack we have implemented an
>    adaptive filter in the SM trap receiver:
>    If the exact same trap is received continuously from the same source
>    more than 10 times (with no more than 5 sec between the traps) they
>    are considered DOS and are ignored.
>    Please see osm_trap_rcv.c for details.
> 2. The way IB switches work is that each time one of their ports
>    changes state they:
>    a. Set the "change bit" in the SwitchInfo
>    b. Send a trap 128 to the SM. But Trap 128 does not carry the
>       changed port number.
>
> So under a test case like you describe what can happen:
> 1. The SM decides to ignore trap 128 from the switch as more than 5
>    connect/reconnect sequences happen with not enough "quiet" time to
>    recover.
> 2. The SwitchInfo ChangeBit is sampled during the OSM light sweep.
>    There is a race between the reading of the change bit and the
>    clearing of it. If the connect/disconnect happens very fast the
>    change bit set by the re-connect can be cleared by the clear
>    started by the disconnect.
>
> It is easy to see in the log file if the SM did ignore traps. Run with
> -V and look for:
> grep "Continuously received this trap" /var/log/osm.log

This is what is happening. So the policy is 5 reconnect sequences
without coming up? What counts as not enough quiet time for recovery?
Is this settable?

> (for some reason I did not get any log attachments with this thread -
> otherwise I would do some analysis on it too).

I will forward separately. This was too big for the list.

-- Hal
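
For reference, the trap filter described in the quoted text can be sketched
roughly as follows. This is only an illustration of the stated rule (same
trap, same source, more than 10 times, no more than 5 sec apart), not the
code in osm_trap_rcv.c. The names TrapFilter, accept, source_lid and
trap_num are hypothetical, and keying on (source LID, trap number) is an
assumption standing in for "the exact same trap".

```python
from dataclasses import dataclass

# Thresholds taken from the quoted description; whether OpenSM exposes
# them as settings is not stated in this thread.
MAX_REPEATS = 10     # "more than 10 times"
MAX_GAP_SEC = 5.0    # "no more than 5 sec between the traps"


@dataclass
class _Entry:
    count: int        # consecutive traps seen without a long enough gap
    last_seen: float  # timestamp of the most recent trap, in seconds


class TrapFilter:
    """Hypothetical repeat filter; not the OpenSM implementation."""

    def __init__(self, max_repeats=MAX_REPEATS, max_gap=MAX_GAP_SEC):
        self.max_repeats = max_repeats
        self.max_gap = max_gap
        self._seen = {}

    def accept(self, source_lid, trap_num, now):
        """Return True if the trap should be handled, False if ignored."""
        key = (source_lid, trap_num)
        entry = self._seen.get(key)
        # First trap for this key, or enough quiet time has passed:
        # start a fresh run and handle the trap.
        if entry is None or now - entry.last_seen > self.max_gap:
            self._seen[key] = _Entry(count=1, last_seen=now)
            return True
        # Still inside the window: extend the run. Ignored traps also
        # refresh last_seen, so quiet time is measured from the last
        # trap received, not the last trap handled.
        entry.count += 1
        entry.last_seen = now
        return entry.count <= self.max_repeats


# Twelve trap 128s from LID 3, one second apart.
flt = TrapFilter()
results = [flt.accept(3, 128, float(t)) for t in range(12)]
```

In this sketch the first ten traps are handled and the rest are ignored
until the source stays quiet for more than 5 seconds. If each
disconnect/reconnect produces two port state changes, and so two trap
128s, then 5 sequences would amount to 10 traps, which would line up with
the "more than 5 connect/reconnect sequences" in the quoted text. That is
an inference from this thread, not something confirmed here.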
