Peter Memishian writes: > This is happening because we end up scheduling a recovery timer when one > is already running, via this code in ip_ndp_excl():
ip_ndp_excl is invoked only when a currently up-and-running non-duplicate interface encounters a DAD failure. Someone else has claimed our address. ipif_recovery_id, by contrast, is set to timeout() ID only when the interface has been taken down by a DAD failure. Thus, up-and-running non-duplicate implies ipif_recovery_id == 0, and ipif_recovery_id != 0 implies IFF_DUPLICATE is set. Because of that context, there _should_ be no way to get here when ipif_recovery_id is non-zero or when a timeout is already running. > That is, I found that ipif_recovery_id was already pointing to a live > timeout when we set it above (through some ASSERTs I added, not shown). > So, the original ipif_recovery_id ends up getting lost, and thus if the > ipif is freed the timer remains, and when it finally fires some 30 seconds > later it corrupts an unrelated 320 byte buffer -- or worse. Yes, that'd be a problem, but it should not happen. > It seems that it was "by design" that a recovery timer should not be > running when ip_ndp_excl() was called (i.e., it's not a bug that we don't > check "ipif_recovery_id != 0" above). Correct. > Assuming that's the case, the bug > seems to be that for IPv6, if we go through ill_restart_dad() -> > ndp_do_recovery() -> ip_ndp_recover(), we forget to stop the recovery > timer. As a result, if the ipif again becomes a duplicate, then > ip_ndp_excl() will clobber ipif_recovery_id as previously described. This > doesn't happen for IPv4 since ill_restart_dad() goes through ARP and then > comes back via ip_arp_excl() which calls ipif_resolver_up() which cancels > the recovery timer. > > Please let me know what you think. I'm testing out a fix to this right > now using the stress test, and if you agree with the above analysis, I can > integrate it with the Clearview IPMP wad. Yes, I agree with that. At the point where IPIF_DUPLICATE is cleared in ip_ndp_recover, we should no longer have a running recovery timer. ndp_do_recovery could check this and cancel the timer before doing qwriter_ip on ip_ndp_recover. -- James Carlson, Solaris Networking <james.d.carlson at sun.com> Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
