Peter Memishian writes:
> This is happening because we end up scheduling a recovery timer when one
> is already running, via this code in ip_ndp_excl():

ip_ndp_excl is invoked only when a currently up-and-running
non-duplicate interface encounters a DAD failure.  Someone else has
claimed our address.

ipif_recovery_id, by contrast, is set to timeout() ID only when the
interface has been taken down by a DAD failure.

Thus, up-and-running non-duplicate implies ipif_recovery_id == 0, and
ipif_recovery_id != 0 implies IFF_DUPLICATE is set.

Because of that context, there _should_ be no way to get here when
ipif_recovery_id is non-zero or when a timeout is already running.

> That is, I found that ipif_recovery_id was already pointing to a live
> timeout when we set it above (through some ASSERTs I added, not shown).
> So, the original ipif_recovery_id ends up getting lost, and thus if the
> ipif is freed the timer remains, and when it finally fires some 30 seconds
> later it corrupts an unrelated 320 byte buffer -- or worse.

Yes, that'd be a problem, but it should not happen.

> It seems that it was "by design" that a recovery timer should not be
> running when ip_ndp_excl() was called (i.e., it's not a bug that we don't
> check "ipif_recovery_id != 0" above).

Correct.

>  Assuming that's the case, the bug
> seems to be that for IPv6, if we go through ill_restart_dad() ->
> ndp_do_recovery() -> ip_ndp_recover(), we forget to stop the recovery
> timer.  As a result, if the ipif again becomes a duplicate, then
> ip_ndp_excl() will clobber ipif_recovery_id as previously described.  This
> doesn't happen for IPv4 since ill_restart_dad() goes through ARP and then
> comes back via ip_arp_excl() which calls ipif_resolver_up() which cancels
> the recovery timer.
> 
> Please let me know what you think.  I'm testing out a fix to this right
> now using the stress test, and if you agree with the above analysis, I can
> integrate it with the Clearview IPMP wad.

Yes, I agree with that.  At the point where IPIF_DUPLICATE is cleared
in ip_ndp_recover, we should no longer have a running recovery timer.

ndp_do_recovery could check this and cancel the timer before doing
qwriter_ip on ip_ndp_recover.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677

Reply via email to