On Wed, 23 Sep 2009 12:10:44 +0100
Alan Maguire <Alan.Maguire at Sun.COM> wrote:

> Michael Hunter wrote:
> > On Tue, 22 Sep 2009 16:40:46 +0100
> > Alan Maguire <Alan.Maguire at Sun.COM> wrote:
[...]
> > ncu.c:251 Why?  You could have received a state{less,full} address by
> > now.
> >
> >   
> This got me thinking. I've been going back and
> forth between two approaches in fixing these issues.
> The first, represented in this webrev, is to note if
> any static addresses are configured and go online
> directly if some are configured. The second is to
> leave that to the handle_if_state() function, so when
> it sees a static address assigned, it moves us online.
> The second  approach is better, being more general,

I hadn't thought along these lines but I agree it is more general.

I also think it is a specific example of a more general problem in our
implementation.  We do some stuff immediately and other stuff based on
an event/work queue.  This creates event sequences where cause and
effect get muddled.

> but I'd been seeing some weirdness with spurious
> RTM_DELADDRs following RTM_NEWADDRs
> (I haven't fully determined the source of these yet).
> 
> I've gone back to the latter approach, reinstating
> the code that drops RTM_DELADDR IF_STATE events
> which occur when the address isn't really gone. 

This is really using new info to filter old events.  It could be seen
as a way of debouncing the event stream.  I would rather figure out
where the extra events are coming from.  But as long as we document
what we are doing clearly I don't see an issue with doing this for now.

> Doing
> this gets rid of the issue where we were bouncing from
> online to offline as the mysterious RTM_DELADDRs
> made us think we had no addresses left (this only
> happens with static addresses which makes me
> think the cause is in the order in which we do
> things in add_ipaddr(), but I'm not sure. ifconfig.c
> sets prefixlen etc before setting the address, that
> may help here).

Write an CR to track.

> 
> In looking more closely at nwamd_ncu_handle_if_state_event()
> I've found a major issue - we're trying to map an
> address to a logical interface number by walking the
> current configuration to find which lifnum has the
> address assigned to it. Problem is, if it's an
> RTM_DELADDR, the address is gone and we bail

I remember seeing this and thought I'd gotten rid of it.  Sorry.

> from the IF_STATE handling function without
> marking the NCU down. The solution is just to use
> the logical interface name from the IF_STATE event.

Important to make sure that a deletion of an address just triggers us
reevaluating the NCU, not necessarily taking it down.

> I also noticed there was a bug in events.c where we
> try to set the nwe_name argument to the link name
> rather than the logical interface name (removing the
> :<lifnum>) but never do. I've fixed this so the
> nwe_name field is the link name and the nwe_ifname
> (a new field in the structure) contains the logical ifname.
> This has allowed us to remove a bunch of unneeded code
> and to icfg_open() the lif directly.

great!

> > ncu_ip.c:342 (curiosity) Was setting the flags back to what they were
> > causing a problem?
> >
> >   
> No, but when I was casting around for reasons
> for the spurious RTM_DELADDRs I figured
> mimimizing the cases where we change
> the flags made sense.

understood

> > ncu_ip.c:390 the prefix argument can go away since you are getting the whole
> > newaddr message
> >
> >   
> Good point, I've removed it.
> > ncu_ip.c:1286,1294,1294,1407 This must never happen in testing since
> > you alloc the address on the first go around and free it every time
> > after that ;)
> >
> >   
> Good catch! I've relocated the retry: label.
> 
> I've retested for autoconf, and for the various
> combinations of static/dhcp and static/autoconf
> v6, and will see if I can retest on John's rig later
> today when he's about. All looks good.
> Revised webrev at:
> 
> http://zhadum.east/export/ws/amaguire/nwam1-fixes/webrev/
> 
> Thanks for the review!

ncu_ip.c:1133 We probably can't get ourselves into the situation where
DHCP is on anything other then the physical interface, but we will
probably have to allow for it later.  We need a CR to track that issue.

ncu_ip.c:1224 The thing I dislike most about this is that we spin on
the retry.  There are probably a bunch of blocking operations between
the retry and the goto so it isn't a hard spin.  Maybe better if we can
schedule this via the work queue for a retry in a second?

Looks good.

        Michael

> 
> Alan

Reply via email to