Thiru,
Thanks a ton for your detailed feedback. I was only expecting you to look
at the IPSQ design, but I *certainly* appreciate the other comments.
You've found some interesting issues, many of which I've now addressed (or
had already stumbled on and addressed in the past month of development).
There are a few questions for you in my responses below.
> I haven't really looked at the arp changes. Do you create the
> ace_t only on the IPMP meta arl? Is this true even for our own
> ace_t's i.e. ACE_F_MYADDR?
Yes, all aces have an ace_arl of the IPMP meta-arl. They also have an
ace_xmit_arl that represents the physical arl they will transmit on.
> The request from IP still goes up on the physical interface stream?
It depends -- but it also really doesn't matter, as ARP knows what arl's
belong to which meta-arl's and thus can map the arl back to its meta-arl.
(ARP learns these relationships through the new AR_JOIN_IPMP and
AR_LEAVE_IPMP messages sent by IP.) Further, all ARP resolutions are
done across the entire group -- e.g., if ce0 resolve an address, the ACE
will be created against the IPMP arl, and if ce1 then requests resolution
of the same address, it will find the created entry.
> On IP's side, how does it look like ? the ire caches's ire_stq points
> to the physical interface, but the ire_ipif points to the IPMP meta
> interface ?
Correct.
> 1. L6561 ip6.c SIOCGTUNPARAM in ip_rput_v6 need not not go thru qwriter
Yep; same with OSIOCGTUNPARAM. Fixing this is part of Nevada bug 6516276,
which I will be sending out for code review soon (along with a bunch of
other IP fixes). I've also fixed it in this workspace for completeness.
> 2. There is some code in ip_rput_dlpi_writer() that handles the case when
> the DLPI response does not match what was expected. One possible reason
> for this is <ctrl-c> of an ioctl that was waiting. (L15091 ip.c)
> This can be moved to ip_rput_dlpi similar to the new checks you added in
> ip_rput_dlpi for checking whether the ACK is for the pending operation.
Yes, I've made that change as part 6516277, which is also part of the code
review I mentioned in (1). I've decided not to make the change here as
well, since it needs some other fixes to make it complete (e.g., 6522958
and 6522960), and I'll pull in all that stuff for free once I merge with
Nevada after I fix the bugs there.
> 3. ifconfig can use the dynamic ppa assignment code in kernel than going
> in a loop to find the next unused ppa for the IPMP interface.
> (L3887 ifconfig.c). We already have code in IP for this (for plumbing
> vlans)
> ifconfig needs to pass in UINT_MAX as ppa and ill_alloc_ppa() will do it.
Interesting. That would be a nice bit of logic to simplify; I will put it
on the TODO list.
> 4. ipif_set_values_tail() L18576
> if (IS_IPMP(ill))
> ..
> if (phyi->phyint_grp == NULL) {
> ASSERT(!ipmp_grp_exists(ill->ill_name));
> assert may trip if someone does
> ifconfig ipmp0 ipmp
> ifconfig ipmp0 group ipmp1
> ifconfig ipmp1 ipmp
Yes, that's a bug. I've changed the code to:
/*
* If someone has renamed another IPMP group to have
* the same name as our interface, then fail.
*/
if (ipmp_grp_exists(ill->ill_name)) {
rw_exit(&ipmp_g_lock);
return (EEXIST);
}
Now the above example produces:
ifconfig: SIOCSLIFNAME for ip: ipmp1: already exists
ifconfig: ipmp1: cannot create IPMP interface: already exists
> 5. L18590 ip_if.c may not be ok. V4 IPMP ill may be plumbed and now we may
> be trying to plumb V6. In this case we should not do ipmp_grp_destroy.
Good point. I've changed the code to track whether it created the group
earlier in the function, and then do:
illg = ipmp_illgrp_create(ill);
if (illg == NULL) {
--> if (ipmp_grp_created) {
ipmp_grp_destroy(phyi->phyint_grp);
phyi->phyint_grp = NULL;
}
rw_exit(&ipmp_g_lock);
return (ENOMEM);
}
This should be safe because ipmp_g_lock is held across the whole set of
operations.
> 6. L18606 ip_if.c
> /*
> * If the peer of the ill being plumbed is currently in an IPMP
> group,
> * but the IPMP meta-interface is not plumbed for the new ill type,
> * kick the peer out of the group.
>
> Failing SLIFNAME may be better than trying to yank out the peer ill from the
> group and causing a silent
This issue has gotten a little trickier in the latest bits, because the
logic has been moved to the end of SIOCSLIFNAME (in ip_rput_dlpi_writer(),
after everything else has been done), since the duplicate MAC address
detection logic done by ipmp_ill_join_illgrp() needs to be done after all
the DLPI dances have completed and ill_phys_addr is set. At that point,
it's basically impossible to abort the SIOCSLIFNAME.
That said, I think this particular issue is pretty minor, since I don't
think this case will ever happen in practice.
> 6. L18587 ip_if.c
> /*
> * Create and link up the illgrp, if necessary.
> */
> if (ill->ill_grp == NULL) {
>
> Wouldn't' ill_grp always be NULL since this is a brand new ill.
Yes, that check isn't needed and has been removed.
> 7. Need to restructure the code such that we send the routing socket
> messages after dropping the locks. ipmp_phyint_join_grp ->
> ipmp_ill_rtsaddrmsg holds the ipmp_g_lock and ill_lock while doing the
> putnext.
As per (9) below, in the latest code, ipmp_g_lock is no longer held here.
Not sure how to handle ill_lock, as I need to keep the list of ipifs on
the ill stable. Do you have a suggestion? I've added an IPMPXXX to the
code regarding this issue for the time being.
> 8. ipmp_phyint_join_grp()
>
> phyi->phyint_ipsq->ipsq_swxop =
> &grp->gr_phyint->phyint_ipsq->ipsq_ownxop;
> should be able to also assert
> /* IPMP pseudo interface never changes its xop */
> ASSERT(grp->gr_phyint->phyint_ipsq->ipsq_ownxop ==
> grp->gr_phyint->phyint_ipsq->ipsq_xop);
Good idea. I've added that.
> 9. ip_sioctl_groupname
> should be able to drop the ipmp_g_lock after the call to
> ipmp_phyint_join_grp(phyi, grp); before calling qwriter_ip
Yes. In fact, in the latest code, I added explicit gr_pendv4 and
gr_pendv6 counters to handle the race with ip_join_illgrps() and
ipmp_g_lock is dropped prior to the call to ipmp_phyint_join_grp():
/*
* Before we drop ipmp_g_lock, bump gr_pend* to ensure that
* the IPMP meta-interface ills needed by `phyi' cannot go
* away before ip_join_illgrps() is called back.
*/
if (phyi->phyint_illv4 != NULL)
grp->gr_pendv4++;
if (phyi->phyint_illv6 != NULL)
grp->gr_pendv6++;
--> rw_exit(&ipmp_g_lock);
ipmp_phyint_join_grp(phyi, grp);
ill_refhold(ill);
qwriter_ip(NULL, ill, ill->ill_rq, ipsq_mp, ip_join_illgrps,
SWITCH_OP, B_FALSE);
> 10. ipsq_enter/ ipsq_dq() / ipsq_exit.
>
> ipsq_dq: Need to drop ipsq_lock and the ill locks before the call to
> ipsq_exit(xopipsq,... at L7617. It may be good to add some assertions
> at the top of ipsq_exit to make sure we don't hold any lock at entry
> to ipsq_exit.
As per the comment above the ipsq_exit() call, I'm not sure why this would
be needed, since our ill and ipsq are no longer part of the group ipsq
being processed by that ipsq_exit().
> Need to become writer on the own xop after switching the xop
> before calling ipsq_exit() on the group xop at L7617. Otherwise nothing to
> prevent some other thread becoming writer on our xop
I don't see how this could happen -- we've still got our IPSQ locked, so
no one should be able to access our xop before we become writer on it as
part of step (4).
> ipsq_enter: also needs to respect the ipx_ipsq_queued
If so, it seems like there's a bug in Nevada, since ipsq_enter() in Nevada
does not respect ipsq_xopq_mphead. (ipx_ipsq_queued is essentially a
replacement for ipsq_xopq_mphead that allows the thread to check whether
any of the IPSQ's sharing the xop have ipsq_xopq_mphead != NULL.)
> 11. Need to restructure so that we postpone the putnext (could just
> construct a
> msg chain instead of the putnext) until we drop the ipmp_g_lock.
>
> ipmp_ill_deactivate -> ipmp_ill_bind_ipif -> ipif_resolver_up -> putnext
> ipmp_ill_deactivate -> ipmp_ill_unbind_ipif -> putnext.
> ipmp_ill_activate -> ipmp_ill_bind_ipif -> ipif_resolver_up -> putnext
>
> Maybe we just need to hold the ipmp_g_lock locally wheneve we need to
> manipulate lists or counts.(ig_if, ig_actif, associated counts etc.)
> and not through most of ipmp_ill_deactivate / ipmp_ill_activate. There is
> only 1 writer thread on the entire ipmp group doing this. A new ill trying
> to join the group can only proceed until some point and ip_join_illgrps
> happens only as writer on the ipmp group after the switch. An existing ill
> can leave the group only after becoming writer on the group.
Yes, I think that's the right answer. The main reason for ill_g_lock in
ipmp_ill_{de,}activate() is that some of the ig_* fields are accessed by
threads sending and receiving data (e.g., through ipmp_illgrp_next_ill()).
I will spend some time refactoring this.
> 12. consider the addres redistribution in ipmp_ill_activate
> Eg. IPMP ill = Addr A, B, C
> ill 1 -> Addr A, B,
> ill 2 -> Addr C.
> Now we add ill 3 and are in ipmp_ill_activate
> ill 3 is not yet in the ig_actif list.
> ill_ndata + 1 of ill 3 is 1 and is >= ill2's (minill) ill_ndata
> So we break out of the loop won't do any redistribution, when we
> could actually assign 1 address to each ill.
You've got eagle eyes :-) I found this bug just yesterday when I finally
started testing on a 3-interface machine. The code now reads:
for (;;) {
maxill = ipmp_illgrp_max_ill(illg);
ASSERT(maxill != NULL);
if (ill->ill_ndata + 1 >= maxill->ill_ndata) {
list_insert_tail(&illg->ig_actif, ill);
break;
}
ipif = ipmp_ill_unbind_ipif(maxill, NULL);
ipmp_ill_bind_ipif(ill, ipif, B_FALSE);
}
--
meem