Re: [networking-discuss] Start of code review for IP datapath refactoring

Sebastien Roy Mon, 28 Sep 2009 10:06:44 -0700

On Sun, 2009-09-27 at 14:01 -0700, Erik Nordmark wrote:
> Sebastien Roy wrote:
> 
> > usr/src/uts/common/inet/iptun/iptun.c
> > * 759: This looks like it belongs in conn_get_ixa() as there is only
> >   one possible value for the ixa_cred of the ixa returned by
> >   conn_get_ixa(), and that is conn_cred.
> 
> That code isn't needed; too much copy and paste. (It only applies to UDP 
> where we want to track the cred for each send* so that SCM_UCRED can be 
> used on the receive side.)
> However, re-reading the TX code I realize we need to always restart with 
> ixa_tsl = crgetlabel(ixa_cred) before we call tsol_check_dest(), but 
> that code belongs in ip_set_destination_*() where we call tsol_check_dest().


Okay.

> 
> > * 761-769: This also looks like something that should be in
> >   conn_get_ixa().
> 
> Ditto.  (it only applies to UDP ancillary data where the label change 
> for each packet due to SCM_UCRED).

Okay.

> 
> > * 768: Won't we need to set ixa_tsl on a per-packet basis?
> 
> Yes, we do that by calling tsol_check_label_* in the output path. But to 
> do that we need the initial label which is based on the zone.

Got it.

> > * 862: This can't happen because we call iptun_unbind() before
> >   iptun_bind() if the tunnel is already bound.  I suppose you could
> >   optimize things to make the iptun_unbind() unnecessary if you want.
> 
> I'll remove 862.

Perhaps an ASSERT() as well?

> > * 870,872: We should just have an ipcl_conn_insert() that calls
> >   ipcl_conn_insert_v[4|6]() depending on conn_ipversion.
> >   tcp_conn_request() and conn_connect() both have the same little
> >   if/else dance.
> 
> OK, but for uniformity it makes sense to do the same for ipcl_bind_insert.

Yes.

> > * 1795: There are some places in the code where ip_get_pmtu() is
> >   called with an argument of connp->conn_ixa directly without first
> >   getting a safe reference using conn_get_ixa() (in icmp_icmp_input()
> >   for example).  Is that deliberate?
> 
> We only call ip_get_pmtu() in one place, but there are places where we 
> use conn_ixa directly. However, currently that is only in iptun_create().
> Where did you see it be used?

The question was more about the usage in icmp_icmp_input(), where I had
noticed conn_ixa directly.  Is the usage there safe?

> > 
> > * 1976: What is this call to ip_attr_connect() about?
> 
> If ixa_ire was NULL we got a safe copy.
> 
> But I think I'll change the ixa_notify upcall to pass up the 
> ip_xmit_attr_t; that way we will have the currently used ip_xmit_attr_t 
> and can modify it directly which avoids the conn_get_ixa in 
> iptun_get_maxmtu.

Good idea.

> > * 2093: The ICMP error doesn't need to have the same cred as the
> >   original packet?
> 
> I read RFC 791 and there aren't credentials for IP packets ;-)
> 
> What is the purpose of the cred on the packet?
> We use them for getpeerucred() and SCM_UCRED. They have also been 
> overloaded to carry a label, but with refactoring we have the explicit 
> ixa_tsl for the label.

I'm simply concerned that the TX label associated with the ICMP error
will be correct.  If your code ensures that, then that's fine.

> > * 2150,2200: The cred associated with the conn_t won't necessarily be the
> >   same as the cred associated with the data.  The label we want in the
> >   packet is the one associated with the data, right?  Labels used to
> >   be computed using dblk creds...
> 
> I was hoping to already have merged with labeled IPsec, to be able to 
> answer authoritatively.
> Given that iptun in onnv-gate has the implicit side effect of preserving 
> db_credp hence the label from the original ICMP error, it makes sense to 
> do the same by explicitly passing the ts_label_t to the iptun_icmp 
> functions.

Okay.

> > * 2677: The first argument is always a conn_t *, so there appears to
> >   be no reason to pass in void * that is then set to a conn_t *.
> 
> conn_recv was introduced by FireEngine - conn_recvicmp just follows what 
> conn_recv started - it doesn't make sense to have them be different.
> And I've exceeded the quota for how much of FireEngine I'm allowed to 
> clean up ;-)

I don't see why they couldn't be different, but okay, it's your call.

> > * 2677: The third argument is always NULL, so why have a third argument?
> 
> It is non-NULL for tcp, just like conn_recv has a non-NULL third 
> argument for TCP.

I missed that.

> > * 2781: Can we use ira_ip_hdr_len to make this faster and simple?
> 
> It will be a bit faster at the expense of duplicated code. The 
> duplication would be to have
>       outer_hlen = ira->ira_ip_hdr_length;
>       if (ira->ira_flags & IRAF_IS_IPV4) {
>               outer4 = (ipha_t *)data_mp->b_rptr;
>               outer6 = NULL;
>       } else {
>               outer4 = NULL;
>               outer6 = (ip6_t *)data_mp->b_rptr;
>       }
>       if (MBLKL(data_mp) == outer_hlen) {
>               inner_mp = data_mp->b_cont;
>               ipha = (ipha_t *)inner_mp->b_rptr;
>       } else {
>               inner_mp = data_mp;
>               ipha = (ipha_t *)(data_mp->b_rptr + outer_hlen);
>       }
>       switch (IPH_HDR_VERSION(ipha)) {
>       case IPV4_VERSION:
>               if (inner_mp->b_wptr - (uint8_t *)ipha < sizeof (ipha_t))
>                       goto drop;
>               inner4 = ipha;
>               inner6 = NULL;
>               break;
>       case IPV6_VERSION:
>               if (inner_mp->b_wptr - (uint8_t *)ipha < sizeof (ip6_t))
>                       goto drop;
>               *inner4 = NULL;
>               *inner6 = (ip6_t *)ipha;
>               break;
>       default:
>               goto drop;
>       }
> thus a subset of what iptun_find_headers does.
> Is that what you want?

Not quite, I was thinking that iptun_find_headers() could take an
outer_hlen as input if that was already a known quantity.  That would
make the IPv6 input path a tad faster in the face of extension headers,
which there are by default (the encapsulation limit option).

> > * 2804: I believe the answer is yes, but danmcd should confirm.  In
> >   the old world, I believe an ipsec_in_t M_CTL would be pre-pended
> >   even on a clear-text packet if there was global policy present.
> >   This would cause ipsec_tun_inbound() to be called on such packets.
> 
> I'll check with Dan.

Okay.

> 
> >   One would thus assume that IRAF_IPSEC_SECURE should be set if there
> >   was global policy present in the ip input path(?).
> 
> That doesn't seem like the natural semantics of a flag that says SECURE 
> - it shouldn't be set if the packet was received in the clear.
> Perhaps an IRAF_IPSEC_HAS_GLOBAL_POLICY would make sense.

That's fine with me.

> 
> > * 2989: Don't we lose the db_credp here?
> 
> Yes, intentionally. db_credp in the new world is only needed for certain 
> messages going to sockfs for getpeerucred(), and in the case of TX for 
> certain messages going to rpcmod.
> Using db_credp less means less overloaded semantics.

Got it.

> 
> > * 3183,3276: We call iptun_find_headers() twice in a row for 6to4
> >   output processing.
> 
> The alternative would be to pass 5 more arguments (outer{4,6}, 
> inner{4,6} and outer_hlen) to iptun_output_common(). Is that better?

I suppose not.  I thought that perhaps this could be fixed by
reorganizing the calling chain, but it doesn't look that simple.

> 
> > * 3246-3259: Can this be done before calling iptun_output_common() so
> >   that iptun_output_common() becomes a tail-call?
> 
> ixa_refrele() can't be done until we are done with the ixa.
> 
> If you think the extra stack frame makes a big difference I can apply a 
> pragma inline.

Not a big deal; if it's not a trivial fix, I wouldn't go out of my way
for the stack frame savings.

> 
> > * 3297: I'm thinking that it might make sense to unconditionally
> >   include the labeling overhead in the link MTU if
> >   is_system_labeled().  This would simplify things a bit I think and
> >   would reduce the round trips necessary to make pmtu converge on
> >   labeled systems.  This isn't new with your changes, but just a
> >   fleeting thought.
> 
> The labeling overhead is not constant; different 6to4 destinations can 
> have CIPSO and not have CIPSO.
> I don't know if it would make sense to always subtract the maximum size 
> of a CIPSO option on a labeled system. Was that what you had in mind?

Yes (this is what IPsec overhead is; a maximum possible overhead), but I
admittedly don't know enough about TX labeling to know if it makes sense
to do something similar.  An RFE for another day.

Thanks,
-Seb


_______________________________________________
networking-discuss mailing list
[email protected]

Re: [networking-discuss] Start of code review for IP datapath refactoring

Reply via email to