On Mon, 2009-08-24 at 15:43 -0700, Peter Memishian wrote: > Seb, > > Here's my final round of feedback; sorry for being late. In general I > found the code to be an excellent refactoring of a thorny area -- nice > work! (There are a few IPsec-related pieces that seemed overcomplicated, > but looking at Nevada's tun.c they look at least as confusing before, so > on the whole it's still an improvement even in those areas.)
Thank you. > > Webrev: > > http://cr.opensolaris.org/~seb/clearview-iptun-cr3/ > > usr/src/uts/common/inet/iptun/iptun.c: > > * 47: s/iptun_task_cv/iptun_task_cb/ ACCEPT > > * 51, elsewhere: Calling these primitives iptun_hold() and > iptun_release() is very confusing since those are usually the > names used for reference counting, which is of course totally > orthogonal to locking. Suggest iptun_enter() and iptun_exit(). ACCEPT: will call them iptun_enter() and iptun_exit(). > > * 62: What in <sys/strsubr.h> does this file rely on? That header > contains symbols private to the STREAMS framework. ACCEPT: On a related note, there are over 300 files outside of the framework including this file in the kernel. Something seems generally amiss if it's not meant to be included by anything outside of the framework. > * 77: What in <sys/ip_impl.h> does this file rely on? ACCEPT: Nothing at this point. I think it was used at one point to take advantage of the IP_HDR_CKSUM(), but that's not used . Removed. > > * 84-86: This will erroneously consider the tuple > <IPTUN_TYPE_UNKNOWN, AF_INET> a match; I presume that never > happens, but a more explicit check could catch bugs earlier. ACCEPT > > * 88: Appears to be an older version of IPTUN_HASH_KEY(). ACCEPT: moved IPTUN_HASH_KEY here and removed IPCL_HASH_KEY. > > * 90-94: Seems like something that belongs in <inet/ip.h> ACCEPT: Okay, I'll move this. > > * 110: Either this belongs in <netinet/in.h> or it shouild be > prefixed with IPTUN_. ACCEPT: Adeed an IPTUN_ prefix. > > * 129: Why is this IPTUN_MIN_IPV6_MTU and not IPTUN_MIN_IPV4_MTU? Because only IPv6 interfaces are plumbed above 6to4 links. > > * 142: Unclear why iptun_hash_lock is not `static'. ACCEPT: It is now. > > * 144: Likewise, unclear why iptun_tunnelcount is not `static'. ACCEPT: It can be, and now it is. > > * 226: Why is it not necessary to dispatch a IPTUN_TASK_LINK_UPDATE > to update the link state? ACCEPT: This was an oversight, it should be done. > > * 251: Will this force users of iptun_m_unicst() to have special > code to deal with iptun? Is it inappropriate to map this to > iptun_setsrc()? The existing interface is there mainly for the benefit of callers of mac_unicast_primary_set(), which can handle failure (they won't need special code for IP tunnels). The only such caller that really applies to IP tunnels is the handling of DL_SET_PHYS_ADDR_REQ in proto_setphysaddr_req(). Using this to set the tunnel source (without setting the tunnel destination) is meaningless anyway, which is why I didn't add support for it. > > * 259: Why EINVAL rather than ENOTSUP? ACCEPT: Changed to ENOTSUP. > > * 313-316: Why is this flag needed? Seems like you could just > test whether iptun_hoplimit == IPTUN_DEFAULT_HOPLIMIT in > iptun_headergen(). ACCEPT > > * 341: Unclear why we have this short-circuit path for MTU, but > not for hoplimit or encaplimit -- but as a result it seems one > cannot set a fixed MTU if it happens to be the same value as the > current (dynamic) MTU. ACCEPT: Will add check for other properties, and will set the the IPTUN_FIXED_MTU flag if the mtu is valid regardless of equality with the current value. > * 344: Given that `maxmtu' is dynamic, this seems a problematic > property to set programmatically since a call may or may not > fail depending on dynamic conditions. Further, if the call is > made in while maxmtu is a large value, that value remains set > even when maxmtu shrinks, which seems asymmetric. (IP has many > of the same problems with its current MTU handling.) I agree that this has problems, and I struggled with coming up with a good solution. I don't know that it's possible, but suggestions are welcome. We could set the maximum to the maximum IP datagram size minus tunneling overhead (which is in the tens of thousands of bytes), and assume that administrator will have enough sense to set a value that works (given that most values above the current dynamically computed MTU most likely won't work). This would allow the administrator to override the system default in a symmetric way (they can use any value at any time, regardless of the currently dynamically computed maximum). Do you think that makes more sense? > * 349: No way to cause IPTUN_FIXED_MTU to be cleared -- though the > same limitation also exists today with IP interfaces. This is a limitation of the property kernel API. There is no way to know that the user has done a "reset". It just comes down as a set with a default value. > > * 374-381: Would've expected this to be done by GLDv3, not by each > individual driver (sigh). There are indeed warts in this driver API. > > * 430: Would've expected the default to be the value that's > initialized in iptun_create() -- i.e. iti_maxmtu. This is related to your earlier comment. The current maximum is equal to the current dynamically computed maximum. If we decide that such a maximum is artificial and asymmetric, an absolute maximum could be provided (like iti_maxmtu). > > * 452-453: Delete extra blank lines. ACCEPT > > * 490: Might be more accurate to name this iptun_enter_by_linkid() > or somesuch. (To someone used to standard terminology, this > looks like a standard lookup function that acquires a reference, > when in fact it holds a lock.) ACCEPT > > * 492: Needless initialization. ACCEPT > > * 507, 510, 514: s/defered/deferred/ ACCEPT > > * 558: Is there any other reason (besides a bug) that the lookup > could fail? Yes, the tunnel could have been deleted while the task was sitting in the taskq. > > * 611-612: Seems like this should be an ASSERT(0) -- and probably > moved to the first `switch' statement. ACCEPT > > * 634: `itd' leaked on ddi_taskq_dispatch() failure. ACCEPT > > * 761: Seems needless to call iptun_taskq_dispatch() here if > IPTUN_CONDEMNED is set. ACCEPT > > * 768: Unclear why iptun_headergen() returns a value since it > cannot fail. Can also remove associated error handling at > 317-322, 335, 1362, and 1523. ACCEPT: Code evolved into a simpler structure, and the error handling factored itself out. :-) > > * 840: Function name seems too general. Moreover, it seems like > there's one or more missing IPsec utility functions, as this > code looks tedious and repetitious. ACCEPT: It is, and I also noticed that it inserts IPsec policy for both IPv4 and IPv6, which doesn't make sense to me. A tunnel can only transmit one or the other depending on its type. This code look identical to the ipsec_set_req() code that sets per-socket policy. I've created an ipsec_insert_policy() in ip.c that this and ipsec_set_req() now both call. > > * 865, 876, 885: No need to clean up partial established policy > on failure? Maybe that's handled by ipsec_swap_policy() in > the caller? This code is really obscure. This is done by ipsec_polhead_flush() in the caller. > > * 894-895: No idea what this comment means (and this function > could certainly benefit from a clear comment here). ACCEPT: I've done my best to clarify the comment. I have no idea what "new world" Dan was referring to, so I removed the last part of that comment. > > * 909: Is this really worth three exclamations? ACCEPT: No > > * 910: Supposed to be "&" rather than "&&"? Still not sure I > understand this check though -- e.g., why is it OK for > ipsr_self_encap_req to be IPSEC_PREF_UNIQUE? ACCEPT: I think you're right, there are two bugs on that line. It should be (ipsr->ipsr_self_encap_req != 0). > > * 925: Is this call to dls_mgmt_get_linkinfo() safe given that > iptun_lock is held? Yes. This isn't an "upcall" since it's just a utility function in the dls module. It doesn't enter this link's perimeter. > > * 935: Cast is only necessary because ipsec_actvec_from_req() > has a broken function signature. ACCEPT: Will constify ipsec_actvec_from_req() and ipsec_prot_from_req() > > * 949-991: The locking here might be fine, but I can't make heads > or tails of it. Seems like a lot of IPsec machinery is directly > exposed to consumers. DEFER: There is quite a bit of direct manipulation of internal data structures from the IPsec implementation here, but I don't think there's much that I can do about it short of implementing a higher level kernel API for this stuff (there doesn't appear to be one). > > * 1016: Could be `static'. ACCEPT > > * 1032: From the perspective of the caller, it's unclear what > iptun_setparams() may have done if the call fails and multiple > flags were set in itk_flags, which seems to limit the usefulness > being able to set multiple params at once. ACCEPT: Will have iptun_setparams() restore the iptun_t to its original state on failure. > > * 1043, 1057: Outer `if' statement seems useless. ACCEPT > > * 1079-1092: Punting this to the hapless caller via EAGAIN > complicates the interface with an implementation artifact. > Seems like another mechanism is needed here. (IIRC, GLDv3 > ioctls are not allowed to block, which means the mechanism > might need to be quite a bit different.) DEFER: I've brought this up with the IPsec team and Erik Nordmark, and the consensus was that since we'll never run into this problem (you need IPsec keys and policy to use IPsec tunnels, and the creation of those objects will have loaded IPsec before tunnels are created), it was not worth my re-architecting how IPsec is loaded at this point. > > * 1098: As an aside, I'm unclear what constitutes a "simple" > policy. Is this defined somewhere? ACCEPT: Simple policy is policy described in the ipsec_req_t embedded in an iptun_kparams_t, which originates from the ifconfig command line. It's "simple" because it can only state "do ESP on everything" or "do AH on everything" (as opposed to the rich policy language offered by ipsecconf). I've added a comment. > > * 1105: Can we make it here with `err' set to anything but zero? ACCEPT: I've removed the else clause at 1100, so now we can. :-) > > * 1127: Why is m_min_sdu set to zero rather than > iptun->iptun_typeinfo->iti_minmtu? ACCEPT: Fixed > > * 1133-1136: Seems more straightforward as: > > if ((err = mac_register(mac, &iptun->iptun_mh)) == 0) > iptun->iptun_flags |= IPTUN_MAC_REGISTERED; > mac_free(mac); ACCEPT > > * 1192: Would seem more to the point to add a `default' clause to > the `switch' staement. ACCEPT > > * 1197: This ASSERT() needs to be moved to 1193: we've > cleared CONN_INCIPIENT and dropped conn_lock, so conn_ref can be > increased by another thread. ACCEPT > > * 1243-1244: Nothing seems to rely on iptun_free(NULL) working. ACCEPT: Removed > > * 1245: Would be nice to ASSERT() IPTUN_CONDEMNED is set. ACCEPT > > * 1265: Would seem reasonable to clear IPTUN_HASH_INSERTED here. ACCEPT > > * 1339: What protects multiple simultaneous opens from both > finding iptuns_g_q clear and both proceeding to set it? > (Seems like iptuns_lock should protect this field.) ACCEPT > > * 1340: Why is the default STREAMS queue instance associated with > the credentials of the iptun_create() call? Would've expected > zone_kcred(). ACCEPT > > * 1448: EACCES usually pertains to file permissions; would've > expected EPERM (and that's done at e.g. 1503). My understanding (from existing usage and various discussions where this debate was had in the past) is that EPERM indicates that the caller doesn't have the permissions to perform an operation, and EACCES indicates that the caller is denied access to the object being operated on. My understanding would thus lead me to change 1503 to EACCES... > > * 1467, 1481: Still need to hold iptun_lock when clearing > IPTUN_CONDEMNED to guarantee the flag change will be visible to > other threads (usually done as part of the memory barrier as > part of mutex_exit()). I'm not sure I fully understand the problem you're pointing out nor the solution you're proposing. I can't hold the lock while calling dls_devnet_destroy() nor mac_disable(), so I have to drop the lock at 1464. Are you suggesting that I need to hold the lock through the calls to dls_devnet_destroy() and mac_disable() (and incur deadlock), or that I wrap the flag change itself around iptun_mutex? If it's the latter, then I'm not sure how this would matter. Other threads that see the flag set will return with an error just as they would have if they had attempted to access the iptun_t a few moments earlier, say at 1465. Can you clarify what you mean? > > * 1482: What happens if dls_devnet_create() fails? Is it > appropriate to panic here? This worked differently and quite well until 6791335 was introduced. The aggr, vnic, and simnet drivers (and now forcibly iptun) all have this feeble attempt at re-creating their universe if mac_disable() fails. The link will be unusable as a result, but the system will otherwise be usable. It's probably not worth panicing the system over (this can actually legitimately happen). > > * 1568: Extra blank line? ACCEPT > > * 1572-1611: Seems like a general-purpose iptun_getaddr() function > or the like would be handy here. ACCEPT > > * 1584, 1604: Isn't the caller (e.g., userland) responsible for > setting these flags? Seems like these would be EINVAL, not an > ASSERT(). No, the caller doesn't set the iptun_flags, iptun_setparams() does that. The iptun_t could only have had IPTUN_DST set by iptun_setparams() if its type supports a destination (iti_hasdst) (note the check for iti_hasdst in iptun_setdst()). > > * 1619: OK to access itp_flags without itp_lock held? ACCEPT: I'm guessing not seeing as ipsecconf could come in and change things (Dan?). > > * 1620: I'm confused by the name/semantics of this flag -- the > code seems to use it to indicate that there is IPsec policy, > regardless of whether ipsecconf needs to be used -- e.g., > ifconfig only advises using "ipsecconf -ln" when ITK_IPSECCONF > is set and ITK_SECINFO is clear. That's right, ITK_IPSECCONF means that there is policy covering this tunnel. ITK_SECINFO indicates that the simple policy set through ifconfig is stored in itk_secinfo. > > * 1632, 1639: Why do these functions allow an error to be returned? ACCEPT: Fixed. > > * 1658: If IPsec policy is removed entirely, what codepath leads to > iptun_update_mtu() being called? ACCEPT: Dan, how should IPsec policy get removed? > > * 1684, 1689: MATCH_IRE_DEFAULT means we won't necessarily return > the path MTU. Probably fine, but the result is not the path MTU. Hmm, I guess I could use ire_ctable_lookup() instead, but it's probably not worth changing given that datapath refactoring is coming right behind and rewriting all of this using a different mechanism. > > * 1701: The function name traverse_rules() seems to have little to > do with what this function does (find the highest overhead). ACCEPT: Renamed to iptun_max_policy_overhead(). > > * 1714: Something like iptun_get_ipsec_overhead() would be more > consistent with other names. ACCEPT > > * 1736-1750: Seems like a utility function (in IPsec-land) that > initializes an ipsec_selector_t from a provided IPv4/IPv6 > address and protocol would be useful. DEFER: The savings are minimal, I'll defer this to another time when refactoring selector initialization in spd.c could be more safely done. > > * 1744: Takes as many lines of code to just do the obvious > assignment of B_FALSE ;-) ACCEPT > > * 1768: "For now" meaning there's something else we should ideally > be doing? Perhaps the comment could be expanded? ACCEPT: I don't know what Dan meant by "For now" (Dan?). > > * 1775: Could just pass 0 here rather than ipsec_ovhd. ACCEPT > > * 1804, 1817: FWIW, there is a large project underway to remove > direct use of lbolt; would advise using ddi_get_lbolt() instead. > (Likewise with the usage in iptun_impl.h.) ACCEPT > > * 1875: The comment is obvious from the code -- would be more > useful if the comment said why. ACCEPT /* * We only dynamically adjust the tunnel MTU for tunnels with * destinations because dynamic MTU calculations are based on the * destination path-MTU. */ > > * 1902, 1942: Should a counter be bumped? ACCEPT: Will bump oerrors > > * 1899-1907, 1939-1947: Could have a utility routine for this. ACCEPT > > * 2030: Worth ASSERT()ing that both ipha and ip6h aren't non-NULL > at the same time? ACCEPT > > * 2069: Would be worth explaining why we don't need to worry about > packets that may not have an IP header's worth of data in their > first mp. ACCEPT /* * Don't bother handling packets that don't have a full IP header in * the fist mblk. For the input path, the ip module ensures that this * won't happen, and on the output path, the IP tunneling MAC-type * plugins ensure that this also won't happen. */ > > * 2095: Why MBLKL() again rather than using first_mblkl? ACCEPT: No reason > > * 2097: What ensures this is safe -- e.g., what prevents > mp->b_cont->b_rptr from being a zero-byte message? Moreover, it > seems dangerous to provide the caller with a pointer to a header > that may be incomplete. ACCEPT: Will ensure that there's a complete inner header that inner* points to. > > * 2141: Ultra-nit: s/.)/)./ ACCEPT > > * 2151: ASSERT(MBLKL(data_mp >= 0)) would be useful. Also, an > explanation of why this is safe. ACCEPT: /* * The ip module ensures that ICMP errors contain at least the * original IP header (otherwise, the error would never have made it * here). */ > > * 2154: So something has already sanity-checked the inner packet > to ensure it's not garbage we'll misinterpret? Would be good to > explain those guarantees here. ACCEPT: See above. The error is here because ip sifted the protocol value out of the included IP header. > > * 2273: ASSERT(endptr < mp->b_wptr) would be useful, as would a > comment on why this is safe. I actually can't ASSERT() that. It doesn't look like ip_find_hdr_v6() guarantees that. It only guarantees that ipp_dstopts (or ipp_rtdstopts) points at the start of a destination options header. I need to return B_FALSE if I encounter this condition, and the code currently behaves that way due to the condition in the while loop. > > * 2276: Something guarantees that the resulting encaplimit option > entirely within the bounds of the packet, right? ACCEPT: Need to do this check against endptr. > > * 2300-2410: Same comments apply as iptun_input_icmp_v4(). ACCEPT > > * 2429: Per PSARC/1997/198, this should be norcvbuf, not ierrors. ACCEPT: I updated iptun_m_getstat() to include these, and fixed a number of other locations where we dropped packets due to conditions other than erroneous packets on both receive and transmit (noxmtbuf). > * 2439, 2444: Comment explaning why these casts are safe would be > useful. ACCEPT > > * 2507: Block comment explaining the various types of M_CTL/M_DATA > conventions expected here would be useful -- e.g., M_CTL with an > M_CTL b_cont is a protected ICMP packet, but M_CTL with an > M_DATA b_cont is just a protected IP packet? Or something else? > The reader shouldn't need to have a deep knowledge of these > obscure conventions to understand this code. ACCEPT: I've added the following to the intro. block comment for iptun_input(): * There are two kinds of packets that can arrive here: (1) IP-in-IP tunneled * packets and (2) ICMP errors containing IP-in-IP packets transmitted by us. * They have the following structure: * * 1) M_DATA * 2) M_CTL[->M_DATA] * * (2) Is an M_CTL optionally followed by M_DATA, where the M_CTL block is the * start of the actual ICMP packet (it doesn't contain any special control * information). * * Either (1) or (2) can be IPsec-protected, in which case an M_CTL block * containing an ipsec_in_t will have been prepended to either (1) or (2), * making a total of four combinations of possible mblk chains: * * A) (1) * B) (2) * C) M_CTL(ipsec_in_t)->(1) * D) M_CTL(ipsec_in_t)->(2) > > * 2546: Why are these casts needed? I knew you'd ask that when I added these! ;-) Both sides of the ':' need to have the same type, otherwise, the compiler complains: "../../common/inet/iptun/iptun.c", line 2589: operands have incompatible types: pointer to unsigned int ":" pointer to struct in6_addr {union {..} _S6_un} > > * 2665-2667: AKA `return (outer4->ipha_src != outer4->ipha_dst);' ACCEPT > > * 2679: How does this differ from iptun->iptun_typeinfo->iti_minmtu? iti_minmtu reflects the minimum MTU of the IP tunnel link, which can have either IPv4 or IPv6 interfaces on top of it. The minmtu variable here reflects the minimum IP MTU of the IP version we're actually transmitting with this packet. > > * 2788: Would be good to ASSERT() limit is inside mp (looks to be > valid based on the current implementation). ACCEPT > > * 2778: s/exceede/exceeded/ ACCEPT > > * 2841: Per PSARC/1997/198, this should be noxmtbuf, not oerrors. ACCEPT > > * 2863: Some of the reasons for mp == NULL should bump noxmtbuf, > not oerrors. ACCEPT: I've pushed the bumping of the stat into iptun_out_process_*(). > > * 2870: OK to access itp_flags without itp_lock held? Yes, I verified this with danmcd: If the ITPF_P_ACTIVE flag changes, this results in a harmless race whereby a packet may be transmitted using different policy than what was ultimately intended (which would have happened anyway if the packet had made it here a few ticks earlier). > > * 2871: Hoist ASSERT() to e.g. 2866? If this is not true, how is > the msgdsize() at 2895 safe? ACCEPT: I've moved this up to the top of iptun_output(). > > * 2881: What cases cause us to get here with b_next != NULL, and > why do those conditions only apply when IPsec policy is active? > Would be nice to add a comment explaining this. ACCEPT: Modified the comment at 2878 to: /* * ipsec_tun_outbound() returns a chain of tunneled IP * fragments linked with b_next (or a single message if the * tunneled packet wasn't a fragment). Each message in the * chain is prepended by an IPSEC_OUT M_CTL block with * instructions for outbound IPsec processing. */ > > * 2891: Where is this done? It's done by ip's processing of the packet in ip_wput_ire(). I've clarified the comment slightly: /* * The ip module will potentially apply global policy to the * packet in its output path if there's no active tunnel * policy. */ > > * 2921-2934: Would be more compact and clear to use C99 > initializers -- e.g.: > > static mac_callbacks_t iptun_m_callbacks = { > .mc_callbacks = (MC_SETPROP | MC_GETPROP), > .mc_getstat = iptun_m_getstat, > .mc_start = iptun_m_start, > .mc_stop = iptun_m_stop, > .mc_setpromisc = iptun_m_setpromisc, > .mc_multicst = iptun_m_multicst, > .mc_unicst = iptun_m_unicst, > .mc_tx = iptun_m_tx, > .mc_setprop = iptun_m_setprop, > .mc_getprop = iptun_m_getprop > }; ACCEPT Thanks, -Seb