> I'd like to ask for a first round of review for the ipobs bits please.
> I've posted the webrev here:
>
> http://cr.opensolaris.org/~philk/ipobs/
>
> It can be found internally here including cscope:
>
> /net/exorcist.uk/storage/builds/ipobs-only/webrev/
Phil,
My first round of feedback is below. Note that I only was able to
allocate 2 days for this, so I wasn't able to get through everything, but
hopefully it'll get things started. Apologies for mixing in trivialities
like formatting issues alongside possible panics in my feedback, but I
figured I should report everything that caught my eye.
General:
* The data structures and algorithms in ipnet.c need immediate
attention to ensure acceptable performance. See my comment
for ipnet.c.
* Similarly, there are a number of changes to performance-critical
codepaths where e.g. IRE lookups are done just for ipnet but
without checking whether ipnet is in-use. These must be reworked.
* Careful attention needs to be paid to context and locking.
I found a number of places where locks were held across
callbacks or put(), or where sleeping allocations were done
from contexts that cannot block, but I may have missed some.
* Why are ipnet and lo being installed as both a STREAMS module
and a driver? From the source code, they appear to only be
used as drivers. Seems like the relevant packaging files
should be updated, and the Makefiles should be simplified to
only build them as drivers.
* Please look for all places where we need to account for the new
DL_IPNET type. For instance, dladm_media2str() should handle
DL_IPNET, as should dl_mactypestr() and dlpi_mactype().
* It seems unfortunate to not be able to make use of the GLDv3
framework for the ipnet and lo devices, though a case can be
made that they're special enough to warrant being standalone
(e.g, we'd others need to hide them from show-link), provided we
can keep the source simple.
cmd/cmd-inet/usr.sbin/snoop/snoop.c:
* Shouldn't -d and -I be mutually exclusive? I'd expect an
error to result if both are used.
* 84: For consistency with line 83, remove "= 0".
* 772, 776: I'd prefer to rework this message so that -d
is described as "listen on named link", and -I
is described as "listen on named IP interface".
cmd/cmd-inet/usr.sbin/snoop/snoop.h:
* 284: Comment needs to be updated to document new fields.
(Also, why did we add these fields if they always have
the value ETHERTYPE_IP/ETHERTYPE_IPV6? Is this something
that was intended to be added as part of IP tunneling?)
cmd/cmd-inet/usr.sbin/snoop/snoop_capture.c:
* 56: What necessitated the <limits.h> #include?
* 93, 198: Please, no new externs in .c files!
* 156-160: Would prefer to use a variable for the flags to
pass to dlpi_open() and OR in DLPI_DEVIPNET.
* 201, 240-244: As per libdlpi comments below, would prefer
to have this done via dlpi_open() flags.
* 446-449: Seems like this was added to deal with a wraparound
case, but I'm not sure why di_max_sdu would ever be larger
than 2^32-64. Please explain.
cmd/cmd-inet/usr.sbin/snoop/snoop_ether.c:
* 51: What necessitated the <limits.h> #include?
* 55-63: Seems like the BE_64() macro in <sys/byteorder.h>
should already do what you want.
* 64-66: This looks like a mismerge where a number of IP tunneling
functions have crept in.
* 76-98: Likewise, looks like IP tunneling changes?
* 1607, 1616-1617, 1636-1644: I'd expect us to generally translate
the zone ID into a zone name and also include the zone ID, like
we do with hostnames. (Seems like we'd need a way to turn off
that mapping for the case where snoop is reading from a capture
file that was generated on another machine, though.)
* See subsequent comments about dli_type.
* 1614, 1634, 1674: Why is the `ip_type' variable needed?
* 1615: What does ETHERMTU have to do with this code? And what
do lines 1622-1626 guard against?
* 1618-1619: Hoist assignments up to line 1612-1613.
* 1661: Seems to be an extra argument to this function.
* 1672: What is the point of this memcpy()? For Ethernet, we do
this to ensure the data is 32-bit aligned -- but here we don't
seem to be adjusting `off' so it seems the data must already
be aligned. Thus, why can't we just skip the memcpy() and pass
`off' directly into interpret_ip()/interpret_ipv6()?
cmd/cmd-inet/usr.sbin/snoop/snoop_filter.c:
* 923-958: This is just code cleanup, right? I don't see any
substantive changes here.
* 1342-1343: There naked 0x0104 and 0x0106 constants are hard
to understand. Would prefer something like
(DL_IPNETINFO_VERSION << 8 | IPV4_VERSION), even if it leads
to line wraps.
* 1363-1379: Code changes here would be simpler if *mtp was
renamed to match_types[], and then a stack-local
"match_type_t *mtp = match_types[index]" was added.
* 1398: Are we confident that there are no datalink types that
report something other than DL_ETHER but still have packets that
look like Ethernet such that the matching should be applied?
cmd/truss/codes/codes.c:
* 991: Move the ioctl to be near the existing DL* ioctl printing
code.
lib/libdlpi/common/libdlpi.h:
* 66: s/link/IP DLPI link/ in the comment.
lib/libdlpi/common/libdlpi.c:
* 134: Restore blank line.
* 986: I'm confused by what's meant by "or "/dev/ip" node".
* 1086: Restore blank line.
* dlpi_walk() changes are still needed.
* Seems like there should be some way to specify that
DL_IOC_IPNETINFO is desired via dlpi_open(), like we do for
DLPI_NATIVE and DLPI_RAW. Maybe a DLPI_IPNETINFO flag?
* Some of the flag combinations are invalid and should lead to
errors rather than implementation artifacts (at least for
publically defined flags). For instance, DLPI_DEVONLY |
DLPI_DEVIPNET should be an error.
lib/libsecdb/exec_attr.txt:
* 202: With this change, we're asserting that the only privilege
snoop needs on Solaris 10 and later is net_observability (since
I believe this will override the "suser" entry on line 201).
But I thought PRIV_NET_OBSERVABILITY only worked for /dev/ipnet
and /dev/lo0 -- so don't we also need PRIV_NET_RAWACCESS in the
Network Management case (and perhaps other privileges)?
* 203: This line should be after line 204 so that all the
Network Management lines remain grouped together.
lib/libsecdb/prof_attr.txt:
* 58: Need a help=foo.html entry (and the associated html file).
pkgdefs/SUNWckr/prototype_{i386,sparc}:
* See previous comments about ipnet/lo being only drivers, not
STREAMS modules.
uts/common/fs/dev/sdev_ipnetops.c:
* 29-32: This comment is clearly incorrect (references pts).
Please update it to provide an overview of the functionality,
similar to the one in sdev_netops.c
* 44: Do we actually use vfs_opreg.h?
* 50: ipnet_if_getby_name() prototype needs to be moved to
<inet/ipnet.h>
* 52-56: Remove; nothing calls this function.
* 61: Remove unused ARGSUSED directive.
* 65: Remove and just use dv->sdev_name directly in the call
to ipnet_if_getby_name().
* 78: Remove ARGSUSED and remove unused arguments from the
function signature.
* 88: What guarantees that `ipn' won't go away while we're
using it? Seems like the lookup needs to acquire a hold, but
how long does the ipnetif_t's dev_t need to stay valid for?
* 89, 102, 144: Remove blank line.
* 90, 92, 197, 199: Remove braces.
* 94-101, 162-168: Need a helper function to share this logic.
* 148: Function name isn't appropriate; recommend
devpinet_filldir_entry()
* 150: Remove cast.
* 152: Needless initialization.
* 184: It doesn't seem like devipnet_filldir() will correctly
handle removing dead entries. I think we need to walk the
list of sdev nodes and prune out the dead ones like devnet
(and devpts) do.
* 184: Separately, since this is a heavyweight operation and the
set of ipnet nodes will rarely change, consider an approach
similar to the SDEV_BUILD / dls_devnet_build() logic so you can
use the cached sdev nodes whenever possible.
* 205-207: Update comments to match code.
uts/common/inet/ip_impl.h:
* 510: Explicitly test != NULL.
* 514: Should the third argument really be NULL? Seems like
it'd be MBLK_GETLABEL(mp).
* 515: If sire != NULL and ire_type == IRE_CACHE, we forget to
ire_refrele().
uts/common/inet/ip_stack.h:
* 418: Custom is to cite the corresponding .c file.
* 419-420: Fix indenting.
uts/common/inet/ip/ip_multi.c:
* 86, 1331: I'm not convinced this is the function you want (see
ipnet.c comments), but assuming it is, you need to remove line
86 and place the function prototype into ip_multi.h. Also, if
you do end up keeping this change, it'd be good to update the
block comment above te function, and the corresponding comment
in ip_wput_ctl(): the SAP is not allocated in ill_create_dl(),
and therefore ip_wput_ctl() does not strip it.
uts/common/inet/ip/ip_if.c:
* 18570: No need for comment.
* 22009: s/a NE_SET_ZONE/an NE_SET_ZONE/
uts/common/inet/ip/ip_netinfo.c:
* Convention is that ip_stack_t arguments should be named `ipst'.
* 103, 105, 1278, 1311: Shouldn't these be `static'?
* 1285, 1317: Why not just pass in `isv6' as an argument instead
of the family?
* 1321: B_FALSE is not a uint64_t. See neti.h comments.
* 1323: Referencing ipif_flags after releasing ipif could lead to
panic. Need to save ipif_flags in a uint64_t before releasing.
uts/common/inet/arp/arp_netinfo.c:
* 66-67: Seems like the convention is to have routines fail
explicitly if they're not supported (e.g., arp_lifgetnext()).
uts/common/Makefile.files:
* 928, 965: Looks like a mismerge with SMB here.
uts/common/sys/dlpi.h:
* 259: DL_IPNET is not part of the DLPI standard, so it needs to
have a constant above 0x80000000 as per the spec (see the block
starting at line 264.
* I expected to see DL_IOC_IPNETINFO here, rather than buried off
in <inet/ipnet.h>. Is there a reason it shouldn't be moved to
here? (If so, please clean up places like truss/codes.c since
they'll no longer need to #include <inet/ipnet.h>.) Also,
DLIOCIPNETINFO seems more consistent with the other "DLPI"
ioctls. (Yes, we'd need to submit a PSARC fasttrack to cover
these changes.)
uts/common/sys/neti.h:
* 128, 207: net_getlif_flags() name seems a bit inconsistent --
expected net_getlifflags() (I realize we ARC'd the other
name, but it's not too late to change.)
* 207: net_getlif_flags() signature isn't sufficient since there's
no way to return an error. Instead, the flags field probably
needs to be passed as a pointer, and the return value needs to
be used to report errors.
uts/common/io/neti.c:
* 409, 418: Remove needless casts, and reformat code to minimize
line breaks.
uts/common/inet/ipnet.h:
* 33: Comment here could be more informative. Further, each of
the structures should have comments explaining the its purpose,
the purpose of each field (perhaps inline into the structure
definition), and locking rules.
* 48-49: See earlier comments on the location of DL_IOC_IPNETINFO;
same applies to DL_IPNETINFO_VERSION and dl_ipnetinfo{,_t}.
More generally, this header should not be needed by any
applications and as such should be guarded by _KERNEL.
* 51-53: As before, I don't see a need for these and the
IPVx_VERSION macros.
* 60: Please use a precise typedef so that lint can do proper
function pointer checking.
* 68, 76: I'm a bit confused about ihd_family and dli_type. First,
I'd expect these fields to have similar names and be the same
width (right now, one is 16 bits, the other is 8) since they're
representing the same concept. Second, it our PSARC materials
called this field dli_ipver, which seems on-point. Is there a
reason that name is no longer appropriate? Third, assuming it
is the version number, can we just use the existing IPV4_VERSION
and IPV6_VERSION constants? (They should probably be move to
a public header file.)
* 71: Seems like ihd_fastpath should be a boolean.
* 99: What is the point of ifa_status? Seems to be set but
never used.
* 100: Likewise, ifa_zone appears to be set but never used.
* 101: Should ifa_id be of type lif_if_t?
* 110: Do we need `if_minor' given that it's equivalent to
getminor(if_dev)?
* 111: Would be simpler to change `*if_name' to if_name[LIFNAMSIZ]
and remove the accompanying dynamic allocation.
* 120: Add missing `extern' and remove `cb' argument declaration.
* 122: For clarity, argument names should match the names of the
ihd fields -- mp, htype, zsrc, zdst, ...
* 122-148: Need to parenthesize macro arguments for safety.
* 124: Explicitly check against NULL.
* 134: Need to handle dupmsg() failure -- fallback to copymsg(),
then bail out entirely. Also, no need for `ipnetmp1' -- can
directly assign to ihd_mp.
* 144: Please correct me if I've missed something here, but it
seems like we'll do a fair amount of allocations here before
we've even determined that the callback wants the packet.
Minimally, I think we should have the callback indicate what
ihd_htype it's interested in so that e.g. lo_input() doesn't
end up throwing away tons of packets it's not interested in.
* 144: I'm very nervous about invoking a callback while locks
(ips_ipnet_cb_lock) are held. Would prefer to:
* Have an ips_ipnet_cb_nwalkers field which indicates
the number of threads walking ips_ipnet_cb_list.
* Have ips_ipnet_cb_nwalkers changed and checked for
under the lock.
* Have ips_{un,}register_hook() wait on a
ips_ipnet_cb_cv condition variable if nwalkers is
non-zero, then update ips_ipnet_cb_list while holding
the lock to be sure no new walkers can appear.
* 144: As per above, please make sure that none of the
IPNET_HOOK() call sites are holding locks.
* 150: The IPNET_HOOK_* enum should be named so that debuggers
can make use of it (ipnet_hook_type_t?), and ihd_htype should
be updated to use that type.
uts/common/inet/lo/lo.c:
* Many of these comments also apply to ipnet.c. (In general,
it'd be nice to get better sharing between those two source
files, but it might make the code more confusing.)
* Various hardcoded uses of 6 need to be changed to a constant
(probably ETHERADDRL).
* 30: Would be nice to expand upon this comment a bit with
an overview of the provided functionality.
* 69: Why can't lo_input be `static'?
* 70-71: Remove argument names.
* 103-104: NULL members can be elided.
* 107: D_NEW does nothing and can be removed. Moreover, just
embed these flags directly in the DDI_DEFINE_STREAM_OPS()
invocation.
* 107: Are we sure D_MTPERMOD is OK here? e.g., if this really
was used for loopback traffic (rather than just siphoning off
IP's traffic via lo_input()) it seems like it could be an issue.
I see loopback functionality in lowput(), but that appears to
just be there so that lo_input() can send packets downstream
instead of upstream (see comments on that later).
* 122: Remove unnecessary cast.
* 125: #define of IP_LOOPBACK_MTU needs to be shared from a
header file (I see there's already one in ip_if.h, though
perhaps it needs to be moved to somewhere else). Also, the
one in ip_if.h is 8K, whereas this one is 8K + 60 bytes.
Why the difference?
* 162: Please add comments for these file-wide variables.
Also, do we really need ip_cb_set? It seems to mirror
lostr_head != NULL.
* 165: Seems like it'd be more useful to the drops for each
lostr_t -- i.e., maybe just a lodrops field in the lostr_t?
* 167-175: Seems like the BE_64() macro in <sys/byteorder.h>
should already do what you want.
* 177-179: Remove unused constants.
* 180-181: Rename to LOIPNETRAW/LOIPNETINFO.
* 182: Remove blank line.
* 190, 204: Standardize on either `rc' or `error'.
* 192-193, 224-225, 283-284: Merge onto single line.
* 221, 223: Remove braces.
* 229-230: No need to print a warning or to call
ddi_remove_minor_node(), but you do need to return DDI_FAILURE.
* 232: Upon success, I'd expect us to save the devi in `lodevi'
so that the information can be returned by
DDI_INFO_DEVT2DEVINFO.
* 236: No need for /* ARGSUSED */
* 253: `register'?
* 257: Need to implement DDI_INFO_DEVT2DEVINFO, as per above.
* 275, 287, 292: Remove `minor' and assign directly to lominor.
* 332-341: Move to line 345.
* 370: lo_iocdata() would be clearer.
* 386: Remove needless cast.
* 388: Hoist initialization to line 385.
* 398-408: Why do we need both TRANSPARENT and non-TRANSPARENT
support for DL_IOC_IPNETINFO? Seems like this could be replaced
with a `goto miocnak'.
* 422: Remove needless cast.
* 425: Hoist initialization to line 422.
* 450-451: Hoist initializations to 447-448.
* 473: Seems odd to support DL_PROMISCON_REQ not but
DL_PROMISCOFF_REQ (I realize snoop sends only the former,
but that seems like an implementation artifact).
* 499: Remove blank line.
* 500: This seems to leave the physical address and broadcast
address values pointing to garbage. Assuming this means that
no one is relying on that information, it'd seem simpler to
just set dl_brdcast_addr_offset and dl_addr_offset to zero,
and *just* allocate a dl_info_ack_t without the trailing gorp.
* 547: `head' seems a strange name given that it rarely points
to the head.
* 553: dupmsg() can fail. Need to fallback to copymsg() -- and
that can also fail. However, more importantly, in the common
case where there's only a single lostr_t, why is another
allocation needed at all?
* 560: `ourmsg' is leaked upon failure.
* 577: Why put() instead of putnext()? Seems like we're wasting
time going through lowput().
* 551-583: Restructuring this to a `for' loop and reworking
the formatting would improve readability considerably -- e.g.
(incorporating issues above, but not the suggestion to filter
ihd_type in IPNET_HOOK()):
for (; lostr != NULL; lostr = lostr->lonext) {
if (ihd->ihd_htype != IPNET_HOOK_LOCAL)
continue;
if (lostr->lonext == NULL) {
netmp = ihd->ihd_mp;
ihd->ihd_mp = NULL;
} else {
if ((netmp = dupmsg(ihd->ihd_mp)) == NULL &&
(netmp = copymsg(ihd->ihd_mp)) == NULL) {
lo_drops++;
continue;
}
}
if (lostr->lomode & IPNETINFO) {
dl_ipnetinfo_t *dl;
hmp = allocb(sizeof (dl_ipnetinfo_t), BPRI_HI);
if (hmp == NULL) {
freemsg(netmp);
lo_drops++;
continue;
}
dl = (dl_ipnetinfo_t *)hmp->b_rptr;
dl->dli_version = DL_IPNETINFO_VERSION;
dl->dli_len = htons(sizeof (*dl));
dl->dli_type = ihd->ihd_family;
dl->dli_srczone = BE_64((uint64_t)ihd->ihd_zsrc);
dl->dli_dstzone = BE_64((uint64_t)ihd->ihd_zdst);
hmp->b_wptr += sizeof (*dl);
hmp->b_cont = netmp;
netmp = hmp;
}
putnext(RD(lostr->lowq), netmp);
}
* 593: Add blank line to separate declaration from code.
* 601: Remove needless cast.
uts/common/inet/ipnet.c:
[ Only a very high-level review of this file has been done since
it will be affected by other suggested changes. I've only
brought up general things that need to be thought about or
worked through before another round of review, and weren't
covered by other comments. ]
* ipnetif_arr[] and the functions around it need to be reworked;
it's unreasonable to have every lookup have to walk through
2048 entries looking for a match. Would suggest using the
kernel AVL data structures like we do for phyint_t's (e.g.
see phyint_list_by_{name,index}). Alternatively, it'd be worth
thinking about whether this data structure can be directly tied
to an ill_t so that no explicit lookup is needed, but that would
probably mean tighter coupling with the rest of IP
* If we're going to lookup things by interface index, we need a way
to track some joker changing the interface index via ifconfig.
* 65: Not sure what DLPI_ANY_SAP is needed for; the DLPI spec
doesn't seem to allow such a case.
* 377-434, 442-484: Considerable duplication between these two
calls; factor into a function.
* 498: Would it be hard to allow detach?
* 942-991: I'm very suspicious of this DL_ENABMULTI_REQ /
DL_DISABMULTI_REQ logic since we never fill in the physical
multicast address being joined. What are we trying to
accomplish here? Note that ip_join_allmulti() uses
DL_PROMISCON_REQ/DL_PROMISC_MULTI to tell the driver to
receive all layer 2 multicast traffic; is that what's
desired? Or something else?
* 1022-1128: In addition to the issues that overlap with the
lo_input() issues raised above, it looks like this code is
holding locks across put(), which will lead to deadlock.
(However, I can't see why the lock is needed given that
ipnet is D_MTPERMOD and ipnet_head only changes from
ipnet_open() and ipnet_close()).
* 1134: Using DDI_SLEEP here seems unworkable; we may be in
interrupt context.
* 1383-1493: ipnet_addrchg_ev() needs a lot of work. First,
there's a massive amount of duplication. Second, I'm completely
confused by the `name' logic, which appears to kmem_alloc() a
buffer only to later kmem_free() it (and possibly crash in the
middle if memory isn't available.
* 1495-1509: Are we planning to do something with interface
up/down events?
* 1511-1558: I'd think we'd want to do something with the
address up/down events, but as per my earlier comments on
ifa_status, I don't see anything that uses this information.
* 1560-1635: Similar comments to ipnet_addrchg_ev(), with the
addition that I don't see anything using the ifa_zone field
that this function sets.
* 1638-1678: Remove loads of spurious braces.
* 1680-1696: See previous comments regarding performance.
This brute-force algorithm won't suffice, especially given
that only a few ipnet nodes will be in-use at a given time.
uts/common/inet/tcp.h:
* 301: Comment new field. Also, insert a blank line after tcp_lso
to preserve existing convention of visually grouping each nibble.
uts/common/inet/tcp.c:
* 19221-19231: This IRE lookup adds an unreasonable performance
penalty in the case where ipnet_ips_cb_list is NULL. Minimally,
need to find an elegant way to limit it to the case where we
have ipnet listeners, and ideally need to eliminate it entirely.
uts/common/inet/udp.c:
* 86: Why is <sys/ethernet.h> needed?
* 6359-6368: See performance comments for tcp.c. Also, given that
this code exists in a few places (tcp.c, ip_impl.h), need to
factor it out.
uts/{intel,sparc}/os/device_policy (line numbers correspond to sparc):
* 77: What's the difference between "ipnet:*" and "ipnet"?
* 77: Add missing tab so that entries line up.
* 79: Add blank line.
uts/{intel,sparc}/os/name_to_major:
* Where's the entry for `lo'?
* Why is there a gap in the number sequence between the previous
highest number and the number for `ipnet'?
uts/{intel,sparc}/lo/Makefile (line numbers correspond to sparc):
* 48, 63, 105-108: As per general comments, why do we need the
$(ROOTLINK) stuff here? Seems like this should only be a
STREAMS driver.
uts/{intel,sparc}/ipnet/Makefile (line numbers correspond to sparc):
* 48, 63, 111-114: Same comment as lo/Makefile.
uts/i86pc/os/startup.c:
* 1440: Could you explain this change?
uts/intel/Makefile.intel.shared:
* 51-52: What uses these IPLO_DRV and IPNET_DRV definitions?
uts/sparc/Makefile.sparc.shared:
* 49-50: Likewise for IPLODRV_DIR and IPNETDRV_DIR.
pkgdefs/SUNWckr/prototype_com:
pkgdefs/SUNWcsd/prototype_com:
pkgdefs/SUNWhea/prototype_com:
pkgdefs/common_files/i.minorperm_i386:
pkgdefs/common_files/i.minorperm_sparc:
Targetdirs:
lib/brand/native/zone/platform.xml:
lib/brand/sn1/zone/platform.xml:
uts/common/os/priv_defs:
uts/common/sys/netstack.h:
uts/common/os/priv_defs:
uts/common/sys/hook_event.h
uts/common/sys/fs/sdev_impl.h:
uts/common/Makefile.rules:
uts/common/fs/dev/sdev_subr.c.s:
uts/common/inet/ip.h:
uts/common/inet/Makefile:
uts/common/inet/lo/lo.conf:
uts/common/inet/ipnet/ipnet.conf:
uts/{intel,sparc}/dev/Makefile:
* Reviewed, no comments.
cmd/cmd-inet/usr.sbin/snoop/snoop_pf.c:
uts/common/inet/ip/ip.c:
uts/common/inet/ip/ip6.c:
uts/common/inet/tcp/tcp_fusion.c:
* Skipped for now.
--
meem