We would like to start the IP datapath refactoring project.
We are requesting endorsement from the networking community.

Thanks,
    Erik


-- OPENSOLARIS PROJECT PROPOSAL --


Project Name:
        IP Datapath Refactoring

Project Synopsis:
        Simplify the IP Datapaths to make them more understandable and evolvable


Project Purpose:

The IP datapaths are extremely hard to follow both at the micro level
(ip_output_options, ip_wput_ire, and ip_input) and at the macro level
(an outbound packet needing IPsec and ARP resolution goes through an odd
number of steps).

That makes it hard even to fix bugs in that code, let alone get it to
perform well. As a result, performance has been improved by creating
numerous fast paths, which are subsets of the full datapaths. This makes
maintenance of the code even more hazardous.

The root cause of the complexity is that ip_newroute introduces
asynchrony in the wrong part of the code. Traditionally ARP is done at the
very bottom of the IP output side, but to avoid a separate ARP table 
lookup Solaris has an IRE_CACHE entry which is created to include the 
ARP information. This is done early in ip_output because the IRE_CACHE 
is also used to pick an IP source address in some cases (unconnected UDP 
and RAWIP sockets) and we need the IP source address early (before doing 
IPsec etc).

We need to move the ARP-related asynchrony to the bottom of IP output to
make the output datapaths more sane, and it also makes sense to
disassociate source address selection from routing/IRE lookup. (In 1991
source address selection was simpler than it is today, so the
association made some sense. But with IPMP, IPv6, shared-IP zones, etc.,
source address selection can't simply be associated with the route.)

A side effect of ip_newroute is that we need to carry various 
information from the transport protocols to the point after ip_newroute 
is done. We've created various ways to put this information in the 
messages so that they can be queued with the packets waiting for ARP 
resolution; the ip6i_t exists for this purpose, as do the
ipsec_{in,out}_t structures, which are currently used for more than just
IPsec. There are also ad-hoc places where we scribble information
(b_prev, etc).

Note that the ip6i_t and M_CTL are also used to carry information
between the transport protocols and IP (for both the input and output
paths). But since FireEngine in S10 introduced direct function calls
between the transports and IP, we are no longer limited to passing a
message using putnext. Hence we can relatively easily add function call
arguments up and down between the transports and IP and have those
arguments carry the meta-data associated with the packet. (An example of
meta-data is that on the receive side the transports need the incoming
interface - the ill_t - to handle IP_RECVPKTINFO and IPv6 link-local
addresses correctly.)
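
To make the idea concrete, here is a rough user-space sketch of passing
receive-side meta-data as a function call argument instead of a prepended
M_CTL message. Everything with a sketch_ prefix is made up for
illustration; it is not the actual ip_recv_attr_t layout or any real
transport entry point, just one way the shape of the call could look.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the incoming interface (the ill_t). */
typedef struct sketch_ill {
	uint32_t ifindex;
} sketch_ill_t;

/* Hypothetical receive attributes; the real structure is not shown here. */
typedef struct sketch_recv_attr {
	const sketch_ill_t *ill;	/* interface the packet arrived on */
	uint32_t flags;
} sketch_recv_attr_t;

#define	SKETCH_IRAF_IS_IPV4	0x1u

/* What a transport might hand to the application for IP_RECVPKTINFO. */
typedef struct sketch_pktinfo {
	uint32_t ifindex;
} sketch_pktinfo_t;

/*
 * Transport input routine: the meta-data arrives as an argument rather
 * than as an extra message block in front of the packet.
 */
static void
sketch_udp_input(const uint8_t *payload, size_t len,
    const sketch_recv_attr_t *ira, sketch_pktinfo_t *info)
{
	assert(ira != NULL && ira->ill != NULL);
	(void) payload;
	(void) len;
	info->ifindex = ira->ill->ifindex;
}

/*
 * IP input routine: fills in the attributes on the stack and calls the
 * transport directly. Returns the ifindex the transport recorded.
 */
static uint32_t
sketch_ip_input(const sketch_ill_t *ill, const uint8_t *payload, size_t len)
{
	sketch_recv_attr_t ira = { ill, SKETCH_IRAF_IS_IPV4 };
	sketch_pktinfo_t info = { 0 };

	sketch_udp_input(payload, len, &ira, &info);
	return (info.ifindex);
}
```

Since the attributes live on the caller's stack, nothing has to be
allocated, prepended to the packet, or stripped off again by the
transport.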

Having looked at the dependencies that unravel when ip_newroute is
removed, it turns out that the whole concept of IRE_CACHE isn't needed
any more. We can do more efficient caching (and S10 already does for
TCP) by caching the IRE and NCE (neighbor cache entry containing ARP
information) in the conn_t.

This results in the removal of
        ip_newroute*
        IRE_CACHE
        ip6i_t
        M_CTL usage, including ipsec_out_t and ipsec_in_t
        Various b_prev usage in the ip_input side
and the addition of
        ip_xmit_attr_t - the transmit attributes passed to ip_output
        ip_recv_attr_t - receive attributes passed up to the ULP (and used
                internally in IP)
        A new way to track dependencies when IREs are added and removed
        Using nce_t for ARP information (we do this partially today; mostly
                for the IPv4 forwarding paths)
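
As an illustration of per-connection caching, here is a small user-space
sketch in which a connection remembers its route entry (and through it
the neighbor entry) and revalidates it against a table-wide generation
number. The generation counter is an assumption made for this sketch
only; the proposal does not say how the new dependency tracking works.
All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical neighbor cache entry holding the resolved link address. */
typedef struct sketch_nce {
	uint8_t lladdr[6];
} sketch_nce_t;

/* Hypothetical route entry pointing at its neighbor entry. */
typedef struct sketch_ire {
	uint32_t dst;
	const sketch_nce_t *nce;
} sketch_ire_t;

/* Hypothetical routing table with a change counter (an assumption). */
typedef struct sketch_table {
	const sketch_ire_t *ires;
	size_t count;
	uint32_t generation;	/* bumped whenever a route is added/removed */
	unsigned lookups;	/* counts full lookups, for illustration */
} sketch_table_t;

/* Per-connection cache, standing in for fields in the conn_t. */
typedef struct sketch_conn {
	const sketch_ire_t *ire;
	uint32_t ire_gen;
} sketch_conn_t;

/* Full (slow) lookup: a linear scan is enough for a sketch. */
static const sketch_ire_t *
sketch_ire_lookup(sketch_table_t *t, uint32_t dst)
{
	t->lookups++;
	for (size_t i = 0; i < t->count; i++) {
		if (t->ires[i].dst == dst)
			return (&t->ires[i]);
	}
	return (NULL);
}

/* Called whenever a route is added or removed. */
static void
sketch_table_changed(sketch_table_t *t)
{
	t->generation++;
}

/* Return the cached entry if still valid, else look it up and cache it. */
static const sketch_ire_t *
sketch_conn_route(sketch_conn_t *c, sketch_table_t *t, uint32_t dst)
{
	assert(c != NULL && t != NULL);
	if (c->ire != NULL && c->ire_gen == t->generation &&
	    c->ire->dst == dst)
		return (c->ire);
	c->ire = sketch_ire_lookup(t, dst);
	c->ire_gen = t->generation;
	return (c->ire);
}
```

With this shape, repeated sends on a connection skip the table lookup
until a route change bumps the generation; locking, reference counts,
and counter wraparound are all left out.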

Current prototyping indicates that about 30,000 lines of code can be 
removed as a result of these changes (combined with the ARP/IP merge 
pieces).

The discussion will take place on the existing 
[email protected] list.


Proposed Community Sponsors:
     Networking


Participants:
     Project lead:
         Erik Nordmark

     Other Participants:
         Sowmini Varadhan
         Yunsong (Roamer) Lu
         Nitin Hande

   Other interested participants: please speak up.  We have some
   prototype code, and contributions of review time, bug fixes, or
   testing are very welcome; there are a lot of code changes here.

------