On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > for the output path on strict alignment architectures.
> > > 
> > > I have merged jan@'s diffs and added some sanity checks and
> > > workarounds.
> > > 
> > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > >   to extract the IP, IPv6, TCP header.
> > > - If the header is in the first mbuf, use m_data for the fast path.
> > > - Add netstat counter for invalid header chains.  This makes
> > >   us aware when hardware checksumming fails.
> > > - Add netstat counter for header copies.  This indicates that
> > >   better storage allocation in the network stack is possible.
> > >   It also allows us to recognize alignment problems on non-strict
> > >   architectures.
> > > - There is no risk of crashes on sparc64.
> > > 
> > > Does this approach make sense?
> > 
> > I think it is overly complicated and too much data is copied around.
> > First of all there is no need to extract ipproto.
> > The code can just use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they
> > are not set for other protos).
> > Because of this, only the ip_hlen field needs to be accessed, and this
> > can be done with m_getptr().
> 
> A solution with m_getptr() is where we started.  It has already
> been rejected.  The problem is that there are 6 ways to implement
> this feature.  Every option has its drawbacks and was rejected.
> 
> Options are:
> 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> 2. m_getptr() and access the byte or word.  Header fields should be
>    accessed by structs.
> 3. Always m_copydata.  Too much overhead.
> 4. Always use m_data.  Kernel may crash or use invalid data.
> 5. Combination of m_data and m_copydata.  Too complex.
> 6. Track the fields in mbuf header.  Too fragile and memory overhead.
> 
> In my measurements checksum offloading gave us 10% performance boost
> so I want this feature.  Other drivers also have it.
> 
> Could we get another idea or find a consensus which option to use?

after staring at this for a few hours my conclusion is option 1 actually
is the right approach, but the diff for ixl has a bug.

> > In the IP6 case even more can be skipped since ip_hlen is static for IPv6.
> > 
> > In ixl(4) the tcp header length also needs to be extracted. Again the code
> > can be simplified because HW checksumming is only enabled if ip_hlen == 5
> > and so the offset of the th_off field is static (for both IPv4 and IPv6).
> > Again m_getptr can be used to just access the byte with th_off.
> > 
> > Longterm in_proto_cksum_out() should probably help provide the th_off
> > field. I think enforcing ip_hlen == 5 for UDP and TCP is fine, who needs
> > IP options on UDP and TCP?
> 
> Other diffs have been rejected as they make too many assumptions about
> how the stack works.

my opinion is we can make these assumptions. the CSUM_OUT flags are
only set in very specific places where the stack itself is constructing
or checking properly aligned headers, and then the stack maintains
this alignment until it reaches the driver. this is where the first
of the bugs in the ixl diff comes in.

in the diff ixl_tx_setup_offload() is called after the dma mapping
occurs. this is implemented in this code:

static inline int
ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
{
        int error;

        error = bus_dmamap_load_mbuf(dmat, map, m,
            BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
        if (error != EFBIG)
                return (error);

        error = m_defrag(m, M_DONTWAIT);
        if (error != 0)
                return (error);

        return (bus_dmamap_load_mbuf(dmat, map, m,
            BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
}

the problem is that when we get a heavily fragmented mbuf we call
m_defrag, which basically allocates a cluster and copies all the
data from the fragments into it. however, m_defrag does not maintain
the alignment of the payload, meaning the ethernet header starts
on a 4 byte boundary because that's what we get from the cluster pool,
and the IP and TCP headers end up with a 2 byte misalignment.

my proposed solution to this is to set up the offload bits before
calling ixl_load_mbuf. i'm confident this is the bug that deraadt@ hit.

because of when the CSUM_OUT flags are set, i think m_getptr is fine for
pulling the mbuf apart. if it's not i want the code to blow up so it
gets fixed. that's why the code in my diff below lacks defenses.

the other bug is that the uint64_t holding the offload flags isn't
reset between calls for different mbufs. my solution to that is
to have ixl_tx_setup_offload() return the flags so the value in
ixl_start is overwritten every time.

lastly, ixl shouldn't (doesn't need to?) spelunk in the packet if
the CSUM_OUT flags aren't set. ixl_tx_setup_offload() now returns 0
early if none of the flags are set.

i'll have a look at ix and the ixl rx bits tomorrow.

Index: if_ixl.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
retrieving revision 1.78
diff -u -p -r1.78 if_ixl.c
--- if_ixl.c    9 Jan 2022 05:42:54 -0000       1.78
+++ if_ixl.c    4 Feb 2022 11:26:16 -0000
@@ -82,6 +82,10 @@
 #endif
 
 #include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
 #include <netinet/if_ether.h>
 
 #include <dev/pci/pcireg.h>
@@ -2771,6 +2775,88 @@ ixl_load_mbuf(bus_dma_tag_t dmat, bus_dm
            BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
 }
 
+static uint64_t
+ixl_tx_setup_offload(struct mbuf *m0)
+{
+       struct mbuf *m;
+       int hoff;
+       uint64_t hlen;
+       uint8_t ipproto;
+       uint64_t offload = 0;
+
+       if (!ISSET(m0->m_pkthdr.csum_flags,
+           M_IPV4_CSUM_OUT|M_TCP_CSUM_OUT|M_UDP_CSUM_OUT))
+               return (0);
+
+       switch (ntohs(mtod(m0, struct ether_header *)->ether_type)) {
+       case ETHERTYPE_IP: {
+               struct ip *ip;
+
+               m = m_getptr(m0, ETHER_HDR_LEN, &hoff);
+               KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip));
+               ip = (struct ip *)(mtod(m, caddr_t) + hoff);
+
+               offload |= ISSET(m0->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT) ?
+                   IXL_TX_DESC_CMD_IIPT_IPV4_CSUM :
+                   IXL_TX_DESC_CMD_IIPT_IPV4;
+ 
+               hlen = ip->ip_hl << 2;
+               ipproto = ip->ip_p;
+               break;
+       }
+
+#ifdef INET6
+       case ETHERTYPE_IPV6: {
+               struct ip6_hdr *ip6;
+
+               m = m_getptr(m0, ETHER_HDR_LEN, &hoff);
+               KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip6));
+               ip6 = (struct ip6_hdr *)(mtod(m, caddr_t) + hoff);
+ 
+               offload |= IXL_TX_DESC_CMD_IIPT_IPV6;
+
+               hlen = sizeof(*ip6);
+               ipproto = ip6->ip6_nxt;
+               break;
+       }
+#endif
+       default:
+               panic("CSUM_OUT set for non-IP packet");
+               /* NOTREACHED */
+       }
+
+       offload |= (ETHER_HDR_LEN >> 1) << IXL_TX_DESC_MACLEN_SHIFT;
+       offload |= (hlen >> 2) << IXL_TX_DESC_IPLEN_SHIFT;
+
+       switch (ipproto) {
+       case IPPROTO_TCP: {
+               struct tcphdr *th;
+
+               if (!ISSET(m0->m_pkthdr.csum_flags, M_TCP_CSUM_OUT))
+                       break;
+
+               m = m_getptr(m, hoff + hlen, &hoff);
+               KASSERT(m != NULL && m->m_len - hoff >= sizeof(*th));
+               th = (struct tcphdr *)(mtod(m, caddr_t) + hoff);
+ 
+               offload |= IXL_TX_DESC_CMD_L4T_EOFT_TCP;
+               offload |= (uint64_t)th->th_off << IXL_TX_DESC_L4LEN_SHIFT;
+               break;
+       }
+
+       case IPPROTO_UDP:
+               if (!ISSET(m0->m_pkthdr.csum_flags, M_UDP_CSUM_OUT))
+                       break;
+ 
+               offload |= IXL_TX_DESC_CMD_L4T_EOFT_UDP;
+               offload |= (sizeof(struct udphdr) >> 2) <<
+                   IXL_TX_DESC_L4LEN_SHIFT;
+               break;
+       }
+
+       return (offload);
+}
+
 static void
 ixl_start(struct ifqueue *ifq)
 {
@@ -2785,6 +2871,7 @@ ixl_start(struct ifqueue *ifq)
        unsigned int prod, free, last, i;
        unsigned int mask;
        int post = 0;
+       uint64_t offload;
 #if NBPFILTER > 0
        caddr_t if_bpf;
 #endif
@@ -2816,6 +2903,8 @@ ixl_start(struct ifqueue *ifq)
                if (m == NULL)
                        break;
 
+               offload = ixl_tx_setup_offload(m);
+
                txm = &txr->txr_maps[prod];
                map = txm->txm_map;
 
@@ -2834,6 +2923,7 @@ ixl_start(struct ifqueue *ifq)
                        cmd = (uint64_t)map->dm_segs[i].ds_len <<
                            IXL_TX_DESC_BSIZE_SHIFT;
                        cmd |= IXL_TX_DESC_DTYPE_DATA | IXL_TX_DESC_CMD_ICRC;
+                       cmd |= offload;
 
                        htolem64(&txd->addr, map->dm_segs[i].ds_addr);
                        htolem64(&txd->cmd, cmd);
