On 2016-01-31 at 17:35 Dan Cross <[email protected]> wrote:
> Can you double-check with a git log that you didn't miss anything?  I
> > found two commits at least:
> >
> > ffa08811d0d6 ("82536: removes receive buffer pools")
> >
> 
> I just looked at this again. The old receive buffer pools code that
> the commit you referred to changed was substantially different than
> what was in the newly ported driver; it wasn't clear to me that this
> was a problem and I didn't want to diverge too far from the Plan 9
> driver.

Here's the deadlock: all of the rbufs can be pushed into the network
stack, waiting for someone to read, and then the app itself blocks on a
write.  The write blocks, waiting on an ACK.  The ACK never arrives,
since we are out of rbufs.  There's the circular wait that deadlocks
the system until the network stack starts timing out.

The old plan 9 driver had 3 * 512 rbufs.  The new one has 1024.

I dont' see how the new driver avoids this problem.  The new plan 9
driver seems to just panic now.  So I guess we won't have a mystery on
our hands when it happens - we'll just have a dead machine.  In the off
chance anyone at UCB runs kweb and httperf on c89, they probably will
trigger this bug.


> and
> >
> > 1795dce1b1e6 ("82563/i350: Fixes off-by-one issue")
> >
> 
> These changes were brought in, though in a different way than in this
> change. In the above referenced commits, this was effected by setting
> a temporary to the value of a structure member minus 1. However, that
> structure member did not exist in the (new) Plan 9 driver, was only
> set to a constant in the old driver anyway; I implemented the effect
> of the change in the above mentioned commit by changing the loop
> index to -1 the constant value. In other words, the effect of the
> change was brought in, but done in a way that more closely matched
> the new driver's extant structure.

OK.  Here's a few diffs I have from master to your branch for that file.
I'll limit it to places where NEXT_RING is used:

-   m = c->ntd;
-   while (c->tdba[n = NEXT_RING(tdh, m)].status & Tdd) {
+   struct block *bp;
+   int tdh, n;
+
+   tdh = ctlr->tdh;
+   while (ctlr->tdba[n = NEXT_RING(tdh, Ntd - 1)].status & Tdd) {
 
and later

-   m = ctlr->ntd;
+   qlock(&ctlr->tlock);

-   i82563txinit(ctlr);
+   /*
+    * Free any completed packets
+    */
+   tdh = i82563cleanup(ctlr);

-   for (;;) {
-       tdh = i82563cleanup(edev);
+   /* if link down on 218, don't try since we need k1fix to run first */
+   if (!edev->link && ctlr->type == i218 && !ctlr->didk1fix) {
+       qunlock(&ctlr->tlock);
+       return;
+   }

-       if (NEXT_RING(tdt, m) == tdh) {
+   /*
+    * Try to fill the ring back up.
+    */
+   tdt = ctlr->tdt;
+   for (;;) {
+       if (NEXT_RING(tdt, Ntd - 1) == tdh) {   /* ring full? */

-       tdt = NEXT_RING(tdt, m);
+       tdt = NEXT_RING(tdt, Ntd - 1);
+   }

In those cases, NEXT_RING appears to be taking Ntd - 1, instead of Ntd,
effectively undoing the commit 1795dce1b1e6 ("82563/i350: Fixes
off-by-one issue").  The original bug fix I did was to change "Ntd - 1"
-> "Ntd".   It appears that your branch undoes that.  

Barret

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to