On Mon, Jun 12, 2017 at 15:11 +1000, David Gwynne wrote:
> On Fri, Jun 09, 2017 at 07:19:34PM +0200, Björn Ketelaars wrote:
> > On Fri 09/06/2017 12:07, Martin Pieuchot wrote:
> > > On 08/06/17(Thu) 20:38, Björn Ketelaars wrote:
> > > > On Thu 08/06/2017 16:55, Martin Pieuchot wrote:
> > > > > On 07/06/17(Wed) 09:43, Björn Ketelaars wrote:
> > > > > > On Sat 03/06/2017 08:44, Björn Ketelaars wrote:
> > > > > > > 
> > > > > > > Reverting back to the previous kernel fixed the issue above.
> > > > > > > Question: can someone give a hint on how to track this issue?
> > > > > > 
> > > > > > After a bit of experimenting I'm able to reproduce the problem.
> > > > > > Summary is that queueing in pf and use of a current (after May 30),
> > > > > > multi processor kernel (bsd.mp from snapshots) causes these specific
> > > > > > watchdog timeouts followed by a system freeze.
> > > > > > 
> > > > > > Issue is 'gone' when:
> > > > > > 1.) using an older kernel (before May 30);
> > > > > > 2.) removal of queueing statements from pf.conf. Included below the
> > > > > >     specific snippet;
> > > > > > 3.) switch from MP kernel to SP kernel.
> > > > > > 
> > > > > > New observation is that while queueing, using a MP kernel, the
> > > > > > download bandwidth is only a fraction of what is expected. Exchanging
> > > > > > the MP kernel with a SP kernel restores the download bandwidth to
> > > > > > expected level.
> > > > > > 
> > > > > > I'm guessing that this issue is related to recent work on PF?
> > > > > 
> > > > > It's certainly a problem in, or exposed by, re(4) with the recent MP
> > > > > work in the network stack.
> > > > > 
> > > > > It would help if you could build a kernel with MP_LOCKDEBUG defined
> > > > > and see if the resulting kernel enters ddb(4) instead of freezing.
> > > > > 
> > > > > Thanks,
> > > > > Martin
> > > > 
> > > > Thanks for the hint! It helped in entering ddb. I collected a bit of
> > > > output, which you can find below. If I read the trace correctly the
> > > > crash is related to line 1750 of sys/dev/ic/re.c:
> > > > 
> > > >         d->rl_cmdstat |= htole32(RL_TDESC_CMD_EOF);
> > > 
> > > Could you test the diff below, always with a MP_LOCKDEBUG kernel and
> > > tell us if you can reproduce the freeze or if the kernel enters ddb(4)?
> > > 
> > > Another question, how often do you see "watchdog timeout" messages?
> > > 
> > > Index: re.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/dev/ic/re.c,v
> > > retrieving revision 1.201
> > > diff -u -p -r1.201 re.c
> > > --- re.c  24 Jan 2017 03:57:34 -0000      1.201
> > > +++ re.c  9 Jun 2017 10:04:43 -0000
> > > @@ -2074,9 +2074,6 @@ re_watchdog(struct ifnet *ifp)
> > >   s = splnet();
> > >   printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> > >  
> > > - re_txeof(sc);
> > > - re_rxeof(sc);
> > > -
> > >   re_init(ifp);
> > >  
> > >   splx(s);
> > 
> > The diff (with a MP_LOCKDEBUG kernel) resulted in similar traces as before.
> > ddb Output is included below.
> > 
> > With your diff the number of timeout messages decreased from 9 to 2 before
> > entering ddb.
> 
> can you try the diff below please?
>

Oh my, unserialized start.  Someone didn't finish the conversion (-;
OK mikeb

> Index: hfsc.c
> ===================================================================
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 hfsc.c
> --- hfsc.c    8 May 2017 11:30:53 -0000       1.39
> +++ hfsc.c    12 Jun 2017 05:08:01 -0000
> @@ -817,7 +817,7 @@ hfsc_deferred(void *arg)
>       KASSERT(HFSC_ENABLED(ifq));
>  
>       if (!ifq_empty(ifq))
> -             (*ifp->if_qstart)(ifq);
> +             ifq_start(ifq);
>  
>       hif = ifq->ifq_q;
>  
> 
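
To illustrate what "unserialized start" means in the hfsc.c diff above: as I
read it, the old line called the driver's start routine directly through
ifp->if_qstart, so the deferred HFSC callback could enter it while another
context was already inside, whereas ifq_start() funnels the request through
the ifq serializer so that only one context runs the start routine at a time.
Below is a minimal user-space sketch of that idea in C with pthreads. It is
not the kernel code: struct txq, drv_start(), txq_start() and run_callers()
are hypothetical names made up for the example, and the mutex-plus-flag
scheme is a simplification of what the real serializer does.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a transmit queue; NOT the kernel's struct ifqueue. */
struct txq {
	pthread_mutex_t mtx;	/* protects pending and running */
	int pending;		/* start requests not yet serviced */
	int running;		/* nonzero while a thread owns the start routine */
	int in_start;		/* how many threads are inside drv_start() right now */
	int max_in_start;	/* highest value in_start ever reached */
	long calls;		/* total drv_start() invocations */
};

#define TXQ_INITIALIZER { PTHREAD_MUTEX_INITIALIZER, 0, 0, 0, 0, 0 }

/*
 * Hypothetical driver start routine. It is deliberately not reentrant: it
 * updates plain, unlocked counters, much like a driver touching its TX ring.
 */
void
drv_start(struct txq *q)
{
	q->in_start++;
	if (q->in_start > q->max_in_start)
		q->max_in_start = q->in_start;
	q->calls++;
	q->in_start--;
}

/*
 * Serialized start: record the request, and only enter drv_start() if no
 * other thread is already running it. The thread that wins keeps draining
 * requests queued by the others before giving up ownership.
 */
void
txq_start(struct txq *q)
{
	pthread_mutex_lock(&q->mtx);
	q->pending++;
	if (q->running) {
		/* Someone else is in the driver; they will service our request. */
		pthread_mutex_unlock(&q->mtx);
		return;
	}
	q->running = 1;
	while (q->pending > 0) {
		q->pending--;
		pthread_mutex_unlock(&q->mtx);
		drv_start(q);		/* called without the mutex held */
		pthread_mutex_lock(&q->mtx);
	}
	q->running = 0;
	pthread_mutex_unlock(&q->mtx);
}

struct caller_arg {
	struct txq *q;
	int iters;
};

static void *
caller(void *p)
{
	struct caller_arg *a = p;

	for (int i = 0; i < a->iters; i++)
		txq_start(a->q);
	return NULL;
}

/* Run nthreads (at most 16) threads that each call txq_start() iters times. */
void
run_callers(struct txq *q, int nthreads, int iters)
{
	pthread_t tid[16];
	struct caller_arg a = { q, iters };

	assert(nthreads > 0 && nthreads <= 16);
	for (int i = 0; i < nthreads; i++)
		assert(pthread_create(&tid[i], NULL, caller, &a) == 0);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
}
```

With a struct txq set up via TXQ_INITIALIZER and run_callers(&q, 4, 10000),
max_in_start should stay at 1 and calls should reach 40000. If each thread
called drv_start() directly instead, which is the analogue of the removed
(*ifp->if_qstart)(ifq) line, nothing would prevent two threads from being
inside it at once.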
