[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #31 from Rick Macklem  ---
And thanks go to you for doing the testing and commits.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Andriy Gapon  changed:

   What|Removed |Added

 Status|In Progress |Closed
 Resolution|--- |FIXED

--- Comment #30 from Andriy Gapon  ---
Rick, thank you again!

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #29 from commit-h...@freebsd.org ---
A commit references this bug:

Author: avg
Date: Tue Feb 21 09:29:47 UTC 2017
New revision: 314034
URL: https://svnweb.freebsd.org/changeset/base/314034

Log:
  MFC r313735: add svcpool_close to handle killed nfsd threads

  PR:   204340
  Reported by:  Panzura
  Reviewed by:  rmacklem
  Approved by:  rmacklem

Changes:
_U  stable/10/
  stable/10/sys/fs/nfsserver/nfs_nfsdkrpc.c
  stable/10/sys/rpc/svc.c
  stable/10/sys/rpc/svc.h

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #28 from commit-h...@freebsd.org ---
A commit references this bug:

Author: avg
Date: Tue Feb 21 09:29:11 UTC 2017
New revision: 314033
URL: https://svnweb.freebsd.org/changeset/base/314033

Log:
  MFC r313735: add svcpool_close to handle killed nfsd threads

  PR:   204340
  Reported by:  Panzura
  Approved by:  rmacklem
  Obtained from:rmacklem

Changes:
_U  stable/11/
  stable/11/sys/fs/nfsserver/nfs_nfsdkrpc.c
  stable/11/sys/rpc/svc.c
  stable/11/sys/rpc/svc.h

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-14 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #27 from commit-h...@freebsd.org ---
A commit references this bug:

Author: avg
Date: Tue Feb 14 17:49:08 UTC 2017
New revision: 313735
URL: https://svnweb.freebsd.org/changeset/base/313735

Log:
  add svcpool_close to handle killed nfsd threads

  This patch adds a new function to the server krpc called
  svcpool_close().  It is similar to svcpool_destroy(), but does not free
  the data structures, so that the pool can be used again.

  This function is then used instead of svcpool_destroy(),
  svcpool_create() when the nfsd threads are killed.

  PR:   204340
  Reported by:  Panzura
  Approved by:  rmacklem
  Obtained from:rmacklem
  MFC after:1 week

Changes:
  head/sys/fs/nfsserver/nfs_nfsdkrpc.c
  head/sys/rpc/svc.c
  head/sys/rpc/svc.h

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-07 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #26 from Andriy Gapon  ---
(In reply to Rick Macklem from comment #25)
Rick, thank you very much!
We tested this patch and it makes nfsd more robust with respect to SIGKILL.
We haven't seen any regressions.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-05 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

 Attachment #179512|0   |1
is obsolete||

--- Comment #25 from Rick Macklem  ---
Created attachment 179661
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=179661=edit
add svcpool_close (cleaned up version)

This patch implements the same logic as 179512, but is cleaned up
by factoring out the code common to svcpool_destroy() and svcpool_close()
and placing it in svcpool_cleanup().

Semantically equivalent to 179512 for testing.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-01 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

 Status|Closed  |In Progress
 Resolution|FIXED   |---

--- Comment #24 from Rick Macklem  ---
Please test the patch I just attached. (4th one) I think it might make the
code less fragile to nfsd threads being signalled. I have not been able
to create a crash with the patch during limited testing.

Since avg@'s crash occurred with the other patches, I have reopened the
PR.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-02-01 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #23 from Rick Macklem  ---
Created attachment 179512
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=179512=edit
add svcpool_close so that svcpool_destroy doesn't get called when nfsd threads
are killed

This patch adds a new function to the server krpc called svcpool_close().
It is similar to svcpool_destroy(), but does not free the data structures,
so that the pool can be used again.

This function is then used instead of svcpool_destroy(), svcpool_create()
when the nfsd threads are killed.

These crashes are caused because the data structures were free'd by
svcpool_destroy() when the nfsd threads were killed off (or signalled somehow).
By avoiding the svcpool_destroy() call, the crashes should be avoided.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-01-31 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #22 from Andriy Gapon  ---
(In reply to Rick Macklem from comment #20)

Rick, thank you very much for the explanation!  I knew that nfsd processes were
special as they 'lend their stacks to kernel' or something like that.  But I
failed to realise that that put restrictions in the signals as well.

I should also explain that kill -9 was not used to shutdown nfsd or as a
replacement for the normal nfsd management.  It was used just to demonstrate
the problem.

I think that originally the problem happened when gdb was used on an nfsd
process.

I understand that the nfsd processes are special.  But the situation seems to
be a bit fragile.  The current design is old and proven.  But perhaps we could
switch to using kernel processes or maybe we could mark the nfsd processes with
a special flag somehow as to prevent them being killed SIGKILL or stopped with
SIGSTOP (i.e. prevent normal signal delivery for all signals).

Lastly, just to clarify, should we avoid using debuggers / SIGSTOP with nfsd?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-01-30 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #21 from Rick Macklem  ---
Oh, and if you get a crash when shutting down the nfsd threads via
"kill -USR1 ", then let us know, since something is
still broken.
(If you are curious, the "(master)" nfsd is the one that gets new
 TCP connections and it must stop doing that before the kernel
 threads are terminated. Otherwise you can easily get a socket
 upcall after the data structures have been free'd.)

rick

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-01-30 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #20 from Rick Macklem  ---
Well, you should never kill the nfsd with SIGKILL. The way
that is intended to be "safe" is to send SIGUSR1 to the "(master)"
nfsd. It will shut things down and kill off the other threads.

I'm not sure that I know of any way to make SIGKILL safe.
I'm pretty sure "man nfsd" says this, but maybe it needs
more emphasis?

I also believe that /etc/rc.d/nfsd does this correctly.

rick

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-01-30 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #19 from Andriy Gapon  ---
(In reply to Rick Macklem from comment #18)
> Andriy Gapon, did your crash occur when the machine was being shut down
(or the nfsd threads were being killed off)?

Yes, it did.  I am told that this was easy to reproduce by SIGKILL to nfsd.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-01-29 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #18 from Rick Macklem  ---
Andriy Gapon, did your crash occur when the machine was being shut down
(or the nfsd threads were being killed off)?

If not, it is not caused by what these patches were intended for.
(They are in 10.3.)

Your crash basically indicates that either:
1 - The sg_group was free'd when a socket upcall was still in progress.
Since the sg_group structures aren't free'd until the nfsd threads
are killed (shutdown or ??), I don't think this can happen during
normal operation.
OR
2 - The xprt structure that referenced the sg_group was free'd prematurely
and the sg pointer was bogus.

If it was #2, I think I can come up with a simple patch to avoid this.
(Basically acquire a reference count on the xprt structure during the
 socket upcall.)

rick

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2017-01-27 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Andriy Gapon  changed:

   What|Removed |Added

 CC||a...@freebsd.org

--- Comment #17 from Andriy Gapon  ---
Rick,
we got what looks like a very similar crash in FreeBSD 10.3:

db_trace_self_wrapper+0x2a
kdb_backtrace+0x37
vpanic+0xf7
panic+0x67
trap_fatal+0x264
trap_pfault+0x216
trap+0x32b
calltrap+0x8
__mtx_lock_sleep+0xa2
xprt_active+0xe7
svc_vc_soupcall+0x25
sowakeup+0x69
tcp_do_segment+0x319e
tcp_input+0x701
ip_input+0x14c
netisr_dispatch_src+0x228
ether_demux+0x1a5
ether_nh_input+0x1fc
netisr_dispatch_src+0x228
tcp_lro_flush+0x2b
ixgbe_rxeof+0x30d
ixgbe_msix_que+0x88
intr_event_execute_handlers+0x102
ithread_loop+0x9a
fork_exit+0x11f
fork_trampoline+0xe

It's this lock:
void
xprt_active(SVCXPRT *xprt)  
{   
SVCGROUP *grp = xprt->xp_group; 

mtx_lock(>sg_lock);

Do you have any suggestions?
Thank you!

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-12-05 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|In Progress |Closed

--- Comment #16 from Rick Macklem  ---
The patch that fixes this has been committed to head as r291150 and
stable/10 as r291869.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-12-05 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

  Flags|mfc-stable10?   |mfc-stable10+

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #12 from g_amana...@yahoo.com ---
The three last patches resolve the bug.
@rick thank you.
@koobs MFC?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #14 from Rick Macklem  ---
Also, although it was good to get this crash resolved, I would recommend
trying to avoid doing nfsd thread restarts. (If by any chance you were doing
this to avoid access problems during export/mount updates, the "-S" option
on mount suspends/resumes the nfsd threads, which should be preferable to
killing them off and restarting them.)

rick

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

  Flags|mfc-stable9?|mfc-stable9-

--- Comment #13 from Rick Macklem  ---
A variant of the first patch has already been committed to head by mav@.
He also noted to me that the second patch isn't needed. The "else" case
doesn't have any other thread manipulating the list, so the thread mutex
isn't needed.

Since you are using NFSv4.1, I'm pretty sure that the third patch is the
one that fixes the problem. I will commit this to head soon and MFC it
before closing this PR. (Oh, and the NFSv4.1 server stuff isn't in stable/9,
so the fix only applies to 10 and head.)

Thanks for doing the testing, so this could get resolved.

rick

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #15 from commit-h...@freebsd.org ---
A commit references this bug:

Author: rmacklem
Date: Sat Nov 21 23:55:46 UTC 2015
New revision: 291150
URL: https://svnweb.freebsd.org/changeset/base/291150

Log:
  When the nfsd threads are terminated, the NFSv4 server state
  (opens, locks, etc) is retained, which I believe is correct behaviour.
  However, for NFSv4.1, the server also retained a reference to the xprt
  (RPC transport socket structure) for the backchannel. This caused
  svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed
  a socket upcall to occur after the mutexes in the svcpool were destroyed,
  causing a crash.
  This patch fixes the code so that the backchannel xprt structure is
  dereferenced just before svcpool_destroy() is called, so the code
  does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.

  Tested by:g_amana...@yahoo.com
  PR:204340
  MFC after:2 weeks

Changes:
  head/sys/fs/nfs/nfs_var.h
  head/sys/fs/nfsserver/nfs_nfsdkrpc.c
  head/sys/fs/nfsserver/nfs_nfsdstate.c

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-19 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #11 from g_amana...@yahoo.com ---
I am currently testing with the three last patches applied as I do have NFSv4.1
clients running on Linux. Give me a couple of days and I will provide feedback.
Thank you for the insight.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-19 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Mark Linimon  changed:

   What|Removed |Added

   Keywords||patch

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #9 from Rick Macklem  ---
Created attachment 163300
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=163300=edit
patch that makes NFSv4.1 server release the backchannel xprt

This patch fixes the NFSv4.1 server so that it SVC_RELEASE()s the
backchannel xprt when the nfsd threads are terminated.
This patch only affects NFSv4.1, so it doesn't matter unless you
are running NFSv4.1 client mounts.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #8 from Rick Macklem  ---
Created attachment 163299
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=163299=edit
patch that locks mutex when request queue is updated

This patch fixes the only other issue I can spot in the kernel
server RPC. There was a case where the request queue was being
updated, but the st_lock mutex was not held. This bug could have
conceivably resulted in a corrupted request queue.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #10 from Rick Macklem  ---
I have just added 2 more patches that might be relevant to the crashes.
When the nfsd threads are terminated, this is what is supposed to happen:
- All nfsd threads running in svc_run_internal() return to svc_run().
- svc_run() waits for all these threads to return.
- After svc_run returns, the nfsd calls svcpool_destroy().
- svcpool_destroy() unregisters all the xprts (which represent the TCP sockets)
  - at this point, the reference count should be 1 for all xprts
  --> Then svcpool_destroy() calls SVC_RELEASE(xprt) for all of them, which
  drops the reference count to 0 and calls SVC_DESTROY()
 --> This actually calls svc_vc_destroy(), which shuts down the socket
upcall
 and after that, destroys the mutexes.

My best guess w.r.t. the crashes is that the reference count gets messed up on
an xprt, so it doesn't get SVC_DESTROY()'d. Then a socket upcall calls
xprt_active() after the mutex has been destroyed and BOOM.

The two patched should be applied along with the first one.
The second patch fixes the one other place that I can spot where the server
side krpc code isn't quite SMP safe. Although unlikely, it is conceivable
that this could cause the crashes.

The third patch makes sure that the backchannel xprt is dereferenced before
the call to svcpool_destroy(). The one seems a more likely culprit, but only
if you have clients doing NFSv4.1 mounts against the server.

If you could try the second patch (and the third if you have NFSv4.1 mounts),
that would be appreciated.

One final comment: I am assuming that you are terminating the nfsd threads
by sending a SIGUSR1 to the nfsd master. This is the only way the nfsd
threads should be terminated. (If you are using /etc/rc.d/nfsd, it should
be doing that, but you might try using "kill -USR1 "
directly, just in case the shell script is busted.

This pretty well exhausts what I can see that might cause the crashes and
I can't reproduce a crash here, so hopefully you can make some progress
from here.

Good luck with it, rick

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-17 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #5 from g_amana...@yahoo.com ---
(In reply to Rick Macklem from comment #4)
I am testing it, so far it seems ok. Give me a couple of days to test it
thoroughly.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-17 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #6 from g_amana...@yahoo.com ---
(In reply to Rick Macklem from comment #4)
It still crashes. I am going to get a dump tomorrow.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-17 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #7 from g_amana...@yahoo.com ---
This a fresh dump with the second patch applied, it seems the same. The problem
seems to occur in xprt_active().

Unread portion of the kernel message buffer:
stack pointer   = 0x28:0xfe01ee7ee430
frame pointer   = 0x28:0xfe01ee7ee4b0
code segment= base 0x0, limit 0xf, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags= interrupt enabled, resume, IOPL = 0
current process = 12 (irq266: em1:rx0)
trap number = 9
panic: general protection fault
cpuid = 0
KDB: stack backtrace:
#0 0x80984e30 at kdb_backtrace+0x60
#1 0x809489e6 at vpanic+0x126
#2 0x809488b3 at panic+0x43
#3 0x80d4ab6b at trap_fatal+0x36b
#4 0x80d4a7ec at trap+0x75c
#5 0x80d30882 at calltrap+0x8
#6 0x80b4a725 at xprt_active+0x45
#7 0x80b4e185 at svc_vc_soupcall+0x35
#8 0x809bcc52 at sowakeup+0x82
#9 0x80aea942 at tcp_do_segment+0x2b22
#10 0x80ae7720 at tcp_input+0x12b0
#11 0x80a77f57 at ip_input+0x97
#12 0x80a177d2 at netisr_dispatch_src+0x62
#13 0x80a0eb76 at ether_demux+0x126
#14 0x80a0f81e at ether_nh_input+0x35e
#15 0x80a177d2 at netisr_dispatch_src+0x62
#16 0x804e121b at em_rxeof+0x2eb
#17 0x804e1663 at em_msix_rx+0x33
Uptime: 10h5m21s
Dumping 3079 out of 8134 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko.symbols...done.
Loaded symbols for /boot/kernel/zfs.ko.symbols
Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
Loaded symbols for /boot/kernel/opensolaris.ko.symbols
Reading symbols from /boot/kernel/if_tap.ko.symbols...done.
Loaded symbols for /boot/kernel/if_tap.ko.symbols
Reading symbols from /boot/kernel/if_bridge.ko.symbols...done.
Loaded symbols for /boot/kernel/if_bridge.ko.symbols
Reading symbols from /boot/kernel/bridgestp.ko.symbols...done.
Loaded symbols for /boot/kernel/bridgestp.ko.symbols
Reading symbols from /boot/kernel/aio.ko.symbols...done.
Loaded symbols for /boot/kernel/aio.ko.symbols
Reading symbols from /boot/kernel/coretemp.ko.symbols...done.
Loaded symbols for /boot/kernel/coretemp.ko.symbols
Reading symbols from /boot/kernel/ipmi.ko.symbols...done.
Loaded symbols for /boot/kernel/ipmi.ko.symbols
Reading symbols from /boot/kernel/smbus.ko.symbols...done.
Loaded symbols for /boot/kernel/smbus.ko.symbols
Reading symbols from /boot/kernel/aesni.ko.symbols...done.
Loaded symbols for /boot/kernel/aesni.ko.symbols
Reading symbols from /boot/kernel/crypto.ko.symbols...done.
Loaded symbols for /boot/kernel/crypto.ko.symbols
Reading symbols from /boot/kernel/mps.ko.symbols...done.
Loaded symbols for /boot/kernel/mps.ko.symbols
Reading symbols from /boot/kernel/vmm.ko.symbols...done.
Loaded symbols for /boot/kernel/vmm.ko.symbols
Reading symbols from /boot/kernel/nmdm.ko.symbols...done.
Loaded symbols for /boot/kernel/nmdm.ko.symbols
Reading symbols from /boot/kernel/geom_eli.ko.symbols...done.
Loaded symbols for /boot/kernel/geom_eli.ko.symbols
Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
Reading symbols from /boot/kernel/ipfw.ko.symbols...done.
Loaded symbols for /boot/kernel/ipfw.ko.symbols
Reading symbols from /boot/kernel/dummynet.ko.symbols...done.
Loaded symbols for /boot/kernel/dummynet.ko.symbols
Reading symbols from /boot/kernel/ipfw_nat.ko.symbols...done.
Loaded symbols for /boot/kernel/ipfw_nat.ko.symbols
Reading symbols from /boot/kernel/libalias.ko.symbols...done.
Loaded symbols for /boot/kernel/libalias.ko.symbols
Reading symbols from /boot/kernel/ctl.ko.symbols...done.
Loaded symbols for /boot/kernel/ctl.ko.symbols
Reading symbols from /boot/kernel/iscsi.ko.symbols...done.
Loaded symbols for /boot/kernel/iscsi.ko.symbols
Reading symbols from /boot/kernel/nullfs.ko.symbols...done.
Loaded symbols for /boot/kernel/nullfs.ko.symbols
Reading symbols from /boot/kernel/fdescfs.ko.symbols...done.
Loaded symbols for /boot/kernel/fdescfs.ko.symbols
#0  doadump (textdump=) at pcpu.h:219
219 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) list *0xfe01ee7ee430
No source file for address 0xfe01ee7ee430.
(kgdb) backtrace
#0  doadump (textdump=) at pcpu.h:219
#1  0x80948642 in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:451
#2  0x80948a25 in vpanic (fmt=, 
ap=) at /usr/src/sys/kern/kern_shutdown.c:758
#3  0x809488b3 in panic (fmt=0x0) at
/usr/src/sys/kern/kern_shutdown.c:687
#4  0x80d4ab6b in trap_fatal (frame=, 
eva=) at /usr/src/sys/amd64/amd64/trap.c:851
#5  0x80d4a7ec in trap (frame=)
at /usr/src/sys/amd64/amd64/trap.c:203
#6  0x80d30882 in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:236
#7  0x8092e980 in __mtx_lock_sleep 

[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-16 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Kubilay Kocak  changed:

   What|Removed |Added

  Flags||mfc-stable9?, mfc-stable10?
   Keywords||crash

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-16 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #3 from g_amana...@yahoo.com ---
(In reply to Rick Macklem from comment #2)
Thanks for the insight. However, I just tried the patch proposed and it still
crashes.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-16 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

 Attachment #163160|0   |1
is obsolete||

--- Comment #4 from Rick Macklem  ---
Created attachment 163217
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=163217=edit
Patch to fix sg_threadcount++ so the mutex is held when done

I took a closer look at svc.c and the only way that the race
I think might have happened can occur is if svc_run() doesn't
wait for all threads to terminate. I also can now see that the
first patch wouldn't have fixed anything, so I'm not surprised
it didn't work.

The only thing I can see that is broken in the code and might
allow svc_run() to return before all threads have terminated is:
- The thread count sg_threadcount is incremented when the sg_lock
  mutex isn't held.
  --> This could conceivably result in a corrupted sg_threadcount,
  which would allow svc_run() to return before all threads have
  terminated.

This second patch fixes the code so that sg_threadcount++ is always
done when the sg_lock mutex is held.

If you can test this one instead of the last one, that would be
appreciated. I do know this patch fixes the above problem, but I
don't know if this is the cause of your crashes.

Also, this bug seems to have existed in the code forever and all
that r267228 did was switch from not holding the pool mutex to
not holding the group mutex. So Alexander, you are off the hook,
I think.;-) I've left you on the cc, since you probably know this
code better than anyone else and might have some insight w.r.t. this crash.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-15 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Rick Macklem  changed:

   What|Removed |Added

 Status|New |In Progress
 CC||rmack...@freebsd.org
   Assignee|freebsd-bugs@FreeBSD.org|rmack...@freebsd.org

--- Comment #2 from Rick Macklem  ---
Created attachment 163160
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=163160=edit
patch that might fix this problem

I think this crash might have been caused by a race
between svcpool_destroy() and the socket upcall.
The code in svcpool_destroy() assumes that SVC_RELEASE(xprt)
drops the ref cnt to 0, so that SVC_DESTROY() is called.
-->SVC_DESTROY() shuts down the socket upcall.
--> If the ref cnt doesn't go to 0, svcpool_destroy() will
mtx_destroy() the mutexes prematurely.

I am not sure, but the race might have been introduced by
r267228 since, prior to this there was a single mutex for
the pool, held while all xprt's are unregistered.
After r267228, there is a group of mutexes, where the code
only held one at a time, so I think an xprt might get re-registered
on another group after that group has had all de-registered.

The attached little patch moves the mtx_lock() calls to a
separate loop before the xprt_unregister loops, so that all
locks are held while all are de-registered.

I've added mav@ to the cc list, since he might be the guy
that actually understands this.

Anyhow, if you could test the attached patch with msi interrupts
re-enabled and see if the crashes go away, that would be great.
(I don't think that this indicates that the em(4) driver is broken.
 I suspect that it just affects timing of the interrupts that tripped
 over this race.)

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-14 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Sean Bruno  changed:

   What|Removed |Added

   Keywords||IntelNetworking
 CC||sbr...@freebsd.org

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-07 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

g_amana...@yahoo.com changed:

   What|Removed |Added

 CC||freebsd-bugs@FreeBSD.org

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-06 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

Bug ID: 204340
   Summary: [panic] nfsd, em, msix, fatal trap 9
   Product: Base System
   Version: 10.2-RELEASE
  Hardware: Any
OS: Any
Status: New
  Severity: Affects Only Me
  Priority: ---
 Component: kern
  Assignee: freebsd-bugs@FreeBSD.org
  Reporter: g_amana...@yahoo.com

On a X9SCM with em NICs restarting nfsd with mounted shares remotely results in
kernel trap 9 with panic and reboot. The system is running a GENERIC kernel on
an up-to-date FreeBSD 10.2-RELEASE. 

On /var/log/messages:

1406 Nov  5 16:43:00 x3200 kernel: 
1407 Nov  5 16:43:00 x3200 kernel: 
1408 Nov  5 16:43:00 x3200 kernel: Fatal trap 9: general protection fault while
in kernel mode
1409 Nov  5 16:43:00 x3200 kernel: cpuid = 0; apic id = 00
1410 Nov  5 16:43:00 x3200 kernel: instruction pointer   =
0x20:0x8092e980
1411 Nov  5 16:43:00 x3200 kernel: stack pointer =
0x28:0xfe01ee7ee430
1412 Nov  5 16:43:00 x3200 kernel: frame pointer =
0x28:0xfe01ee7ee4b0
1413 Nov  5 16:43:00 x3200 kernel: code segment  = base 0x0, limit
0xf, type 0x1b
1414 Nov  5 16:43:00 x3200 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
1415 Nov  5 16:43:00 x3200 kernel: processor eflags  =
1416 Nov  5 16:43:00 x3200 kernel: interrupt enabled, resume, IOPL = 0
1417 Nov  5 16:43:00 x3200 kernel: current process   = 12 (irq266:
em1:rx0)
1418 Nov  5 16:45:06 x3200 syslogd: kernel boot file is /boot/kernel/kernel
1419 Nov  5 16:45:06 x3200 kernel: trap number   = 9
1420 Nov  5 16:45:06 x3200 kernel: panic: general protection fault
1421 Nov  5 16:45:06 x3200 kernel: cpuid = 0
1422 Nov  5 16:45:06 x3200 kernel: KDB: stack backtrace:
1423 Nov  5 16:45:06 x3200 kernel: #0 0x80984e30 at kdb_backtrace+0x60
1424 Nov  5 16:45:06 x3200 kernel: #1 0x809489e6 at vpanic+0x126
1425 Nov  5 16:45:06 x3200 kernel: #2 0x809488b3 at panic+0x43
1426 Nov  5 16:45:06 x3200 kernel: #3 0x80d4aadb at trap_fatal+0x36b
1427 Nov  5 16:45:06 x3200 kernel: #4 0x80d4a75c at trap+0x75c
1428 Nov  5 16:45:06 x3200 kernel: #5 0x80d307f2 at calltrap+0x8
1429 Nov  5 16:45:06 x3200 kernel: #6 0x80b4a725 at xprt_active+0x45
1430 Nov  5 16:45:06 x3200 kernel: #7 0x80b4e0a5 at
svc_vc_soupcall+0x35
1431 Nov  5 16:45:06 x3200 kernel: #8 0x809bcc52 at sowakeup+0x82
1432 Nov  5 16:45:06 x3200 kernel: #9 0x80aea942 at
tcp_do_segment+0x2b22
1433 Nov  5 16:45:06 x3200 kernel: #10 0x80ae7720 at tcp_input+0x12b0
1434 Nov  5 16:45:06 x3200 kernel: #11 0x80a77f57 at ip_input+0x97
1435 Nov  5 16:45:06 x3200 kernel: #12 0x80a177d2 at
netisr_dispatch_src+0x62
1436 Nov  5 16:45:06 x3200 kernel: #13 0x80a0eb76 at ether_demux+0x126
1437 Nov  5 16:45:06 x3200 kernel: #14 0x80a0f81e at
ether_nh_input+0x35e
1438 Nov  5 16:45:06 x3200 kernel: #15 0x80a177d2 at
netisr_dispatch_src+0x62
1439 Nov  5 16:45:06 x3200 kernel: #16 0x804e121b at em_rxeof+0x2eb
1440 Nov  5 16:45:06 x3200 kernel: #17 0x804e1663 at em_msix_rx+0x33
1441 Nov  5 16:45:06 x3200 kernel: Uptime: 2m10s
1442 Nov  5 16:45:06 x3200 kernel: Dumping 460 out of 8134
MB:..4%..11%..21%..32%..42%..53%..63%..73%..84%..91%Copyright (c) 1992-201
5 The FreeBSD Project.


Running kgdb:
(kgdb) list *0x8092e980
0x8092e980 is in __mtx_lock_sleep (/usr/src/sys/kern/kern_mutex.c:438).
433  * owner stops running or the state of the lock
changes.
434  */
435 v = m->mtx_lock;
436 if (v != MTX_UNOWNED) {
437 owner = (struct thread *)(v & ~MTX_FLAGMASK);
438 if (TD_IS_RUNNING(owner)) {
439 if (LOCK_LOG_TEST(>lock_object, 0))
440 CTR3(KTR_LOCK,
441 "%s: spinning on %p held by
%p",
442 __func__, m, owner);

(kgdb) backtrace
#0  doadump (textdump=) at pcpu.h:219
#1  0x80948642 in kern_reboot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:451
#2  0x80948a25 in vpanic (fmt=, ap=) at /usr/src/sys/kern/kern_shutdown.c:758
#3  0x809488b3 in panic (fmt=0x0) at
/usr/src/sys/kern/kern_shutdown.c:687
#4  0x80d4aadb in trap_fatal (frame=, eva=) at /usr/src/sys/amd64/amd64/trap.c:851
#5  0x80d4a75c in trap (frame=) at
/usr/src/sys/amd64/amd64/trap.c:203
#6  0x80d307f2 in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:236
#7  0x8092e980 in __mtx_lock_sleep (c=0xfee09198,
tid=18446735277688524800, opts=0, file=0x0, line=0)
at /usr/src/sys/kern/kern_mutex.c:437
#8  0x80b4a725 in xprt_active 

[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-06 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

g_amana...@yahoo.com changed:

   What|Removed |Added

   Hardware|Any |amd64
 CC||g_amana...@yahoo.com

-- 
You are receiving this mail because:
You are the assignee for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"


[Bug 204340] [panic] nfsd, em, msix, fatal trap 9

2015-11-06 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204340

--- Comment #1 from g_amana...@yahoo.com ---
Setting hw.em.enable_msix=0 in bootloader.conf resolves the symptoms.

-- 
You are receiving this mail because:
You are the assignee for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"