Re: file system deadlock in RELENG_11
On 25/08/2017 05:40, Mike Tancsa wrote: On 8/24/2017 4:01 PM, Mike Tancsa wrote: OK, this is fairly easy to repeat. If I start a sync of a snapshot via zrep, it hangs the box. CTRL+T shows DEBUG: overiding stale lock on zroot/chyves from pid 19378 sending zroot/chyves@zrep_10 to 10.151.9.2:zroot/chyves cannot receive new filesystem stream: destination 'zroot/chyves/guests/resi/disk1' exists must specify -F to overwrite it Are you sending a snapshot from the host to the guest? or is the guest sending to the host? load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0% I was able to get ps to run and not sure if its helpful or not, but these are the two unkillable zfs processes root 5 0.0 0.0 0128 - DL 10:09 0:05.58 [zfskern] root 29683 0.0 0.07752 3824 5 DE+ 15:53 0:00.32 zfs send -R -I zroot/chyves@zrep_0d zroot/chyves@zrep_10 root 29690 0.0 0.07752 3492 5 D+ 15:53 0:00.00 zfs rename zroot/chyves@zrep_10 zroot/chyves@zrep_10_unsent Should be unrelated but several years ago there was a deadlock issue when renaming a zvol, unless the fix got undone recently this may just be a distraction or a hint of a solution... https://svnweb.freebsd.org/base?view=revision=272474 -- FreeBSD - the place to B...Storing Data Shane Ambler ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: file system deadlock in RELENG_11
On 8/24/2017 4:01 PM, Mike Tancsa wrote: > OK, this is fairly easy to repeat. If I start a sync of a snapshot via > zrep, it hangs the box. CTRL+T shows > > > DEBUG: overiding stale lock on zroot/chyves from pid 19378 > sending zroot/chyves@zrep_10 to 10.151.9.2:zroot/chyves > cannot receive new filesystem stream: destination > 'zroot/chyves/guests/resi/disk1' exists > must specify -F to overwrite it > ^C > load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0% > 3476k > load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0% I was able to get ps to run and not sure if its helpful or not, but these are the two unkillable zfs processes root 5 0.0 0.0 0128 - DL 10:09 0:05.58 [zfskern] root 29683 0.0 0.07752 3824 5 DE+ 15:53 0:00.32 zfs send -R -I zroot/chyves@zrep_0d zroot/chyves@zrep_10 root 29690 0.0 0.07752 3492 5 D+ 15:53 0:00.00 zfs rename zroot/chyves@zrep_10 zroot/chyves@zrep_10_unsent -- --- Mike Tancsa, tel +1 519 651 3400 Sentex Communications, m...@sentex.net Providing Internet services since 1994 www.sentex.net Cambridge, Ontario Canada http://www.tancsa.com/ ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: file system deadlock in RELENG_11
OK, this is fairly easy to repeat. If I start a sync of a snapshot via zrep, it hangs the box. CTRL+T shows DEBUG: overiding stale lock on zroot/chyves from pid 19378 sending zroot/chyves@zrep_10 to 10.151.9.2:zroot/chyves cannot receive new filesystem stream: destination 'zroot/chyves/guests/resi/disk1' exists must specify -F to overwrite it ^C load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.79r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.99r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.19r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.37r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.55r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.74r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.92r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.11r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.31r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.50r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.69r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.90r 0.00u 0.00s 0% 3476k load: 0.52 cmd: zfs 29690 [tx->tx_sync_done_cv] 363.08r 0.00u 0.00s 0% 3476k On 8/24/2017 11:48 AM, Mike Tancsa wrote: > I upgraded a server yesterday from RELENG_11 from march 2017 to r322800 > (Aug 22) and noticed that under heavy disk IO in a VM, the server is > locking up. In the vm, I was doing a large untar and I noticed that > prior to the lockup, the hypervisor would be struggling to keep up the > disk writes. The VM is on a zvol if that makes any difference. A few > times in the VM, IO would be clogged to the point that the disk would > timeout in the VM > > Aug 24 08:32:02 kernel: ahcich6: Timeout on slot 14 port 0 > Aug 24 08:32:02 kernel: ahcich6: is cs ss rs > tfd 50 serr cmd 0001db17 > Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: > 61 00 a8 47 d8 40 01 00 00 01 00 00 > Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout > Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): Retrying command > > When the parent deadlocks, I cant run anything thats not already in RAM. > shutdown doesnt work and I have to reboot the box via IPMI. > > Any ideas how to debug this or try and better understand the problem so > I can at least work around it ? > > ---Mike > -- --- Mike Tancsa, tel +1 519 651 3400 Sentex Communications, m...@sentex.net Providing Internet services since 1994 www.sentex.net Cambridge, Ontario Canada http://www.tancsa.com/ ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
file system deadlock in RELENG_11
I upgraded a server yesterday from RELENG_11 from march 2017 to r322800 (Aug 22) and noticed that under heavy disk IO in a VM, the server is locking up. In the vm, I was doing a large untar and I noticed that prior to the lockup, the hypervisor would be struggling to keep up the disk writes. The VM is on a zvol if that makes any difference. A few times in the VM, IO would be clogged to the point that the disk would timeout in the VM Aug 24 08:32:02 kernel: ahcich6: Timeout on slot 14 port 0 Aug 24 08:32:02 kernel: ahcich6: is cs ss rs tfd 50 serr cmd 0001db17 Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a8 47 d8 40 01 00 00 01 00 00 Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): Retrying command When the parent deadlocks, I cant run anything thats not already in RAM. shutdown doesnt work and I have to reboot the box via IPMI. Any ideas how to debug this or try and better understand the problem so I can at least work around it ? ---Mike -- --- Mike Tancsa, tel +1 519 651 3400 Sentex Communications, m...@sentex.net Providing Internet services since 1994 www.sentex.net Cambridge, Ontario Canada http://www.tancsa.com/ ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"