I'm investigating this issue and have built a kernel with the following two
patches:

a) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7776db1ccc1

b) A debug patch posted in
   http://lists.infradead.org/pipermail/linux-nvme/2017-February/008498.html

The idea of the first patch, which was merged upstream in Linux 4.12, is to
poll the completion queue of the device in the event of a timeout. If the poll
succeeds, the completion was already in the queue but the driver was never
notified of it, so it could be an adapter issue.
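
To make the poll-on-timeout idea concrete, here is a toy model in Python. It
is only a sketch and not the kernel code: the class ToyQueue, the function
on_timeout and the returned strings are names made up for illustration. On a
timeout the queue is checked directly; either an entry for the command is
found, or it is not, and the two cases are reported differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    """Toy stand-in for one queue pair; hypothetical, not a kernel structure."""
    posted: set = field(default_factory=set)     # IDs the device has completed
    delivered: set = field(default_factory=set)  # IDs the driver was notified of

    def device_completes(self, cmd_id, notify=True):
        """The device posts a completion; notify=False models a lost notification."""
        self.posted.add(cmd_id)
        if notify:
            self.delivered.add(cmd_id)

    def poll(self, cmd_id):
        """Look at the queue directly instead of waiting for a notification."""
        if cmd_id in self.posted and cmd_id not in self.delivered:
            self.delivered.add(cmd_id)
            return True
        return False


def on_timeout(queue, cmd_id):
    """Classify a timed-out command in this toy model."""
    if queue.poll(cmd_id):
        # An entry was sitting in the queue: the notification was missed.
        return "completion polled"
    # Nothing was posted for this command: fall back to aborting it.
    return "aborting"
```

For example, after q.device_completes(8, notify=False), on_timeout(q, 8)
reports the polled case, while a command that was never completed falls
through to the abort path.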

The idea of the second patch is just to provide debug information in case of a
mismatch in the choice of the blk-mq hw queue in the nvme driver. It's a debug
patch proposed on the mailing list to address a similar bug report in the past.

The kernel with the debug patches is available in a PPA. To install it,
follow the instructions below:

a) sudo add-apt-repository ppa:gpiccoli/test-nvme-182638 
b) sudo apt-get update 
c) sudo apt-get install linux-image-4.4.0-1073-aws 

After the installation is complete, please reboot the instance. Once it has
restarted, check the "uname -rv" output, which should be:

"4.4.0-1073-aws #83+hf182638v20181129b1-Ubuntu SMP Fri Nov 30 17:09:30 UTC 2018"
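
If it helps, the check can be scripted. The following is a small sketch in
Python (3.7 or later); the names EXPECTED, is_test_kernel and running_kernel
are made up for this example. It collapses whitespace before comparing, so a
wrapped or padded string still matches.

```python
import subprocess

# Expected "uname -rv" output for the test kernel, as given above.
EXPECTED = ("4.4.0-1073-aws #83+hf182638v20181129b1-Ubuntu "
            "SMP Fri Nov 30 17:09:30 UTC 2018")


def is_test_kernel(uname_rv, expected=EXPECTED):
    """Compare a 'uname -rv' string with the expected one, ignoring spacing."""
    return " ".join(uname_rv.split()) == expected


def running_kernel():
    """Return the 'uname -rv' output of the machine this runs on."""
    result = subprocess.run(["uname", "-rv"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Calling is_test_kernel(running_kernel()) on the rebooted instance should
return True if the test kernel is the one that booted.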

Please note that this is a test kernel: it should not be used in any
production environment and is not officially supported in any form.

If anybody can test this, it would be much appreciated. Please post the
complete dmesg output if and after the issue is triggered.
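
For anyone who wants a quick overview before posting, here is a rough Python
sketch that counts the nvme timeout lines in a saved dmesg. The regular
expression is only derived from the "I/O <n> QID <n> timeout, <action>" lines
in the sample quoted in the bug description; the names TIMEOUT_RE and
summarize_timeouts are made up, and the complete dmesg is still what is
needed.

```python
import re
from collections import Counter

# Matches lines such as:
#   nvme 0000:00:1f.0: I/O 8 QID 1 timeout, aborting
TIMEOUT_RE = re.compile(
    r"nvme (?P<pci>\S+): I/O (?P<tag>\d+) QID (?P<qid>\d+) "
    r"timeout, (?P<action>.+)$"
)


def summarize_timeouts(dmesg_text):
    """Count timeout messages per (queue ID, action) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            key = (int(match["qid"]), match["action"].strip())
            counts[key] += 1
    return counts
```

Feeding it the text of a saved dmesg file returns a Counter keyed by queue ID
and action (for example "aborting" or "reset controller").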

Thanks,
Guilherme

-- 
https://bugs.launchpad.net/bugs/1788035

Title:
  nvme: avoid cqe corruption

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released

Bug description:
  To address a customer-reported NVMe issue with instance types (notably
  c5 and m5) that expose EBS volumes as NVMe devices, this commit from
  mainline v4.6 should be backported to Xenial:

  d783e0bd02e700e7a893ef4fa71c69438ac1c276 nvme: avoid cqe corruption
  when update at the same time as read

  dmesg sample:

  [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, aborting
  [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 9 QID 1 timeout, aborting
  [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 21 QID 2 timeout, aborting
  [Wed Aug 15 01:11:32 2018] nvme 0000:00:1f.0: I/O 10 QID 1 timeout, aborting
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, reset controller
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 21 QID 2
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 2
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 2
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 8 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 9 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 41943136
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 10 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 6976
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 24 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 25 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 2 QID 0
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x7
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: fffffffc
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 96
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000687 ("xlog_iodone") error 5 numblks 64
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Log I/O Error Detected. Shutting down filesystem
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Please umount the filesystem and rectify the problem(s)
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 872, lost async page write
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_iunlink_remove: xfs_imap_to_bp returned error -5.
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 873, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 874, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 875, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 876, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 877, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 878, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 879, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 880, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 881, lost async page write
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000697 ("xlog_iodone") error 5 numblks 64
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000699 ("xlog_iodone") error 5 numblks 64
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
  [Wed Aug 15 01:12:20 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 23 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 25 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 2 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:50 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:12:52 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, reset controller
  [Wed Aug 15 01:13:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:13:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:14:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
