Re: Error when running fio against nvme-of rdma target (mlx5 driver)

2022-02-10  Martin Oliveira
On 2/9/22 1:41 AM, Chaitanya Kulkarni wrote:
> On 2/8/22 6:50 PM, Martin Oliveira wrote:
> > Hello,
> >
> > We have been hitting an error when running IO over our nvme-of setup, using
> > the mlx5 driver, and we are wondering if anyone has seen anything similar or
> > has any suggestions.
> >
> > Both initiator and target are AMD EPYC 7502 machines connected over RDMA 
> > using a Mellanox MT28908. Target has 12 NVMe SSDs which are exposed as a 
> > single NVMe fabrics device, one physical SSD per namespace.
> >
> 
> Thanks for reporting this, if you can bisect the problem on your setup
> it will help others to help you better.
> 
> -ck

Hi Chaitanya,

I went back to a kernel as old as 4.15 and the problem was still there, so I 
don't know of a good commit to start from.
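
That said, if someone can point me at a suspect range I'm happy to bisect it;
roughly like this (sketch only -- the v5.12 "good" tag below is hypothetical,
since I haven't found a good kernel yet):

  git bisect start
  git bisect bad v5.17-rc2    # kernel where the error reproduces
  git bisect good v5.12       # hypothetical last-known-good tag
  # build and boot the suggested revision, run the fio job, then mark it:
  git bisect good             # or: git bisect bad
  git bisect reset            # when finished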

I also learned that I can reproduce this with as few as 3 cards, and I updated
the firmware on the Mellanox cards to the latest version.

I'd be happy to try any tests if someone has any suggestions.

Thanks,
Martin


Error when running fio against nvme-of rdma target (mlx5 driver)

2022-02-08  Martin Oliveira
Hello,

We have been hitting an error when running IO over our nvme-of setup, using the
mlx5 driver, and we are wondering if anyone has seen anything similar or has any
suggestions.

Both initiator and target are AMD EPYC 7502 machines connected over RDMA using 
a Mellanox MT28908. Target has 12 NVMe SSDs which are exposed as a single NVMe 
fabrics device, one physical SSD per namespace.
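
For reference, the target is set up through the nvmet configfs interface,
roughly as follows (simplified sketch: the subsystem name, device path and
address below are illustrative, and only one of the twelve namespaces is shown):

  SUBSYS=/sys/kernel/config/nvmet/subsystems/testnqn   # illustrative name
  PORT=/sys/kernel/config/nvmet/ports/1
  mkdir $SUBSYS
  echo 1 > $SUBSYS/attr_allow_any_host
  # one namespace per physical SSD (repeated for /dev/nvme1n1 .. /dev/nvme11n1)
  mkdir $SUBSYS/namespaces/1
  echo -n /dev/nvme0n1 > $SUBSYS/namespaces/1/device_path
  echo 1 > $SUBSYS/namespaces/1/enable
  # expose the subsystem on an RDMA port
  mkdir $PORT
  echo rdma > $PORT/addr_trtype
  echo ipv4 > $PORT/addr_adrfam
  echo 192.168.1.1 > $PORT/addr_traddr                  # illustrative address
  echo 4420 > $PORT/addr_trsvcid
  ln -s $SUBSYS $PORT/subsystems/testnqn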

When running an fio job targeting the fabrics devices directly (no filesystem,
see script at the end), within a minute or so we start seeing errors like this:

[  408.368677] mlx5_core :c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x002f address=0x24d08000 flags=0x0000]
[  408.372201] infiniband mlx5_0: mlx5_handle_error_cqe:332:(pid 0): WC error: 
4, Message: local protection error
[  408.380181] infiniband mlx5_0: dump_cqe:272:(pid 0): dump error cqe
[  408.380187] 0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  408.380189] 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  408.380191] 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  408.380192] 0030: 00 00 00 00 a9 00 56 04 00 00 01 e9 00 54 e8 e2
[  408.380230] nvme nvme15: RECV for CQE 0xce392ed9 failed with status 
local protection error (4)
[  408.380235] nvme nvme15: starting error recovery
[  408.380238] nvme_ns_head_submit_bio: 726 callbacks suppressed
[  408.380246] block nvme15n2: no usable path - requeuing I/O
[  408.380284] block nvme15n5: no usable path - requeuing I/O
[  408.380298] block nvme15n1: no usable path - requeuing I/O
[  408.380304] block nvme15n11: no usable path - requeuing I/O
[  408.380304] block nvme15n11: no usable path - requeuing I/O
[  408.380330] block nvme15n1: no usable path - requeuing I/O
[  408.380350] block nvme15n2: no usable path - requeuing I/O
[  408.380371] block nvme15n6: no usable path - requeuing I/O
[  408.380377] block nvme15n6: no usable path - requeuing I/O
[  408.380382] block nvme15n4: no usable path - requeuing I/O
[  408.380472] mlx5_core :c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x002f address=0x24d09000 flags=0x0000]
[  408.391265] mlx5_core :c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x002f address=0x24d0a000 flags=0x0000]
[  415.125967] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[  415.131898] nvmet: ctrl 1 fatal error occurred!
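
For completeness, the workload is direct, queued random I/O against several of
the /dev/nvme15nX fabrics devices at once; a minimal stand-in for the script at
the end would look something like this (parameters are illustrative, not the
exact ones we use):

  # illustrative parameters only; the real job file is in the script at the end
  fio --name=nvmeof-repro \
      --filename=/dev/nvme15n1:/dev/nvme15n2:/dev/nvme15n3 \
      --ioengine=libaio --direct=1 --rw=randrw --bs=4k \
      --iodepth=32 --numjobs=8 --time_based --runtime=300 \
      --group_reporting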

Occasionally, we've seen the following stack trace:

[ 1158.152464] kernel BUG at drivers/iommu/amd/io_pgtable.c:485!
[ 1158.427696] invalid opcode: 0000 [#1] SMP NOPTI
[ 1158.432228] CPU: 51 PID: 796 Comm: kworker/51:1H Tainted: P   OE 
5.13.0-eid-athena-g6fb4e704d11c-dirty #14
[ 1158.443867] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 
10/08/2020
[ 1158.451252] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[ 1158.456884] RIP: 0010:iommu_v1_unmap_page+0xed/0x100
[ 1158.461849] Code: 48 8b 45 d0 65 48 33 04 25 28 00 00 00 75 1d 48 83 c4 10 
4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 8d 46 ff 4c 85 f0 74 d6 <0f> 0b e8 
1c 38 46 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44
[ 1158.480589] RSP: 0018:abb520587bd0 EFLAGS: 00010206
[ 1158.485812] RAX: 000100061fff RBX: 0010 RCX: 0027
[ 1158.492938] RDX: 30562000 RSI:  RDI: 
[ 1158.500071] RBP: abb520587c08 R08: abb520587bd0 R09: 
[ 1158.507202] R10: 0001 R11: 000ff000 R12: 9984abd9e318
[ 1158.514326] R13: 9984abd9e310 R14: 000100062000 R15: 0001
[ 1158.521452] FS:  () GS:99a36c8c() 
knlGS:
[ 1158.529540] CS:  0010 DS:  ES:  CR0: 80050033
[ 1158.535286] CR2: 7f75b04f1000 CR3: 0001eddd8000 CR4: 00350ee0
[ 1158.542419] Call Trace:
[ 1158.544877]  amd_iommu_unmap+0x2c/0x40
[ 1158.548653]  __iommu_unmap+0xc4/0x170
[ 1158.552344]  iommu_unmap_fast+0xe/0x10
[ 1158.556100]  __iommu_dma_unmap+0x85/0x120
[ 1158.560115]  iommu_dma_unmap_sg+0x95/0x110
[ 1158.564213]  dma_unmap_sg_attrs+0x42/0x50
[ 1158.568225]  rdma_rw_ctx_destroy+0x6e/0xc0 [ib_core]
[ 1158.573201]  nvmet_rdma_rw_ctx_destroy+0xa7/0xc0 [nvmet_rdma]
[ 1158.578944]  nvmet_rdma_read_data_done+0x5c/0xf0 [nvmet_rdma]
[ 1158.584683]  __ib_process_cq+0x8e/0x150 [ib_core]
[ 1158.589398]  ib_cq_poll_work+0x2b/0x80 [ib_core]
[ 1158.594027]  process_one_work+0x220/0x3c0
[ 1158.598038]  worker_thread+0x4d/0x3f0
[ 1158.601696]  kthread+0x114/0x150
[ 1158.604928]  ? process_one_work+0x3c0/0x3c0
[ 1158.609114]  ? kthread_park+0x90/0x90
[ 1158.612783]  ret_from_fork+0x22/0x30

We first saw this on a 5.13 kernel but could reproduce with 5.17-rc2.

We found a possibly related bug report [1] that suggested disabling the IOMMU
could help, but even after I disabled it (amd_iommu=off iommu=off) I still get
errors (nvme IO timeouts). Another thread from 2016 [2] suggested that disabling
some kernel debug options could work around the "local protection error" but