FYI,
We encountered another issue when using RHEL IB kernel drivers. There
have been changes to ko2iblnd module parameters on Lustre clients that
are not compatible with the RHEL IB stack.
The changes are intended to improve Lustre performance on TrueScale
hardware and to support newer generations of Mellanox adapters.
Details are in LU-3322, LU-6723, and LU-7101. From the ticket notes,
it's pretty clear the RHEL RDMA stack is not part of their testing recipe.
A general overview of performance tuning is in this OpenFabrics presentation:
http://downloads.openfabrics.org/downloads/Media/OFSUG_2015/Friday/friday_02.pdf
FWIW, ko2iblnd tuning failed for us in a mixed IB environment where the
Lustre servers use different HCAs from the Lustre clients. Our solution
was to modify the ko2iblnd modprobe config and remove the new parameters
(i.e., use the defaults).
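To make "remove the new parameters" concrete, here is a minimal sketch of
the kind of modprobe config change involved. The parameter names below
(peer_credits, concurrent_sends, map_on_demand, etc.) are real ko2iblnd
module parameters, but the specific values are illustrative only, not
recommendations; your distribution's tuning line may differ:

```conf
# /etc/modprobe.d/ko2iblnd.conf
#
# Example of the kind of tuning line the LU tickets add for TrueScale
# and newer Mellanox HCAs (values illustrative, commented out so the
# module falls back to its built-in defaults):
#
#options ko2iblnd peer_credits=128 peer_credits_hiw=64 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512
```

Note the module has to be reloaded for the change to take effect, which in
practice means unmounting Lustre and unloading the modules (e.g. with
lustre_rmmod) before remounting.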
regards,
chris hunter
[email protected]
On 11/19/2015 10:33 AM, Lassus, Magnus wrote:
Thank you very much Chris. TrueScale it is, and using 2.6.32-504.23.4 solved it.
Regards,
Magnus
-----Original Message-----
From: Chris Hunter [mailto:[email protected]]
Sent: 18 November 2015 23:14
To: Lassus, Magnus <[email protected]>
Cc: [email protected]
Subject: Re: [lustre-discuss] o2ib (ib_qib) with 2.7.0 rpms on centos 6.6
Are you using TrueScale IB interfaces?
There is a known TrueScale bug in RHEL/CentOS 6.6 kernels. You should
try kernel 2.6.32-504.23.4 or newer. Some details of the bug are in
LU-6698 and RHSA-2015-1081.
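For quick triage, a check along these lines will tell you whether a node's
running kernel predates that fix. The cutoff version string comes from the
thread above; the comparison itself just relies on version sort:

```shell
#!/bin/sh
# Report whether the running kernel predates the TrueScale fix.
min_fixed="2.6.32-504.23.4"
running="$(uname -r)"
# sort -V orders version strings naturally; if the running kernel sorts
# strictly before the fixed release, it predates the fix.
oldest="$(printf '%s\n' "$running" "$min_fixed" | sort -V | head -n1)"
if [ "$oldest" = "$running" ] && [ "$running" != "$min_fixed" ]; then
    echo "$running predates $min_fixed: upgrade before debugging o2ib"
else
    echo "$running is at or past $min_fixed"
fi
```

On the kernel from the original report (2.6.32-504.8.1.el6_lustre) this
flags an upgrade, since 504.8.1 sorts before 504.23.4.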
regards,
chris hunter
yale hpc group
From: "Lassus, Magnus" <[email protected]>
To: "[email protected]"
<[email protected]>
Subject: [lustre-discuss] o2ib (ib_qib) with 2.7.0 rpms on centos 6.6:
LNetError: kiblnd_init_rdma: Src buffer exhausted: 1 frags
Hi,
I fail to understand where I go wrong in getting o2ib working using 2.7.0 RPMs
on top of CentOS 6.6. Running selftest I see:
Nov 17 18:22:40 ss08 kernel: LNet: Added LNI 10.165.32.18@o2ib [8/256/0/180]
Nov 17 18:24:40 ss08 kernel: LNetError:
12532:0:(o2iblnd_cb.c:1123:kiblnd_init_rdma()) Src buffer exhausted: 1 frags
Nov 17 18:24:40 ss08 kernel: LustreError:
12553:0:(brw_test.c:212:brw_check_page()) Bad data in page ffffea0070c20800:
0xbeefbeefbeefbeef, 0xeeb0eeb1eeb2eeb3 expec
Nov 17 18:24:40 ss08 kernel: LustreError:
12553:0:(brw_test.c:238:brw_check_bulk()) Bulk page ffffea0070c20800 (0/256) is
corrupted!
Nov 17 18:24:40 ss08 kernel: LustreError:
12553:0:(brw_test.c:343:brw_client_done_rpc()) Bulk data from
12345-10.165.32.18@o2ib is corrupted!
Nov 17 18:24:40 ss08 kernel: LNetError:
12532:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can't setup rdma for GET from
10.165.32.18@o2ib: -71
Nov 17 18:25:31 ss08 kernel: LNetError:
12529:0:(o2iblnd_cb.c:3036:kiblnd_check_txs_locked()) Timed out tx: active_txs,
0 seconds
Nov 17 18:25:31 ss08 kernel: LNetError:
12529:0:(o2iblnd_cb.c:3099:kiblnd_check_conns()) Timed out RDMA with
10.165.32.18@o2ib (0): c: 7, oc: 0, rc: 7
Nov 17 18:25:31 ss08 kernel: LustreError:
12558:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk WRITE failed for RPC from
12345-10.165.32.18@o2ib: -103
Nov 17 18:25:31 ss08 kernel: LustreError:
12558:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from
12345-10.165.32.18@o2ib has failed: -5
Nov 17 18:25:48 ss08 kernel: LNet:
12581:0:(rpc.c:1077:srpc_client_rpc_expired()) Client RPC expired: service 11,
peer 12345-10.165.32.18@o2ib, timeout 64.
Nov 17 18:25:48 ss08 kernel: LustreError:
12555:0:(brw_test.c:318:brw_client_done_rpc()) BRW RPC to
12345-10.165.32.18@o2ib failed with -110
# rpm -qa | egrep 'lustre|kernel' | sort
dracut-kernel-004-356.el6.noarch
kernel-2.6.32-504.8.1.el6_lustre.x86_64
kernel-devel-2.6.32-504.8.1.el6_lustre.x86_64
kernel-firmware-2.6.32-504.8.1.el6_lustre.x86_64
kernel-headers-2.6.32-504.8.1.el6_lustre.x86_64
lustre-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
lustre-iokit-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
lustre-modules-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
lustre-osd-ldiskfs-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
lustre-osd-ldiskfs-mount-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
lustre-tests-2.7.0-2.6.32_504.8.1.el6_lustre.x86_64.x86_64
perf-2.6.32-504.8.1.el6_lustre.x86_64
python-perf-2.6.32-504.8.1.el6_lustre.x86_64
Using the latest 2.7.63 build on 6.7 works.
Any pointers are warmly welcome as I'd prefer to use 2.7.0.
Regards,
Magnus
------------------------------
End of lustre-discuss Digest, Vol 116, Issue 9
**********************************************
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org