Hi All,
We recently upgraded from Lustre 2.5.3.90 on EL6 to 2.10.1 on EL7 (details
below) but have hit what looks like LU-10133 (order 8 page allocation failures).
We don’t have access to look at the JIRA ticket in more detail but from what we
can tell the fix is to change from vmalloc() to vmalloc_array() in the mlx4
drivers. However, the vmalloc_array() infrastructure is in an upstream (far
upstream) kernel so I’m not sure when we’ll see that fix.
While this may not be a Lustre issue directly, I know we can’t be the only
Lustre site running 2.10.1 over IB on Mellanox ConnectX-3 HCAs. So far we have
tried increasing vm.min_free_kbytes to 8 GB but that does not help.
vm.zone_reclaim_mode is disabled (for other reasons that may not be valid under
EL7) but order 8 chunks get depleted on both NUMA nodes so I’m not sure enabling
it is the answer either (though we have not tried that yet).
[root@ufrcmds1 ~]# cat /proc/buddyinfo
Node 0, zone DMA         1      0      0     0    2   1    1 0 1 1 3
Node 0, zone DMA32    1554  13496  11481  5108  150   0    0 0 0 0 0
Node 0, zone Normal 114119 208080  78468 35679 6215 690    0 0 0 0 0
Node 1, zone Normal  81295 184795 106942 38818 4485 293 1653 0 0 0 0
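For anyone who wants to track this over time, here is a minimal Python sketch (not part of our tooling) that parses /proc/buddyinfo and counts free blocks at or above a given order. The function names parse_buddyinfo and free_at_or_above are hypothetical, and the size arithmetic assumes the usual 4 KiB page size on x86_64, so an order 8 block is 2^8 pages = 1 MiB of contiguous memory. The sample text is the output shown above.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, the x86_64 default


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)", line)
        if m:
            counts = [int(n) for n in m.group(3).split()]
            zones[(int(m.group(1)), m.group(2))] = counts
    return zones


def free_at_or_above(counts, order):
    """Number of free blocks that can satisfy an allocation of this order."""
    return sum(counts[order:])


SAMPLE = """\
Node 0, zone DMA 1 0 0 0 2 1 1 0 1 1 3
Node 0, zone DMA32 1554 13496 11481 5108 150 0 0 0 0 0 0
Node 0, zone Normal 114119 208080 78468 35679 6215 690 0 0 0 0 0
Node 1, zone Normal 81295 184795 106942 38818 4485 293 1653 0 0 0 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample when not on Linux
    order = 8
    block_kib = (2 ** order) * PAGE_KIB
    for (node, zone), counts in sorted(parse_buddyinfo(text).items()):
        n = free_at_or_above(counts, order)
        print(f"node {node} zone {zone}: {n} free blocks of >= {block_kib} KiB")
```

Run periodically (for example from cron), this makes it easy to see when the Normal zones run out of order 8 and larger blocks, which is the condition the output above shows on both NUMA nodes.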
I’m wondering if other sites are hitting this and, if so, what are you doing to
work around the issue on your OSSs.
Regards,
Charles Taylor
UF Research Computing
Some Details:
-------------------
OS: RHEL 7.4 (Linux ufrcoss28.ufhpc 3.10.0-693.2.2.el7_lustre.x86_64)
Lustre: 2.10.1 (lustre-2.10.1-1.el7.x86_64)
Clients: ~1400 (still running 2.5.3.90 but we are in the process of upgrading)
Servers: 10 HA OSS pairs (20 OSSs)
    128 GB RAM
    6 OSTs (8+2 RAID-6) per OSS
    Mellanox ConnectX-3 IB/VPI HCAs
    RedHat Native IB Stack (i.e. not MOFED)
mlx4_core driver:
    filename: /lib/modules/3.10.0-693.2.2.el7_lustre.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
    version: 2.2-1
    license: Dual BSD/GPL
    description: Mellanox ConnectX HCA low-level driver
    author: Roland Dreier
    rhelversion: 7.4