Hello Carlos,

I'm sorry that it didn't work.


One question: are you using the precompiled Lustre RPMs (e.g. those available 
from: https://downloads.whamcloud.com/public/lustre/lustre-2.15.6/ ) or are you 
compiling your own RPMs from the Lustre git repository ( 
https://github.com/lustre/lustre-release ) ?


In our case we use the second approach and I think it is better for two reasons:


1- You make sure that everything is consistent, especially with your MOFED 
environment.

2- You are not forced to use the exact versions corresponding to tags; 
you can choose any version available in the git repository or cherry-pick 
the fixes you think are useful (more details on this later).


Last week we upgraded a small HPC cluster that uses RHEL 8 for the file 
server and RHEL 9 for the clients. The update was successful and so far we have 
had no problems related to MOFED, Lustre, PMIx, Slurm or MPI (including MPI-IO).


Our upgrade is described in a message posted on this mailing list on April 7th:


http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2025-April/019471.html


As you can see, we also plan to add storage (OSTs) soon by connecting a 
new MSA 2060 to our file server (this file server plays the role of MGS, MDS and 
OSS). As you can also see, we didn't compile Lustre 2.15.6 exactly. We compiled 
a commit on the 2.15 branch containing 2.15.6 plus three additional patches, 
including LU-18085. Many users running 2.15.6 without this patch (LU-18085) 
complained on lustre-discuss; unfortunately, it was added to the 2.15 branch 
only a few days after 2.15.6 was released. See this thread on the 
lustre-discuss mailing list, for example:


http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2025-April/019474.html


I will now outline our procedure for getting Lustre onto our RHEL 
8.10 server. It may be overkill, but I think it takes all the precautions, and 
it worked in our case:


  1.  Install RHEL 8.10 on the system using the base kernel you want to patch 
(4.18.0-553.27.1 in our case). Don't forget kernel-headers (for compiling 
MOFED) and have the kernel source RPM available (to compile the patched kernel).
  2.  Compile the MOFED RPMs corresponding to the MOFED version you chose (e.g. 
24.10-2.1.8.0-LTS) using the mlnx_add_kernel_support.sh script with the --kmp 
option.
  3.  Install the MOFED RPMs (they will uninstall the OFED from the Linux 
distro). In our case we install: mlnx-ofed-all knem mlnxofed-docs libxpmem-devel
  4.  Reboot (to activate the new MOFED).
  5.  Test MOFED.
  6.  Compile the RPMs corresponding to the patched Lustre kernel (you will 
need the kernel source).
  7.  Put the resulting RPMs on a web server and set up an RPM repository 
(createrepo_c) so that they can be used during the next system installation.
  8.  Re-install the system, making sure that your kickstart file refers to 
the repository containing the Lustre patched kernel RPMs (they must take 
precedence over the corresponding distro RPMs), and reboot.
  9.  Repeat step 2 to compile a new MOFED, since the patched kernel is 
different.
  10. Repeat steps 3 and 4. Your system will now have a MOFED that corresponds 
exactly to your kernel patched for Lustre and not to the base kernel (because 
the base kernel is not even installed on the system).
  11. Repeat step 5 to test the MOFED on the new kernel.
  12. Compile the server-specific RPMs related to Lustre.
  13. Install those server RPMs (in our case: kmod-lustre 
kmod-lustre-osd-ldiskfs lustre{,-devel} lustre-iokit lustre-osd-ldiskfs-mount).
  14. Configure Lustre (e.g. /etc/lnet.conf, /etc/fstab, enable lnet.service).
  15. Reboot.
  16. With a little luck, the Lustre server should be operational.
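
Steps 9 and 10 only make sense once the machine is really running the patched 
kernel. A minimal sketch of a guard for that, assuming the patched kernel 
carries "_lustre" in its release string (as in the 
4.18.0-553.27.1.el8_lustre.x86_64 name quoted further down in this thread); the 
function name is hypothetical:

```shell
# is_lustre_kernel RELEASE
# Succeeds when RELEASE (as printed by `uname -r`) looks like a
# Lustre-patched kernel, i.e. contains "_lustre".
# Assumption: the patched kernel keeps the "_lustre" suffix in its name.
is_lustre_kernel() {
    case "$1" in
        *_lustre*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example guard before rebuilding MOFED (step 9):
#   is_lustre_kernel "$(uname -r)" || { echo "not on the patched kernel" >&2; exit 1; }
```

Running the guard first avoids rebuilding MOFED against the base kernel by 
mistake, which would bring back the mismatch the procedure is meant to avoid.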

I hope this helps. Good luck!

Martin Audet

________________________________
From: Carlos Adean <[email protected]>
Sent: April 23, 2025 9:06 PM
To: Audet, Martin; [email protected]
Cc: Eloir Troyack
Subject: EXT: Re: [lustre-discuss] Installing lustre 2.15.6 server on rhel-8.10 
fails

***Attention*** This email originated from outside of the NRC. ***Attention*** 
Ce courriel provient de l'extérieur du CNRC.

Hello Martin,

Thank you for the hint.

I tried rebuilding using the suggested parameter, but the warnings persist.

Additionally, the system still fails to boot using the lustre kernel.

We noticed that Lustre's kernel image does not have the megaraid_sas module, 
which is used by the system to enable the Dell PERC H330 controller. This may 
be the cause of the boot failure.

[root@mds2 ~]# lsinitrd /boot/initramfs-4.18.0-553.27.1.el8_lustre.x86_64.img | grep megaraid_sas
[root@mds2 ~]#

However, this is not true for the kernel image installed via dnf.

[root@mds2 ~]# lsinitrd /boot/initramfs-4.18.0-553.27.1.el8_10.x86_64.img | grep megaraid_sas
-rw-r--r-- 1 root root 72560 Jan 15 2024 usr/lib/modules/4.18.0-553.27.1.el8_10.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
[root@mds2 ~]#
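
The two checks above can be wrapped in a small helper so that every installed 
initramfs is tested the same way. This is only a sketch, assuming the lsinitrd 
output has first been saved to a file; has_module is a hypothetical name:

```shell
# has_module LISTING MODULE
# Succeeds if LISTING (a file holding `lsinitrd` output) contains an entry
# for MODULE.ko, either plain or compressed (.xz, .gz, .zst).
has_module() {
    grep -Eq "/$2\.ko(\.(xz|gz|zst))?\$" "$1"
}

# Example (run as root):
#   lsinitrd /boot/initramfs-4.18.0-553.27.1.el8_lustre.x86_64.img > /tmp/lustre.lst
#   has_module /tmp/lustre.lst megaraid_sas || echo "megaraid_sas missing"
```

If the module exists under /lib/modules for the Lustre kernel but is absent 
from the image, regenerating the image with "dracut --force --add-drivers 
megaraid_sas" for that kernel version may be worth trying. If it is absent from 
/lib/modules as well, the package that ships it for that kernel is probably not 
installed.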

I'm still struggling to install it.


---
Carlos Adean
www.linea.org.br<https://www.linea.org.br>


On Wed, Apr 23, 2025 at 09:22, Audet, Martin 
<[email protected]<mailto:[email protected]>> wrote:

Hello,


I think I had a similar problem a long time ago, and it was solved by adding the 
"--kmp" option to the "mlnx_add_kernel_support.sh" script when compiling the 
MOFED RPMs. Without this option, the MOFED RPM compilation completes without 
problems, as does the Lustre RPM compilation, but later, when installing the 
Lustre RPMs, we get a bunch of problems related to symbols.


Here is how I compile the MOFED RPMs (using the root account):


# mount_dir is the temporary mount directory
# ofed_iso  is the MOFED .iso file
#
mkdir -p -- "$mount_dir"
mount -o ro,loop "$ofed_iso" "$mount_dir"
"$mount_dir"/mlnx_add_kernel_support.sh -y --make-tgz --kmp -k "$(uname -r)" -m "$mount_dir"
#
# The compiled RPMs are now under /tmp
# ex: /tmp/MLNX_OFED_LINUX-24.10-2.1.8.0-rhel8.10.x86_64-ext.tgz


It seems that the pre-compiled RPMs distributed by Mellanox/NVIDIA are always 
generated using --kmp, but when using mlnx_add_kernel_support.sh this 
option must be specified explicitly. In addition, it seems that with the newer 
DOCA OFED, the script equivalent to mlnx_add_kernel_support.sh always adds the 
--kmp option on RHEL and similar distributions.
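
One way to see which flavour is installed on a given machine is to look at the 
package names. The sketch below assumes that KMP builds install a package 
named kmod-mlnx-ofa_kernel while non-KMP builds install 
mlnx-ofa_kernel-modules; please verify these names against your own RPM set. 
mofed_flavor is a hypothetical helper name:

```shell
# mofed_flavor
# Reads package names (one per line) on stdin and prints "kmp", "non-kmp"
# or "none".
# Assumption: KMP builds install kmod-mlnx-ofa_kernel, non-KMP builds
# install mlnx-ofa_kernel-modules.
mofed_flavor() {
    pkgs=$(cat)
    if printf '%s\n' "$pkgs" | grep -qx 'kmod-mlnx-ofa_kernel'; then
        echo kmp
    elif printf '%s\n' "$pkgs" | grep -qx 'mlnx-ofa_kernel-modules'; then
        echo non-kmp
    else
        echo none
    fi
}

# Example:
#   rpm -qa --qf '%{NAME}\n' | mofed_flavor
```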


I hope it helps,


Martin

________________________________
From: lustre-discuss 
<[email protected]<mailto:[email protected]>>
 on behalf of Carlos Adean via lustre-discuss 
<[email protected]<mailto:[email protected]>>
Sent: April 22, 2025 11:09 PM
To: [email protected]<mailto:[email protected]>
Cc: Eloir Troyack
Subject: EXT: [lustre-discuss] Installing lustre 2.15.6 server on rhel-8.10 
fails


Hello all,

My current version of RHEL 8 is Rocky Linux 8.10, running the kernel 
4.18.0-553.27.1.el8_10. I also have the OFED drivers version 24.10-2.1.8.0 
installed for the InfiniBand interface (I tried without OFED before).

The installation of "kmod-lustre-2.15.6-1.el8" and 
"kmod-lustre-osd-ldiskfs-2.15.6-1" always shows the warning messages below.

# dnf --nogpgcheck --enablerepo=lustre-server install kmod-lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount lustre lustre-resource-agents
[...]
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol __ib_alloc_pd
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_resolve_addr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_dereg_mr_user
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_reject
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_disconnect
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol __rdma_create_kernel_id
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_register_event_handler
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_resolve_route
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_unregister_event_handler
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_bind_addr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_create_qp
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_map_mr_sg
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_query_port
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_notify
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_listen
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_destroy_qp
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol __ib_create_cq
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_alloc_mr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_connect_locked
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_set_reuseaddr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_destroy_cq_user
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_modify_qp
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_dma_virt_map_sg
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_destroy_id
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_accept
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_dealloc_pd_user
[...]
Installed:
  kernel-core-4.18.0-553.27.1.el8_lustre.x86_64
  kmod-lustre-2.15.6-1.el8.x86_64
  kmod-lustre-osd-ldiskfs-2.15.6-1.el8.x86_64
  lustre-2.15.6-1.el8.x86_64
  lustre-osd-ldiskfs-mount-2.15.6-1.el8.x86_64
  lustre-resource-agents-2.15.6-1.el8.x86_64

Completed!
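
A long run of depmod warnings like the one above is easier to compare between 
attempts when reduced to the list of distinct symbols. A minimal sketch, 
assuming the depmod output is piped in on stdin; missing_symbols is a 
hypothetical name:

```shell
# missing_symbols
# Reads depmod output on stdin and prints each unresolved symbol once,
# sorted bytewise so the result is stable across locales.
missing_symbols() {
    sed -n 's/.*needs unknown symbol \([^[:space:]]*\).*/\1/p' | LC_ALL=C sort -u
}

# Example (depmod prints its warnings on stderr):
#   depmod -a 4.18.0-553.27.1.el8_lustre.x86_64 2>&1 | missing_symbols
```

All of the symbols reported here are ib_* and rdma_* functions, which are 
exported by the InfiniBand core modules, so the list points at the InfiniBand 
stack seen by the Lustre kernel rather than at Lustre itself.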


After rebooting, the server drops into an emergency shell because it can't find 
the LVM devices. This issue only occurs with the Lustre kernel; the other 
installed kernels boot normally.


Any hints on how to proceed?


---
Carlos Adean
www.linea.org.br<https://www.linea.org.br>
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
