Hi all,
Second of three bugs from my personal-lab Lustre master build run on
DGX Spark / Ubuntu 24.04 ARM64 / kernel 6.17.0-1014-nvidia (full context
in the previous mail "[BUG] osd-zfs configure writes wrong KBUILD_EXTRA_SYMBOLS
path against OpenZFS 2.4+").
Summary
-------
Lustre master ./configure --with-o2ib=yes calls a shell helper that
uses readlink to detect the IB compatibility path. When no external IB
packages are present (no MOFED, no external OFED — only in-kernel mlx5),
readlink is invoked without an operand and emits

    readlink: missing operand

repeatedly during configure. Configure proceeds, but the o2ib
auto-detect branch may resolve incorrectly — in particular, it can
report that o2ib needs Compat RDMA when it does not, leading to broken
osd-zfs / ko2iblnd builds on systems with in-kernel IB only. NVIDIA DGX
Spark on Ubuntu 24.04 ARM64 (mlx5 in-tree, no external OFED) is one
such system.
Environment
-----------
Kernel:    6.17.0-1014-nvidia (NVIDIA-signed Ubuntu kernel)
OS:        Ubuntu 24.04.1 LTS ARM64
Lustre:    master @ 805cece6747f442449f32a1d25a8b8a03b230875
Hardware:  NVIDIA DGX Spark (Grace UMA workstation, ARM64)
IB stack:  in-kernel mlx5 only; no ofed-internal* / mlnx-ofa_kernel*
           packages installed
Reproduction
------------
On a system with in-kernel IB and no external OFED packages:
    cd lustre-release
    git checkout 805cece6747f442449f32a1d25a8b8a03b230875
    sh autogen.sh
    ./configure --with-linux=/lib/modules/$(uname -r)/build \
                --with-zfs=/path/to/zfs \
                --disable-ldiskfs \
                --with-o2ib=yes
stderr shows repeated "readlink: missing operand" lines during the
o2ib detection phase. Configure completes, but the o2ib detection may
take the wrong branch.
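To confirm the symptom, one option is to redirect configure's stderr to a file and count the offending lines. This is a minimal sketch: the function name count_missing_operand and the log file name configure.err are illustrative and are not part of the Lustre tree.

```shell
# Hypothetical helper: count "readlink: missing operand" lines in a saved
# configure stderr log. The argument is whatever file stderr was sent to.
count_missing_operand() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so the status is masked for use under 'set -e'.
    grep -c 'readlink: missing operand' "$1" || true
}

# Usage (assumes a prior run of: ./configure ... 2> configure.err):
#   count_missing_operand configure.err
```

A non-zero count on a host with no external OFED packages matches the behavior described here.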
Expected behavior
-----------------
When --with-o2ib=yes is used on a system where no external IB packages
are installed, ./configure should either:
(a) detects in-kernel IB cleanly and proceeds without spurious readlink
errors, or
(b) emits a clear diagnostic suggesting the explicit-path form
(--with-o2ib=/lib/modules/$(uname -r)/build) and exits.
Actual behavior
---------------
readlink is invoked without an operand inside the auto-detect helper
and emits "readlink: missing operand" repeatedly. Configure proceeds
silently past the issue, and the resulting build may be incorrectly
configured for Compat RDMA on a system that doesn't need it.
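The message is what readlink prints when an unquoted, empty variable expands to nothing on its command line. The sketch below reproduces that mechanism in isolation. The variable name candidate is hypothetical and is not taken from the Lustre m4 sources, so treat this as an illustration of the likely failure mode, not a trace of the actual helper.

```shell
# Illustration only: 'candidate' stands in for whatever path variable the
# auto-detect helper computes; the real name is not known here.
candidate=""   # what an empty package lookup would leave behind

# Unquoted and empty: the shell drops the word entirely, so readlink is
# run with no operand at all and complains on stderr.
if unquoted_err=$(readlink $candidate 2>&1); then
    unquoted_rc=0
else
    unquoted_rc=$?
fi

# Quoted and empty: readlink receives a single empty argument instead.
# It still fails, but this is no longer the "no operand" case.
if quoted_err=$(readlink "$candidate" 2>&1); then
    quoted_rc=0
else
    quoted_rc=$?
fi

printf 'unquoted: rc=%s msg=%s\n' "$unquoted_rc" "$unquoted_err"
printf 'quoted:   rc=%s\n' "$quoted_rc"
```

In both cases readlink returns a failure status, which is why a caller that ignores the exit code carries on with an empty result.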
Workaround (verified working)
-----------------------------
Pass the kernel-headers path explicitly instead of letting the auto-detect
run:
    ./configure ... --with-o2ib=/lib/modules/$(uname -r)/build
This bypasses the readlink-based helper entirely. Build completes
cleanly, ko2iblnd loads, and o2ib LNet works against in-kernel mlx5.
Validated end-to-end in the linked reproduce kit.
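Before passing a directory to --with-o2ib, it can help to confirm that it carries the in-tree RDMA headers. This is a minimal sketch: the function name is hypothetical, and the two headers checked are an assumption about what a usable in-kernel IB tree should contain (both exist in mainline kernel sources). It is not a statement of what Lustre's configure itself checks.

```shell
# Hypothetical pre-flight check, not part of the Lustre build system.
# Succeeds if the given kernel build directory exposes the RDMA headers
# include/rdma/ib_verbs.h and include/rdma/rdma_cm.h.
has_inkernel_rdma_headers() {
    dir=$1
    [ -n "$dir" ] &&
    [ -f "$dir/include/rdma/ib_verbs.h" ] &&
    [ -f "$dir/include/rdma/rdma_cm.h" ]
}

# Usage:
#   has_inkernel_rdma_headers "/lib/modules/$(uname -r)/build" \
#       && echo "headers present" || echo "headers missing"
```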
Suggested fix
-------------
Guard the readlink call in the o2ib auto-detect helper with a non-empty
and existence check on the path argument before invoking it, e.g.

    if [ -n "$path" ] && [ -e "$path" ]; then
        target=$(readlink "$path")
    fi

The call is in the o2ib detection block of m4/lustre-build-linux.m4 (or
a related m4 file), where readlink is invoked on a candidate path that
may be empty or nonexistent when no external OFED is present.
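A slightly fuller variant of the guard could also cover option (b) from the expected behavior by printing a hint when there is nothing to resolve. This is only a sketch: resolve_ib_path is a hypothetical name, the messages are invented for illustration, and how such a function would be wired into the m4 macros is left open.

```shell
# Hypothetical wrapper, not a function in the Lustre tree.
# Prints the resolved path on stdout; returns non-zero with a diagnostic
# on stderr when the candidate is empty or does not exist.
resolve_ib_path() {
    path=$1
    if [ -z "$path" ]; then
        # Single quotes keep $(uname -r) literal in the hint text.
        printf '%s\n' 'o2ib: no candidate IB path; consider --with-o2ib=/lib/modules/$(uname -r)/build' >&2
        return 1
    fi
    if [ -L "$path" ]; then
        # Symlink: report its target, as the original readlink call would.
        readlink "$path"
    elif [ -e "$path" ]; then
        # Regular file or directory: nothing to dereference.
        printf '%s\n' "$path"
    else
        printf '%s\n' "o2ib: candidate path '$path' does not exist" >&2
        return 1
    fi
}
```

Because the argument is always quoted and tested first, readlink is never reached with a missing operand, and the caller gets a non-zero status it can act on.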
Reference
---------
Public reproduce kit (build cascade documented end-to-end):
https://github.com/knachiketa04/aihomelab/tree/main/artifacts/training/lustre-on-uma-workstations/reproduce
Thanks,
Kumar
[email protected]
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org