Hi all,

Second of three bugs from my personal-lab Lustre master build run on
DGX Spark / Ubuntu 24.04 ARM64 / kernel 6.17.0-1014-nvidia (full context
in the previous mail "[BUG] osd-zfs configure writes wrong KBUILD_EXTRA_SYMBOLS
path against OpenZFS 2.4+").


Summary
-------

Lustre master ./configure --with-o2ib=yes calls a shell helper that
uses readlink to detect the IB compatibility path. When no external IB
packages are present (no MOFED, no external OFED — only in-kernel mlx5),
readlink is invoked without an operand and emits

  readlink: missing operand

repeatedly during configure. Configure still runs to completion, but the
o2ib auto-detect branch may resolve incorrectly. In particular, it can
report that o2ib needs Compat RDMA when it does not, leading to broken
osd-zfs / ko2iblnd builds on systems with in-kernel IB only. NVIDIA DGX
Spark on Ubuntu 24.04 ARM64 (mlx5 in-tree, no external OFED) is one
such system.


Environment
-----------

  Kernel:    6.17.0-1014-nvidia (NVIDIA-signed Ubuntu kernel)
  OS:        Ubuntu 24.04.1 LTS ARM64
  Lustre:    master @ 805cece6747f442449f32a1d25a8b8a03b230875
  Hardware:  NVIDIA DGX Spark (Grace UMA workstation, ARM64)
  IB stack:  in-kernel mlx5 only; no ofed-internal* / mlnx-ofa_kernel*
             packages installed


Reproduction
------------

On a system with in-kernel IB and no external OFED packages:

  cd lustre-release
  git checkout 805cece6747f442449f32a1d25a8b8a03b230875
  sh autogen.sh
  ./configure --with-linux=/lib/modules/$(uname -r)/build \
              --with-zfs=/path/to/zfs \
              --disable-ldiskfs \
              --with-o2ib=yes

stderr shows repeated "readlink: missing operand" lines during the
o2ib detection phase. Configure completes, but o2ib detection may take
the wrong branch (for example, concluding that Compat RDMA is required).
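
To check how often the message appears, a small sketch like the one below
can count the error lines in a saved stderr log. The function name
count_readlink_errors and the log file name configure.stderr.log are
illustrative only and are not part of the Lustre tree.

```shell
# Hypothetical helper: count "readlink: missing operand" lines in a log file.
# $1: path to a file holding the stderr output of ./configure
count_readlink_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps the function usable under set -e.
    grep -c 'readlink: missing operand' "$1" || true
}

# Usage (the log file name is an assumption):
#   ./configure ... --with-o2ib=yes 2> configure.stderr.log
#   count_readlink_errors configure.stderr.log
```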


Expected behavior
-----------------

When --with-o2ib=yes is used on a system where no external IB packages
are installed, ./configure should either:

  (a) detects in-kernel IB cleanly and proceeds without spurious readlink
      errors, or
  (b) emits a clear diagnostic suggesting the explicit-path form
      (--with-o2ib=/lib/modules/$(uname -r)/build) and exits.


Actual behavior
---------------

readlink is invoked without an operand inside the auto-detect helper
and emits "readlink: missing operand" repeatedly. Configure proceeds
silently past the issue, and the resulting build may be incorrectly
configured for Compat RDMA on a system that doesn't need it.


Workaround (verified working)
-----------------------------

Pass the kernel-headers path explicitly instead of letting the auto-detect
run:

  ./configure ... --with-o2ib=/lib/modules/$(uname -r)/build

This bypasses the readlink-based helper entirely. Build completes
cleanly, ko2iblnd loads, and o2ib LNet works against in-kernel mlx5.
Validated end-to-end in the linked reproduce kit.
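
As a convenience, the explicit-path flag can be generated by a small
wrapper that first checks that the kernel build directory exists. This is
a sketch under assumptions: o2ib_flag is a hypothetical helper, and the
second parameter (modules root) exists only so the function can be tried
against a scratch directory.

```shell
# Hypothetical helper: print the explicit --with-o2ib flag for a kernel.
# $1: kernel release (defaults to the running kernel)
# $2: modules root (defaults to /lib/modules)
o2ib_flag() {
    krel=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    build_dir="$root/$krel/build"
    if [ -d "$build_dir" ]; then
        printf '%s\n' "--with-o2ib=$build_dir"
    else
        # Fail loudly instead of handing configure a path that does not exist.
        echo "kernel build directory not found: $build_dir" >&2
        return 1
    fi
}

# Usage:
#   ./configure ... "$(o2ib_flag)"
```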


Suggested fix
-------------

Guard the readlink call in the o2ib auto-detect helper by checking that
the path argument is non-empty and exists before invoking it, e.g.

  if [ -n "$path" ] && [ -e "$path" ]; then
      target=$(readlink "$path")
  fi

I have not pinned down the exact location. It should be in the o2ib
detection block in m4/lustre-build-linux.m4 (or a related m4 file), where
readlink is invoked on a candidate path that may be empty or may not
exist when no external OFED is present.
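
For illustration, the guard can be wrapped in a small helper. This is a
sketch only: safe_readlink is a hypothetical name, and it assumes the
message comes from an empty, unquoted variable expanding to no argument
at all, which has not been confirmed against the m4 source.

```shell
# With GNU coreutils, an empty unquoted variable leaves readlink with no
# operand, which is one way to get the message seen during configure:
#   candidate=""
#   readlink $candidate      # readlink: missing operand

# Hypothetical guarded wrapper (not a function in the Lustre tree).
# $1: candidate path, possibly empty or nonexistent
safe_readlink() {
    # Call readlink only when the path is non-empty and exists.
    # Otherwise print nothing and return success, so callers get an
    # empty result instead of an error on stderr.
    if [ -n "$1" ] && [ -e "$1" ]; then
        # For an existing path that is not a symlink, readlink prints
        # nothing and returns non-zero.
        readlink "$1"
    fi
}
```

A caller can then test the result for emptiness, e.g.
target=$(safe_readlink "$candidate"), and fall back to the in-kernel
path when target is empty.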


Reference
---------

Public reproduce kit (build cascade documented end-to-end):

  https://github.com/knachiketa04/aihomelab/tree/main/artifacts/training/lustre-on-uma-workstations/reproduce

Thanks,
Kumar
[email protected]
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
