Hi all,
This is the third and last of three bug reports from my personal-lab
build of Lustre master on DGX Spark / Ubuntu 24.04 ARM64 / kernel
6.17.0-1014-nvidia (full context is in the previous mails on
KBUILD_EXTRA_SYMBOLS and the --with-o2ib=yes readlink issue).
This one is a memory-safety issue rather than a build issue, and it
blocks productive use of multi-rail LNet on ARM64 hardware where
multiple QSFP NICs are otherwise available.
Summary
-------
lnetctl (Lustre master 805cece6 / lctl 2.17.52_125_g805cece on ARM64,
glibc on Ubuntu 24.04) trips its stack canary and is aborted by glibc
with
    *** stack smashing detected ***: terminated
when it handles certain errors returned from kernel-side LNet. I've
reproduced this in two distinct cases with different kernel-side
errors, and both terminate lnetctl the same way. The pattern suggests
a missing bounds check or an overrun of a stack-allocated buffer in
lnetctl's error-path code, rather than something specific to either
error.
Case A, adding tcp interface lo (ksocklnd rejection):
    sudo lnetctl net add --net tcp --if lo
  kernel logs:
    config.c:1591:lnet_inet_select() ksocklnd: failed to find UP interface lo
  lnetctl userland:
    *** stack smashing detected ***: terminated
    Aborted (core dumped)
Case B, multi-rail second o2ib add:
    sudo lnetctl net add --net o2ib1 --if enP2p1s0f0np0
  (after o2ib0 is already configured on the primary QSFP NIC; some
  o2ib-add error path is taken on the second-rail attempt)
  lnetctl userland:
    *** stack smashing detected ***: terminated
    Aborted (core dumped)
Two different kernel error returns, same lnetctl userland mishandling.
Environment
-----------
Kernel: 6.17.0-1014-nvidia (NVIDIA-signed Ubuntu kernel)
OS: Ubuntu 24.04.1 LTS ARM64
Lustre: master @ 805cece6747f442449f32a1d25a8b8a03b230875
(lctl 2.17.52_125_g805cece)
glibc: Ubuntu 24.04 stock (with stack-protector enabled)
Hardware: NVIDIA DGX Spark (Grace UMA workstation, ARM64)
2x QSFP NICs (Mellanox in-kernel mlx5)
Reproduction
------------
For case A (simplest):
    # Build + install Lustre master per build instructions, then:
    sudo modprobe lnet
    sudo modprobe ksocklnd
    sudo lnetctl lnet configure
    sudo lnetctl net add --net tcp --if lo
    # Observe stack-smash abort in lnetctl; check dmesg for the
    # ksocklnd "failed to find UP interface lo" message.
For case B, the trigger is configuring a second o2ib network on a
second QSFP NIC after o2ib0 is already up. Same stack-smash abort
pattern in lnetctl.
Expected behavior
-----------------
When the kernel returns an error from a net-add operation, lnetctl
should print a clear diagnostic and exit cleanly with a non-zero
status, with no stack canary trip and no abort.
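
A minimal sketch of how that distinction could be checked from a shell
script. The helper names classify_status and check_cmd are hypothetical,
not part of Lustre. It relies only on the usual shell convention that a
child killed by signal N is reported with exit status 128+N (SIGABRT is
6 on Linux, so an abort shows up as 134).

```shell
# classify_status: interpret an exit status as reported by the shell.
# Assumption: a status above 128 means "killed by signal (status - 128)".
# A program can also call exit() with such a value itself, so treat the
# result as a hint, not proof.
classify_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    elif [ "$rc" -gt 128 ]; then
        echo "killed by signal $((rc - 128))"
    else
        echo "clean error exit ($rc)"
    fi
}

# check_cmd: run the given command, then report how it ended.
check_cmd() {
    "$@"
    classify_status "$?"
}
```

Usage would look like "check_cmd sudo lnetctl net add --net tcp --if lo".
With the abort described in this report, I would expect that to report
signal 6, assuming sudo passes the termination status through; a fixed
lnetctl should report a clean error exit instead.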
Actual behavior
---------------
lnetctl's stack canary fires and glibc terminates the process.
Whether kernel-side state reflects a partial operation depends on the
case: in case A the kernel returns the error before any state change,
while case B may leave partial state behind.
Suggested fix
-------------
Audit lnetctl's error-return-handling code for stack-allocated buffer
sizes — specifically the path that formats / copies error messages
returned from the kernel via the LNet ioctl interface. Likely a
fixed-size on-stack buffer being written past its bounds when the
kernel-returned error string (or some structured field) exceeds the
expected length.
A targeted way to localize: rebuild lnetctl with -D_FORTIFY_SOURCE=2
(if not already on) and run the failing scenarios under valgrind or
ASAN. ASAN should pinpoint the overrun precisely.
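
A rough sketch of that localization step, under stated assumptions. The
helper names and the ASAN_CFLAGS variable are hypothetical; the compiler
flags themselves are standard GCC/Clang options. I have not verified how
Lustre's build system propagates CFLAGS to the userland tools, so the
configure step may need adjusting. Note also that _FORTIFY_SOURCE only
takes effect when compiling with optimization, and that valgrind's
memcheck has limited ability to detect overruns of stack arrays, so
ASAN is probably the better first choice here.

```shell
# Sketch only. Set DRY_RUN=1 to print each command instead of running it.
# The printed form is for reading, not for copy-paste (quoting is lost).
ASAN_CFLAGS="-g -O1 -fsanitize=address -fno-omit-frame-pointer"

run_or_print() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reconfigure the source tree with ASAN enabled for compile and link.
# Pass whatever configure options the existing build already uses as "$@".
configure_with_asan() {
    run_or_print env CFLAGS="$ASAN_CFLAGS" LDFLAGS="-fsanitize=address" \
        ./configure "$@"
}

# Re-run case A using the instrumented lnetctl binary whose path is $1.
rerun_case_a() {
    run_or_print sudo "$1" net add --net tcp --if lo
}
```

After rebuilding, calling rerun_case_a with the path to the instrumented
lnetctl should reproduce the failure. If the overrun is on the stack,
ASAN would be expected to print a stack-buffer-overflow report naming
the function and buffer involved.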
I have only reproduced this on ARM64 with glibc on Ubuntu 24.04. I
have not tested whether the same code path overruns silently on
x86_64, where stack layout and stack-protector coverage may differ.
Either way, the bug appears to be in the userland buffer handling and
is not architecture-specific in principle.
Workarounds
-----------
Case A (tcp/lo): don't add tcp(lo). For single-node Lustre, use the
0@lo NID directly:
    mkfs.lustre --mgsnode=0@lo ...
    mount -t lustre 0@lo:/<fs> <mountpoint>
This sidesteps the broken code path entirely. Validated end-to-end in
the linked reproduce kit.
Case B (multi-rail o2ib1): no clean workaround in this build. Possible
path: configure LNet via modprobe params with full LNet teardown +
reload (untested in my run).
Impact
------
Case A is mostly cosmetic: there is a clean alternative configuration
(0@lo) that avoids the trigger.
Case B is more consequential: it blocks LNet multi-rail configuration
on ARM64 + DGX Spark hardware, leaving the second QSFP NIC's fabric
capacity unused. On a 2-NIC workstation-class platform where
multi-rail could roughly double the available fabric throughput, this
takes the option off the table for anyone tracking master.
Reference
---------
Public write-up + reproduce kit (this bug surfaces in the multi-rail
attempt section):
https://github.com/knachiketa04/aihomelab/tree/main/artifacts/training/lustre-on-uma-workstations/reproduce
Thanks,
Kumar
_______________________________________________
lustre-discuss mailing list
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org