Hi all,

Third and last of three bugs from my personal-lab Lustre master build
run on DGX Spark / Ubuntu 24.04 ARM64 / kernel 6.17.0-1014-nvidia (full
context in the previous mails on KBUILD_EXTRA_SYMBOLS and
--with-o2ib=yes readlink).

This one is a memory-safety issue rather than a build issue, and it
blocks productive use of multi-rail LNet on ARM64 hardware where
multiple QSFP NICs are otherwise available.


Summary
-------

lnetctl (Lustre master 805cece6 / lctl 2.17.52_125_g805cece on ARM64,
Ubuntu 24.04 glibc) fails its stack-protector canary check and aborts
with

  *** stack smashing detected ***: terminated

when a net-add request fails on the kernel side. I've reproduced this
in two distinct cases with different kernel-side errors, and lnetctl
terminates the same way in both. That points to a missing bounds check
or an overrun of a stack-allocated buffer in lnetctl's error-path code,
rather than something specific to either error.

Case A — adding tcp interface lo (ksocklnd rejection):

  sudo lnetctl net add --net tcp --if lo

  kernel logs:
    config.c:1591:lnet_inet_select() ksocklnd: failed to find UP
    interface lo

  lnetctl userland:
    *** stack smashing detected ***: terminated
    Aborted (core dumped)

Case B — multi-rail second o2ib add:

  sudo lnetctl net add --net o2ib1 --if enP2p1s0f0np0

  (after o2ib0 is already configured on the primary QSFP NIC; some
  o2ib-add error path is taken on the second-rail attempt)

  lnetctl userland:
    *** stack smashing detected ***: terminated
    Aborted (core dumped)

Two different kernel error returns, same lnetctl userland mishandling.


Environment
-----------

  Kernel:    6.17.0-1014-nvidia (NVIDIA-signed Ubuntu kernel)
  OS:        Ubuntu 24.04.1 LTS ARM64
  Lustre:    master @ 805cece6747f442449f32a1d25a8b8a03b230875
             (lctl 2.17.52_125_g805cece)
  glibc:     Ubuntu 24.04 stock (with stack-protector enabled)
  Hardware:  NVIDIA DGX Spark (Grace UMA workstation, ARM64)
             2x QSFP NICs (Mellanox in-kernel mlx5)


Reproduction
------------

For case A (simplest):

  # Build + install Lustre master per build instructions, then:
  sudo modprobe lnet
  sudo modprobe ksocklnd
  sudo lnetctl lnet configure
  sudo lnetctl net add --net tcp --if lo

  # Observe stack-smash abort in lnetctl; check dmesg for the
  # ksocklnd "failed to find UP interface lo" message.

For case B, the trigger is configuring a second o2ib network on a
second QSFP NIC after o2ib0 is already up. Same stack-smash abort
pattern in lnetctl.


Expected behavior
-----------------

When the kernel returns an error from a net-add operation, lnetctl
prints a clear diagnostic and exits cleanly with a non-zero status —
no stack canary trip, no abort.
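
One way to check for this in a script is to look at the exit status
the shell reports. The sketch below is illustrative only:
classify_status is a hypothetical helper, and it relies on the usual
shell convention that a process killed by signal N is reported as
128+N (SIGABRT is 6 on Linux, so the abort shows up as 134).

```shell
#!/bin/sh
# Hypothetical helper: classify an exit status as reported by the shell.
classify_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        # 128+N means the process was terminated by signal N.
        echo "killed by signal $((status - 128))"
    else
        echo "clean error exit ($status)"
    fi
}

# Usage for case A (commented out because it needs root and LNet):
#   sudo lnetctl net add --net tcp --if lo; classify_status $?
```

With the current behavior the case A command would be classified as
"killed by signal 6"; with the expected behavior it would be a clean
error exit.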


Actual behavior
---------------

lnetctl's stack canary check fails and the process is aborted through
glibc's __stack_chk_fail (SIGABRT). Whether the kernel is left with
partial state depends on the case: in case A the kernel returns its
error before any state change, while case B may leave partial state
behind.


Suggested fix
-------------

Audit lnetctl's error-handling code for stack-allocated buffer sizes,
specifically the path that formats or copies error messages returned
from the kernel via the LNet ioctl interface. The likely culprit is a
fixed-size on-stack buffer that is written past its bounds when the
kernel-returned error string (or some structured field) exceeds the
expected length.

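A targeted way to localize it: rebuild lnetctl with AddressSanitizer
(-fsanitize=address) and rerun the failing scenarios. ASAN should
pinpoint the overrun precisely. -D_FORTIFY_SOURCE=2 (if not already
on) may also catch it when the overrun goes through a checked libc
call such as strcpy or sprintf. Valgrind's Memcheck is less useful
here, since it does not bounds-check stack arrays.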
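
A rough sketch of that localization step, with assumptions: the helper
names and the SUDO override are hypothetical, the compiler flags are
the standard GCC/Clang ASAN ones, and I am assuming the userland build
honors CFLAGS/LDFLAGS passed to ./configure. None of this has been run
against the Lustre tree.

```shell
#!/bin/sh
# Sketch only: helper names and the SUDO override are hypothetical.
SAN_CFLAGS="-g -O1 -fno-omit-frame-pointer -fsanitize=address"

# Re-run configure with ASAN flags; pass your usual configure
# arguments through unchanged.
configure_with_asan() {
    ./configure CFLAGS="$SAN_CFLAGS" LDFLAGS="-fsanitize=address" "$@"
}

# Run a command with ASAN_OPTIONS set. sudo normally resets the
# environment, so the variable is set via env(1) after sudo.
run_with_asan() {
    ${SUDO-sudo} env ASAN_OPTIONS=abort_on_error=1 "$@"
}

# Alternative without rebuilding: run under gdb and print a backtrace
# once the process stops on SIGABRT.
backtrace_on_abort() {
    ${SUDO-sudo} gdb -batch -ex run -ex bt --args "$@"
}

# Usage (commented out because it needs root and a built tree):
#   configure_with_asan <your usual options> && make
#   run_with_asan <path to rebuilt lnetctl> net add --net tcp --if lo
#   backtrace_on_abort lnetctl net add --net tcp --if lo
```

In the gdb backtrace, the frame that called __stack_chk_fail is the
function whose canary was overwritten, which narrows the audit to that
function's local buffers.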

My reproduction is limited to ARM64 with Ubuntu 24.04's glibc. I have
not tested whether the same code path overruns silently on x86_64,
where stack layout or stack-protector coverage may differ. Either way,
the bug is in the userland buffer handling and is unlikely to be
architecture-specific in principle.


Workarounds
-----------

Case A (tcp/lo): don't add tcp(lo). For single-node Lustre, use the
0@lo NID directly:

  mkfs.lustre --mgsnode=0@lo ...
  mount -t lustre 0@lo:/<fs> <mountpoint>

This sidesteps the broken code path entirely. Validated end-to-end in
the linked reproduce kit.

Case B (multi-rail o2ib1): no clean workaround in this build. Possible
path: configure LNet via modprobe params with full LNet teardown +
reload (untested in my run).
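
A minimal sketch of what that modprobe-based path could look like,
untested like the idea itself. It uses the lnet "networks" module
parameter. lnet_networks_line is a hypothetical helper, the
/etc/modprobe.d/lnet.conf file name is an assumption, and
<primary-if> stands in for the primary NIC's interface name, which is
not given above.

```shell
#!/bin/sh
# Hypothetical helper: print a modprobe.d line for the lnet "networks"
# parameter from net:interface pairs.
lnet_networks_line() {
    nets=""
    for pair in "$@"; do
        net=${pair%%:*}      # text before the first ':'
        ifname=${pair#*:}    # text after the first ':'
        nets="${nets:+$nets,}${net}(${ifname})"
    done
    printf 'options lnet networks="%s"\n' "$nets"
}

# Usage (commented out because it needs root and unloads Lustre/LNet):
#   lnet_networks_line "o2ib0:<primary-if>" "o2ib1:enP2p1s0f0np0" \
#       | sudo tee /etc/modprobe.d/lnet.conf
#   sudo lustre_rmmod          # full teardown of Lustre/LNet modules
#   sudo modprobe lnet
#   sudo lctl network up       # bring LNet up from module parameters
```

Whether this avoids the failing lnetctl code path in case B is an open
question.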


Impact
------

Case A is mostly cosmetic — there's a clean alternative configuration
(0@lo) that avoids the trigger.

Case B is more consequential — it blocks LNet multi-rail configuration
on ARM64 + DGX Spark hardware, where the second QSFP NIC's fabric
capacity is otherwise wasted. For a 2-NIC workstation-class platform
where multi-rail would roughly double the fabric throughput, this
takes the option off the table for anyone tracking master.


Reference
---------

Public write-up + reproduce kit (this bug surfaces in the multi-rail
attempt section):

  https://github.com/knachiketa04/aihomelab/tree/main/artifacts/training/lustre-on-uma-workstations/reproduce

Thanks,
Kumar
_______________________________________________
lustre-discuss mailing list
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
