Hi all,

I'd like to report a kernel panic we encountered at a customer site running 
Lustre 2.15.6 (also reproduced on 2.17.0).

The issue has been filed on Whamcloud Jira: LU-20227

---

Summary

When a Lustre snapshot MDT is mounted in a DNE configuration, lod_statfs() 
iterates sub-MDT OSPs via lod_foreach_mdt. If a sub-MDT OSP returns rc=0 with 
an uninitialized opd_statfs cache (os_bsize=0), lod_statfs_sum() silently 
shifts sfs->os_bsize down to zero.
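
The shift-to-zero behavior can be illustrated with a simplified userspace 
model. This is not the Lustre source: struct model_statfs and 
model_statfs_sum() are hypothetical names, and the sketch only assumes, per the 
description above, that the aggregator halves the running block size until it 
is no larger than the block size reported by the sub-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few statfs fields that matter here. */
struct model_statfs {
    uint32_t os_bsize;   /* block size in bytes */
    uint64_t os_blocks;  /* total blocks, in units of os_bsize */
};

/*
 * Simplified model of block-size normalisation during aggregation
 * (an assumption, not copied from lod_statfs_sum()): both sides are
 * brought down to the smaller block size, then block counts are added.
 */
static void model_statfs_sum(struct model_statfs *sum,
                             const struct model_statfs *tgt)
{
    assert(sum != NULL && tgt != NULL);

    uint32_t tgt_bsize = tgt->os_bsize;
    uint64_t tgt_blocks = tgt->os_blocks;

    /* With tgt->os_bsize == 0 this loop only stops once sum->os_bsize is 0. */
    while (tgt_bsize < sum->os_bsize) {
        sum->os_bsize >>= 1;
        sum->os_blocks <<= 1;
    }

    /* Scale the target down if its block size is the larger one. */
    while (sum->os_bsize != 0 && tgt_bsize > sum->os_bsize) {
        tgt_bsize >>= 1;
        tgt_blocks <<= 1;
    }

    sum->os_blocks += tgt_blocks;
}
```

In this model, folding in a reply with os_bsize=0 leaves the running sum with 
os_bsize=0 and no error is reported, which matches the symptom described above.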

This corrupted value is sent to the client with OS_STATFS_SUM set. When project 
quota is active and a block hard limit is set, ll_statfs_project() divides by 
sfs->f_bsize=0, triggering a CPU #DE (Divide Error) exception and kernel 
panic.

The panic was triggered simply by running the Linux du command on a directory 
under the snapshot mount.

---

Environment

  - Lustre: 2.15.6 / 2.17.0 (2.15.6 is in production; 2.17.0 was also tested)
  - Kernel: 4.18.0-553.5.1
  - Backend FS: ZFS
  - Topology: MGS, MDT x3, OST x5, project quota enabled

---

Reproduction Conditions (all five must hold)

  1. Snapshot client mounted
  2. Snapshot MDT has sub-MDTs (DNE configuration)
  3. Sub-MDT OSP opd_statfs cache uninitialized (os_bsize=0)
  4. Project quota enabled
  5. Project block hard limit > 0 on the target directory

---

Root Cause

BUG 1 (server): lod_foreach_mdt lacks the os_bsize==0 guard that 
lod_foreach_ost already has. The uninitialized os_bsize silently corrupts 
sfs->os_bsize to zero.

BUG 2 (client): ll_statfs_project() unconditionally divides by sfs->f_bsize 
without validating it is non-zero.

---

Proposed Fix

Server side (main fix), in the lod_foreach_mdt loop:

  - if (rc)
  + if (rc || ost_sfs.os_bsize == 0)
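
As a rough userspace sketch of what this guard achieves (not the actual LOD 
code), the loop below skips any reply that either failed or carries a zero 
block size. The names struct tgt_result, tgt_statfs_unusable() and 
aggregate_statfs() are hypothetical, and summing capacity in bytes is a 
simplification chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-target statfs reply. */
struct tgt_result {
    int rc;              /* 0 on success, negative errno otherwise */
    uint32_t os_bsize;   /* block size in bytes; 0 means cache not filled */
    uint64_t os_blocks;  /* total blocks, in units of os_bsize */
};

/* Mirrors the proposed condition: rc || os_bsize == 0. */
static int tgt_statfs_unusable(const struct tgt_result *r)
{
    return r->rc != 0 || r->os_bsize == 0;
}

/*
 * Sum capacity in bytes over all usable replies and return the smallest
 * valid block size seen. Returns 0 only if no reply was usable.
 */
static uint32_t aggregate_statfs(const struct tgt_result *res, size_t n,
                                 uint64_t *total_bytes)
{
    uint32_t bsize = 0;

    assert(total_bytes != NULL);
    *total_bytes = 0;

    for (size_t i = 0; i < n; i++) {
        if (tgt_statfs_unusable(&res[i]))
            continue;  /* skip failed or uninitialized replies */
        if (bsize == 0 || res[i].os_bsize < bsize)
            bsize = res[i].os_bsize;
        *total_bytes += (uint64_t)res[i].os_bsize * res[i].os_blocks;
    }
    return bsize;
}
```

With the guard in place, a target whose cache is not yet filled contributes 
nothing to the sum instead of dragging the block size to zero.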

Client side (defensive):
  Add an f_bsize == 0 guard in ll_statfs_project() before the division, 
  falling back to 1.
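
A minimal sketch of such a guard, written as a standalone helper rather than 
as a patch to ll_statfs_project(). The helper name quota_limit_to_blocks() is 
hypothetical, and the sketch assumes the quota limit is expressed in KiB.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a quota limit given in KiB (assumption)
 * into a count of filesystem blocks of f_bsize bytes.
 */
static uint64_t quota_limit_to_blocks(uint64_t limit_kib, uint32_t f_bsize)
{
    /* The multiplication below must not overflow 64 bits. */
    assert(limit_kib <= UINT64_MAX / 1024);

    /* Defensive fallback: never divide by a zero block size. */
    if (f_bsize == 0)
        f_bsize = 1;

    return limit_kib * 1024 / f_bsize;
}
```

Falling back to 1 keeps the division defined; the resulting block count is 
then simply the limit in bytes, which is large but harmless compared with a 
panic.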