Hello Olu,
from the log you provided, nothing seems to be faulty... but that does not mean there is no issue ...
 
If you think it is a GPFS problem, start a GPFS trace on a sample node that hits this problem again and again, capture the trace, and provide that data to IBM.
I suggest opening a PMR with IBM and collecting a gpfs.snap ...
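Roughly along these lines; the node name (client47) is only a placeholder and the default trace settings are assumed:

    # start tracing on the affected client, reproduce the mount attempt, then stop
    mmtracectl --start -N client47
    mmtracectl --stop -N client47
    # collect diagnostic data to attach to the PMR
    gpfs.snap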
 
Personally, I would start debugging the node... make journalctl persistent
and start from there ...
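For example, via the standard systemd steps (nothing GPFS-specific here):

    # keep the journal across the power resets
    mkdir -p /var/log/journal
    # in /etc/systemd/journald.conf set: Storage=persistent
    systemctl restart systemd-journald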
 
It smells a bit like a network problem related to RDMA/OFED... do you use the same OFED version as in the cluster that works fine?
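A quick way to compare a good node with a bad one might be (the mmlsconfig attribute names assume the usual verbs RDMA setup):

    ofed_info -s                      # installed MLNX_OFED release
    ibstat                            # HCA firmware level and port state
    mmlsconfig verbsRdma verbsPorts   # what Spectrum Scale expects to use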
 
 
 
----- Original message -----
From: "Saula, Oluwasijibomi" <oluwasijibomi.sa...@ndsu.edu>
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: "gpfsug-discuss@spectrumscale.org" <gpfsug-discuss@spectrumscale.org>
Cc:
Subject: [EXTERNAL] [gpfsug-discuss] Filesystem mount attempt hangs GPFS client node
Date: Mon, Mar 29, 2021 8:38 PM
 
Hello Folks,
 
So we are experiencing a mind-boggling issue where just a couple of nodes in our cluster, at GPFS boot up, get hung so badly that the node must be power reset.
 
These AMD client nodes are diskless in nature and have at least 256G of memory. We have other AMD nodes that are working just fine in a separate GPFS cluster albeit on RHEL7.
 
Just before GPFS (or related processes) seize up the node, the following lines of /var/mmfs/gen/mmfslog are noted:
 

2021-03-29_12:47:37.343-0500: [N] mmfsd ready
2021-03-29_12:47:37.426-0500: mmcommon mmfsup invoked. Parameters: 10.12.50.47 10.12.50.242 all
2021-03-29_12:47:37.587-0500: mounting /dev/mmfs1
2021-03-29_12:47:37.590-0500: [I] Command: mount mmfs1
2021-03-29_12:47:37.859-0500: [N] Connecting to 10.12.50.243 tier1-sn-02.pixstor <c0n2>
2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 (tier1-sn-01.pixstor) on mlx5_0 port 1 fabnum 0 sl 0 index 0
2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 (tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1
2021-03-29_12:47:37.866-0500: [I] VERBS RDMA connected to 10.12.50.242 (tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 0
2021-03-29_12:47:37.867-0500: [I] VERBS RDMA connected to 10.12.50.242 (tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1
2021-03-29_12:47:37.868-0500: [I] Connected to 10.12.50.243 tier1-sn-02 <c0n2>

There have been hunches that this might be a network issue; however, other nodes connected to the same IB switch are mounting the filesystem without incident.
 
I'm inclined to believe there's a GPFS/OS-specific setting causing these crashes, especially since disabling the automount on the client node keeps it from hanging. However, once we issue mmmount, the node seizes up shortly after...
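For reference, the automount we toggle is the filesystem-level option, roughly along these lines:

    mmlsfs mmfs1 -A      # show the current automatic-mount setting
    mmchfs mmfs1 -A no   # keep mmfs1 from mounting at GPFS startup while debugging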
 
Please let me know if you have any thoughts on where to look for root-causes as I and a few fellows are stuck here 🙁
 
 
 

 

Thanks,

Oluwasijibomi (Siji) Saula
HPC Systems Administrator  /  Information Technology
Research 2 Building 220B / Fargo ND 58108-6050
p: 701.231.7749 / www.ndsu.edu
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss