Hi Felipe!
It's me once again. Thank you very much for the hint.
I did not open a PMR yet because I fear they will ask me/us if we are crazy ☹
I have not told the full story yet.
We have a 3-node cluster: two NSD servers, o1 and o2 (same site), and g1 (a different
site), all on RHEL 8.7.
All of them are VMware VMs.
o1 and o2 each have 4 NVMe drives passed through; a software RAID 5 is built over these
NVMes, and from that we made a single NSD per server for a filesystem fs4vm (m,r=2).
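Roughly, the underlying setup was created like this (a sketch from memory; device names such as /dev/md0 and /dev/sdX are simplified and may differ on the real systems):

  # on o1 and o2: software RAID 5 over the four passed-through NVMe drives
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

  # NSD stanza file: one data/metadata NSD per server, plus a descOnly disk on the quorum node
  %nsd: device=/dev/md0 nsd=ogpfs1_1 servers=ogpfs1-hs.local usage=dataAndMetadata failureGroup=1 pool=system
  %nsd: device=/dev/md0 nsd=ogpfs2_1 servers=ogpfs2-hs.local usage=dataAndMetadata failureGroup=2 pool=system
  %nsd: device=/dev/sdX nsd=ggpfsq_qdisk servers=ggpfsq.mgmt.cloudia usage=descOnly failureGroup=-1

  mmcrnsd -F nsd.stanza
  mmcrfs fs4vm -F nsd.stanza -m 2 -r 2      # metadata and data replication = 2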
[root@ogpfs1 ras]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: edvdesign-cluster.local
GPFS cluster id: 12147978822727803186
GPFS UID domain: edvdesign-cluster.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
 Node  Daemon node name     IP address       Admin node name      Designation
 ------------------------------------------------------------------------------------
    1  ogpfs1-hs.local      10.20.30.1       ogpfs1-hs.local      quorum-manager-perfmon
    2  ogpfs2-hs.local      10.20.30.2       ogpfs2-hs.local      quorum-manager-perfmon
    3  ggpfsq.mgmt.cloudia  xxxx.other.net   ggpfsq.mgmt.cloudia  quorum-perfmon
[root@ogpfs1 ras]# mmlsconfig
Configuration data for cluster edvdesign-cluster.local:
-------------------------------------------------------
clusterName edvdesign-cluster.local
clusterId 12147978822727803186
autoload yes
profile gpfsProtocolRandomIO
dmapiFileHandleSize 32
minReleaseLevel 5.1.6.0
tscCmdAllowRemoteConnections no
ccrEnabled yes
cipherList AUTHONLY
sdrNotifyAuthEnabled yes
maxblocksize 16M
[cesNodes]
maxMBpS 5000
numaMemoryInterleave yes
enforceFilesetQuotaOnRoot yes
workerThreads 512
[common]
tscCmdPortRange 60000-61000
[srv]
verbsPorts mlx5_0/1 mlx5_1/1
[common]
cesSharedRoot /fs4vmware/cesSharedRoot
[srv]
maxFilesToCache 10000
maxStatCache 20000
[common]
verbsRdma enable
[ggpfsq]
verbsRdma disable
[common]
verbsRdmaSend yes
[ggpfsq]
verbsRdmaSend no
[common]
verbsRdmaCm enable
[ggpfsq]
verbsRdmaCm disable
[srv]
pagepool 32G
[common]
adminMode central
File systems in cluster edvdesign-cluster.local:
------------------------------------------------
/dev/fs4vm
[root@ogpfs1 ras]# mmlsdisk fs4vm -L
disk         driver   sector     failure holds    holds                                    storage
name         type       size       group metadata data  status        availability disk id pool        remarks
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------- ----------- ---------
ogpfs1_1     nsd         512           1 yes      yes   ready         up                 1 system      desc
ogpfs2_1     nsd         512           2 yes      yes   ready         up                 2 system      desc
ggpfsq_qdisk nsd         512          -1 no       no    ready         up                 3 system      desc
Number of quorum disks: 3
Read quorum value: 2
Write quorum value: 2
And the two nodes o1 and o2 export the filesystem via CES NFS (for VMware).
I suspect this is not supported, that an NSD server is also a CES node?
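(The CES role assignment is visible with the commands below; I have not pasted the full output here:)

  mmlscluster --ces        # lists the CES nodes and their CES IP addresses
  mmces service list -a    # shows which protocol services (NFS, SMB, ...) run on which CES node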
And finally the RDMA network:
Both NSD servers also have a Mellanox ConnectX-6 Lx dual-port 25 Gb adapter, also via
passthrough, and we configured these interfaces for RDMA (RoCE).
Last but not least: this network is not switched but direct-attached (2x25 Gb, connected
directly between the NSD nodes).
RDMA Connections between nodes:
  Fabric 0 - Device mlx5_0 Port 1 Width 1x Speed EDR lid 0
   hostname         idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR)  VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT
   ogpfs2-hs.local    0 Y  RTS   (Y)256  478202 (0 )        12728       67024 8864789(0 )     22776      4643             0              0
  Fabric 0 - Device mlx5_1 Port 1 Width 1x Speed EDR lid 0
   hostname         idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR)  VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT
   ogpfs2-hs.local    1 Y  RTS   (Y)256  477659 (0 )        12489       67034 8864773(0 )     22794      4639             0              0
[root@ogpfs1 ras]#
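(For reference, the RoCE-relevant settings from the mmlsconfig above boil down to something like this; a sketch from memory, the node-class names may not be exact:)

  mmchconfig verbsRdma=enable,verbsRdmaSend=yes,verbsRdmaCm=enable
  mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1" -N srv                               # the two 25 Gb ports on the NSD servers
  mmchconfig verbsRdma=disable,verbsRdmaSend=no,verbsRdmaCm=disable -N ggpfsq    # quorum node has no RDMA adapter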
You mentioned that it might be CPU contention: maybe due to the VM layer (scheduling
against other VMs)? Or a poor layout of the VMs (8 vCPUs and 64 GB memory; the ESXi
hosts are single-socket with 32/64 cores HT)?
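(To rule that out, I will also watch for hypervisor-side contention from inside the guests; a minimal check, nothing GPFS-specific:)

  vmstat 5
  # the 'st' column is CPU time stolen by the hypervisor while the vCPU was runnable;
  # persistently non-zero values there would point at scheduling contention on the ESXi host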
And is the direct-attached RDMA (+ daemon) network also a bad idea?
Do you think IBM would refuse to look at such a configuration?
Best regards
Walter
From: gpfsug-discuss <[email protected]> On Behalf Of Felipe Knop
Sent: Wednesday, 15 February 2023 15:59
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded
Walter,
Thanks for the details.
The stack trace below captures the lease thread in the middle of sending the
“lease” RPC. This operation is normally non-blocking, and we do not often block
while sending the RPC. But the stack trace does not show whether anything was
blocking the thread before the point where the RPC is sent.
At a first glance:
2023-02-14_19:44:07.430+0100: [W] counter: 0 (mark-idle: 0 mark-active: 0
pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 10)
I believe nivcsw: 10 means that the thread was scheduled out of the CPU
involuntarily, possibly indicating that there is some CPU contention going on.
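(For reference, the same counters can be watched at the OS level; 7294 is the thread ID from the watchdog message above, and pidstat requires the sysstat package:)

  grep ctxt_switches /proc/7294/status      # voluntary / nonvoluntary context switches of that thread
  pidstat -wt -p $(pidof mmfsd) 5           # per-thread context-switch rates, sampled every 5 seconds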
Could you open a case to get debug data collected? If the problem can be
recreated, I think we’ll need a recreate of the problem with traces enabled.
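(Roughly, the collection would look like the sketch below; the support engineer will give you the exact trace levels and node list:)

  mmtracectl --set --trace=def --trace-recycle=global -N all
  mmtracectl --start -N all
  # ... wait for the watchdog message to appear again ...
  mmtracectl --stop -N all
  gpfs.snap                                 # packages logs, traces, and configuration for the case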
Thanks,
Felipe
----
Felipe Knop [email protected]<mailto:[email protected]>
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
From: gpfsug-discuss
<[email protected]<mailto:[email protected]>>
on behalf of Walter Sklenka
<[email protected]<mailto:[email protected]>>
Reply-To: gpfsug main discussion list
<[email protected]<mailto:[email protected]>>
Date: Wednesday, February 15, 2023 at 4:23 AM
To: gpfsug main discussion list
<[email protected]<mailto:[email protected]>>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded
Hi!
This is a "full" sequence in mmfs.log.latest.
Fortunately this was also the last event until now (yesterday evening).
Maybe you can have a look?
2023-02-14_19:43:51.474+0100: [N] Disk lease period expired 0.030 seconds ago
in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_19:44:07.430+0100: [W] ------------------[GPFS Critical Thread Watchdog]------------------
2023-02-14_19:44:07.430+0100: [W] PID: 7294 State: R (DiskLeaseThread) is overloaded for more than 8 seconds
2023-02-14_19:44:07.430+0100: [W] counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 10)
2023-02-14_19:44:07.430+0100: [W] Call Trace(PID: 7294):
2023-02-14_19:44:07.431+0100: [W] #0: 0x000055CABE4A56AB NodeConn::sendMessage(TcpConn**, iovec*, int, unsigned char, int, int, int, unsigned int, DestTag*, int*, unsigned long long*, unsigned long long*, unsigned int*, CondvarName, vsendCallback_t*) + 0x42B at ??:0
2023-02-14_19:44:07.432+0100: [W] #1: 0x000055CABE4A595F llc_send_msg(ClusterConfiguration*, NodeAddr, iovec*, int, unsigned char, int, int, int, unsigned int, DestTag*, int*, TcpConn**, unsigned long long*, unsigned long long*, unsigned int*, CondvarName, vsendCallback_t*, int, unsigned int) + 0xDF at ??:0
2023-02-14_19:44:07.437+0100: [W] #2: 0x000055CABE479A55 MsgRecord::send() + 0x1345 at ??:0
2023-02-14_19:44:07.438+0100: [W] #3: 0x000055CABE47A169 tscSendInternal(ClusterConfiguration*, unsigned int, unsigned char, int, int, NodeAddr*, TscReply*, TscScatteredBuff*, int, int (*)(void*, ClusterConfiguration*, int, NodeAddr*, TscReply*), void*, ChainedCallback**, __va_list_tag*) + 0x339 at ??:0
2023-02-14_19:44:07.439+0100: [W] #4: 0x000055CABE47C39A tscSendWithCallback(ClusterConfiguration*, unsigned int, unsigned char, int, NodeAddr*, TscReply*, int (*)(void*, ClusterConfiguration*, int, NodeAddr*, TscReply*), void*, void**, int, ...) + 0x1DA at ??:0
2023-02-14_19:44:07.440+0100: [W] #5: 0x000055CABE5F9853 MyLeaseState::renewLease(NodeAddr, TickTime) + 0x6E3 at ??:0
2023-02-14_19:44:07.440+0100: [W] #6: 0x000055CABE5FA682 ClusterConfiguration::checkAndRenewLease(TickTime) + 0x192 at ??:0
2023-02-14_19:44:07.441+0100: [W] #7: 0x000055CABE5FAAC6 ClusterConfiguration::RunLeaseChecks(void*) + 0x366 at ??:0
2023-02-14_19:44:07.441+0100: [W] #8: 0x000055CABDF2B662 Thread::callBody(Thread*) + 0x42 at ??:0
2023-02-14_19:44:07.441+0100: [W] #9: 0x000055CABDF18680 Thread::callBodyWrapper(Thread*) + 0xA0 at ??:0
2023-02-14_19:44:07.441+0100: [W] #10: 0x00007F3B7563D1CA start_thread + 0xEA at ??:0
2023-02-14_19:44:07.441+0100: [W] #11: 0x00007F3B7435BE73 __GI___clone + 0x43 at ??:0
2023-02-14_19:44:10.512+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_19:44:10.512+0100: [N] Disk lease period expired 7.970 seconds ago
in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_19:44:12.563+0100: [N] Disk lease reacquired in cluster xxx-cluster.
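(If it helps, I can also send the lease timing parameters of the cluster; I would pull them with something like the following:)

  mmdiag --config | grep -iE 'lease|failureDetection'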
Thank you very much!
Best regards
Walter
From: gpfsug-discuss
<[email protected]<mailto:[email protected]>>
On Behalf Of Felipe Knop
Sent: Wednesday, 15 February 2023 00:06
To: gpfsug main discussion list
<[email protected]<mailto:[email protected]>>
Subject: Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded
All,
These messages like
[W] ------------------[GPFS Critical Thread Watchdog]------------------
indicate that a “critical thread”, in this case the lease thread, was
apparently blocked for longer than expected. This is usually not caused by
delays in the network, but possibly by excessive CPU load, blockage while
accessing the local file system, or possible mutex contention.
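(If you manage to catch an occurrence live, a quick snapshot of the waiters on the affected node helps to distinguish these cases, for example:)

  mmdiag --waiters      # lists threads currently waiting on mutexes/conditions and for how long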
Do you have other samples of the message, with a more complete stack trace?
Or was the instance below the only one?
Felipe
----
Felipe Knop [email protected]<mailto:[email protected]>
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
From: gpfsug-discuss
<[email protected]<mailto:[email protected]>>
on behalf of Walter Sklenka
<[email protected]<mailto:[email protected]>>
Reply-To: gpfsug main discussion list
<[email protected]<mailto:[email protected]>>
Date: Tuesday, February 14, 2023 at 10:49 AM
To: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded
Hi!
I started with 5.1.6.0 and am now at:
[root@ogpfs1 ~]# mmfsadm dump version
Dump level: verbose
Build branch "5.1.6.1".
The messages have been appearing from the beginning.
From: gpfsug-discuss
<[email protected]<mailto:[email protected]>>
On Behalf Of Christian Vieser
Sent: Tuesday, 14 February 2023 15:34
To: [email protected]<mailto:[email protected]>
Subject: Re: [gpfsug-discuss] Reasons for DiskLeaseThread Overloaded
What version of Spectrum Scale is running there? Do these errors appear since
your last version update?
Am 14.02.23 um 14:09 schrieb Walter Sklenka:
Dear Collegues!
May I ask if anyone has a hint what could be the reason for Critical Thread
Watchdog warnings for the disk lease thread?
Is this a "local node" problem or a network problem?
I sometimes see these messages when the NSD servers, which also serve as NFS
servers, come under heavy NFS load.
Following is an excerpt from mmfs.log.latest
2023-02-14_12:06:53.235+0100: [N] Disk lease period expired 0.040 seconds ago
in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_12:06:53.600+0100: [W] ------------------[GPFS Critical Thread
Watchdog]------------------
2023-02-14_12:06:53.600+0100: [W] PID: 7294 State: R (DiskLeaseThread) is
overloaded for more than 8 seconds
2023-02-14_12:06:53.600+0100: [W] counter: 0 (mark-idle: 0 mark-active: 0
pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 8)
2023-02-14_12:06:53.600+0100: [W] Call Trace(PID: 7294):
2023-02-14_12:06:53.600+0100: [W] #0: 0x000055CABDF49521 BaseMutexClass::release() + 0x12 at ??:0
2023-02-14_12:06:53.600+0100: [W] #1: 0xB1557721BBABD900 _etext + 0xB154F7E646041C0E at ??:0
2023-02-14_12:07:09.554+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_12:07:09.554+0100: [N] Disk lease period expired 5.680 seconds ago
in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_12:07:11.605+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_12:10:55.990+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_12:10:55.990+0100: [I] Command: successful mmlspool /dev/fs4vm all
-L -Y
2023-02-14_12:30:58.756+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_12:30:58.756+0100: [I] Command: successful mmlspool /dev/fs4vm all
-L -Y
2023-02-14_13:10:55.988+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_13:10:55.989+0100: [I] Command: successful mmlspool /dev/fs4vm all
-L -Y
2023-02-14_13:21:40.892+0100: [N] Node 10.20.30.2 (ogpfs2-hs.local) lease
renewal is overdue. Pinging to check if it is alive
2023-02-14_13:21:40.892+0100: [I] The TCP connection to IP address 10.20.30.2
ogpfs2-hs.local <c0n1>:[1] (socket 106) state: state=1 ca_state=0 snd_cwnd=10
snd_ssthresh=2147483647 unacked=0 probes=0 backoff=0 retransmits=0 rto=201000
rcv_ssthresh=1219344 rtt=121 rttvar=69 sacked=0 retrans=0 reordering=3 lost=0
2023-02-14_13:22:00.220+0100: [N] Disk lease period expired 0.010 seconds ago
in cluster xxx-cluster. Attempting to reacquire the lease.
2023-02-14_13:22:08.298+0100: [N] Disk lease reacquired in cluster xxx-cluster.
2023-02-14_13:30:58.760+0100: [I] Command: mmlspool /dev/fs4vm all -L -Y
2023-02-14_13:30:58.760+0100: [I] Command: successful mmlspool /dev/fs4vm all
-L -Y
Kind regards
Walter Sklenka
Technical Consultant
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org