I have seen these on our cluster after the IB network goes down (GPFS still runs over ethernet) and then comes back up. They seem to retry forever, even after the IB is healthy again. The effect seems to be that verbs connections between some nodes break and GPFS uses ethernet/IPoIB instead. You may see messages in your mmfs.log.latest about verbs being disabled "due to too many errors". You can also see fewer verbs connections between nodes in "mmfsadm test verbs conn" output.
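To find which nodes are carrying these waiters, something like the following rough sketch may help. It assumes you have already collected waiter lines as text, one per line and prefixed with the node name, in the same shape as the line quoted in this thread. The function name and the regular expression are my own illustration, not anything shipped with GPFS, and other releases may format waiters differently.

```python
import re
from collections import defaultdict

# Pattern modelled on the single waiter line quoted in this thread.
# Assumption: each waiter is on one line and starts with "<node>:".
WAITER_RE = re.compile(
    r"^(?P<node>\S+):\s+Waiting\s+(?P<secs>[\d.]+)\s+sec\b.*?"
    r"thread\s+(?P<thread>\d+)\s+VerbsReconnectThread\b"
)


def nodes_with_reconnect_waiters(lines):
    """Return {node: [wait_seconds, ...]} for VerbsReconnectThread waiters."""
    found = defaultdict(list)
    for line in lines:
        match = WAITER_RE.search(line.strip())
        if match:
            found[match.group("node")].append(float(match.group("secs")))
    return dict(found)


# Example input: the waiter from this thread, joined onto one line.
sample = [
    "gss01: Waiting 16.8543 sec since 09:07:02, ignored, thread 46230 "
    "VerbsReconnectThread: delaying for 43.145624000 more seconds, "
    "reason: delaying for next reconnect attempt",
]

for node, waits in sorted(nodes_with_reconnect_waiters(sample).items()):
    print(f"{node}: {len(waits)} VerbsReconnectThread waiter(s)")
```

The nodes it reports are the candidates for a closer look at mmfs.log.latest and the verbs connection list.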
Restarting GPFS on the nodes with waiters has fixed the issue for me. I don't know if IBM has any other tricks to fix this without a restart.

--Joey

On 9/12/19 8:16 AM, Damir Krstic wrote:
> On my cluster I have seen couple of long waiters such as this:
>
> gss01: Waiting 16.8543 sec since 09:07:02, ignored, thread 46230
> VerbsReconnectThread: delaying for 43.145624000 more
> seconds, reason: delaying for next reconnect attempt
>
> I tried searching on gpfs wiki for this type of waiter, but was unable to
> find anything of value.
>
> Is this something to pay attention to, and what does this waiter mean?
>
> Thank you.
> Damir
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
