I have seen these on our cluster after the IB network goes down (GPFS still
runs over Ethernet) and then comes back up. It seems they will retry
forever, even after the IB is healthy again. The effect they seem to have
is that verbs connections between some nodes break and GPFS uses
Ethernet/IPoIB instead. You may see messages in your mmfs.log.latest about
verbs being disabled "due to too many errors". You can also see fewer verbs
connections between nodes in "mmfsadm test verbs conn" output.
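
If you want to check a node's log for this quickly, here is a rough Python
sketch. It is only an illustration, not anything from IBM: the pattern is
built from the phrase quoted above and the exact log wording may differ
between releases, and the default path /var/adm/ras/mmfs.log.latest is an
assumption about where the log lives on your nodes. The function name
find_verbs_disabled is made up for this example.

```python
import re
import sys
from pathlib import Path

# Assumption: usual log location; pass another path as the first argument.
DEFAULT_LOG = "/var/adm/ras/mmfs.log.latest"

# Assumption: built from the phrase quoted in this mail; the real message
# text may be worded differently on your GPFS release.
PATTERN = re.compile(r"verbs.*too many errors", re.IGNORECASE)


def find_verbs_disabled(lines):
    """Return the lines that mention verbs together with 'too many errors'."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    path = Path(sys.argv[1] if len(sys.argv) > 1 else DEFAULT_LOG)
    if path.is_file():
        # errors="replace" keeps odd bytes in the log from aborting the scan
        with path.open(errors="replace") as fh:
            for hit in find_verbs_disabled(fh):
                print(hit)
    else:
        print(f"log not found: {path}", file=sys.stderr)
```

Any node that prints matches is a candidate for the restart described below.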

Restarting GPFS on the nodes with waiters has fixed the issue for me. I don't
know if IBM has any other tricks to fix this without a restart.

--Joey


On 9/12/19 8:16 AM, Damir Krstic wrote:
> On my cluster I have seen couple of long waiters such as this:
>
> gss01: Waiting 16.8543 sec since 09:07:02, ignored, thread 46230 VerbsReconnectThread: delaying for 43.145624000 more seconds, reason: delaying for next reconnect attempt
>
> I tried searching on gpfs wiki for this type of waiter, but was unable to 
> find anything of value.
>
> Is this something to pay attention to, and what does this waiter mean?
>
> Thank you.
> Damir
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
