That's a fascinating bug. When the node is locked up, what does "mmdiag
--waiters" show from the node in question? I suspect there's more
low-level diagnostic data that would be helpful for the gurus at IBM, but
I'm just curious what the waiters look like.
-Aaron
On 6/26/17 3:49 AM, CAPIT, NICOLAS wrote:
Hello,
I don't know whether this behavior/bug has already been reported on this
mailing list, so I am posting it just in case.
Context:
- SpectrumScale 4.2.2-3
- client node with 64 cores
- OS: RHEL7.3
When an MPI job with 64 processes is launched on the node with 64 cores,
the FS freezes (only the MPI job's output log file is written to
GPFS, so it may be related to the 64 processes writing to the same
file?).
strace -p 3105 # mmfsd pid, stuck
Process 3105 attached
wait4(-1, # stuck at this point
strace ls /gpfs
stat("/gpfs", {st_mode=S_IFDIR|0755, st_size=131072, ...}) = 0
openat(AT_FDCWD, "/gpfs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC
# stuck at this point
I have no problem with the other nodes, which have 28 cores.
The GPFS command mmgetstate works, and I am able to use mmshutdown
to recover the node.
If I set workerThreads=72 on the 64-core node, then I am no longer able
to reproduce the freeze and I get the correct behavior.
Is this a known bug when the number of cores is greater than workerThreads?
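For reference, a minimal sketch of how I applied the workaround on the
affected node (the node name "node64" is a placeholder; only the
workerThreads=72 value comes from my actual setup):

```shell
# Raise workerThreads above the core count on the affected node only.
# "node64" is a hypothetical node name for illustration.
mmchconfig workerThreads=72 -N node64

# workerThreads only takes effect after restarting the GPFS daemon
# on that node:
mmshutdown -N node64
mmstartup -N node64

# Verify the value the daemon is actually running with:
mmdiag --config | grep -i workerThreads
```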
Best regards,
--
*Nicolas Capit*
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776