That's a fascinating bug. When the node is locked up, what does "mmdiag
--waiters" show from the node in question? I suspect there's more
low-level diagnostic data that would be helpful for the gurus at IBM, but
I'm just curious what the waiters look like.
-Aaron
On 6/26/17 3:49 AM, CAPIT, NICOLAS wrote:
Hello,
I don't know whether this behavior/bug has already been reported on this
mailing list, so I am posting it just in case.
Context:
- SpectrumScale 4.2.2-3
- client node with 64 cores
- OS: RHEL7.3
When an MPI job with 64 processes is launched on the node with 64 cores,
the FS freezes (only the MPI job's output log file is written to
GPFS, so it may be related to the 64 processes writing to the same
file?).
strace -p 3105 # mmfsd pid, stuck
Process 3105 attached
wait4(-1, # stuck at this point
strace ls /gpfs
stat("/gpfs", {st_mode=S_IFDIR|0755, st_size=131072, ...}) = 0
openat(AT_FDCWD, "/gpfs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC
# stuck at this point
I have no problem with the other nodes, which have 28 cores.
The GPFS command mmgetstate works, and I am able to use mmshutdown
to recover the node.
If I set workerThreads=72 on the 64-core node, then I am no longer able
to reproduce the freeze and I get the correct behavior.
Is this a known bug when the number of cores is greater than workerThreads?
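For reference, a minimal sketch of how I applied the workaround on the
affected node (the node name "node64" is a placeholder; only the
workerThreads=72 value comes from my actual setup):

```shell
# Raise workerThreads above the core count on the affected node only.
# "node64" is a hypothetical node name for illustration.
mmchconfig workerThreads=72 -N node64

# workerThreads only takes effect after restarting the GPFS daemon
# on that node:
mmshutdown -N node64
mmstartup -N node64

# Verify the value the daemon is actually running with:
mmdiag --config | grep -i workerThreads
```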
Best regards,
--
*Nicolas Capit*
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776