[gpfsug-discuss] strange waiters + filesystem deadlock

Aaron Knister Fri, 24 Mar 2017 09:43:24 -0700

Since yesterday morning we've noticed some deadlocks on one of our filesystems that seem to be triggered by writing to it. The waiters on the clients look like this:

0x19450B0 ( 6730) waiting 2063.294589599 seconds, SyncHandlerThread: on ThCond 0x1802585CB10 (0xFFFFC9002585CB10) (InodeFlushCondVar), reason 'waiting for the flush flag to commit metadata'
0x7FFFDA65E200 ( 22850) waiting 0.000246257 seconds, AllocReduceHelperThread: on ThCond 0x7FFFDAC7FE28 (0x7FFFDAC7FE28) (MsgRecordCondvar), reason 'RPC wait' for allocMsgTypeRelinquishRegion on node 10.1.52.33 <c0n3271>
0x197EE70 ( 6776) waiting 0.000198354 seconds, FileBlockWriteFetchHandlerThread: on ThCond 0x7FFFF00CD598 (0x7FFFF00CD598) (MsgRecordCondvar), reason 'RPC wait' for allocMsgTypeRequestRegion on node 10.1.52.33 <c0n3271>


(10.1.52.33/c0n3271 is the fs manager for the filesystem in question)

There's a single process running on this node writing to the filesystem in question (well, trying to write; it's been blocked doing nothing for half an hour now). There are ~10 other client nodes in this situation right now. We had many more last night before the problem seemed to disappear in the early hours of the morning, and now it's back.

Waiters on the fs manager look like this. While the individual waiter is short, it's a near constant stream:

0x7FFF60003540 ( 8269) waiting 0.001151588 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF601C8860 ( 20606) waiting 0.001115712 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF91C10080 ( 14723) waiting 0.000959649 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFFB03C2910 ( 12636) waiting 0.000769611 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF8C092850 ( 18215) waiting 0.000682275 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF9423F730 ( 12652) waiting 0.000641915 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF9422D770 ( 12625) waiting 0.000494256 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF9423E310 ( 12651) waiting 0.000437760 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
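
With a stream like this it can help to tally the waiters by handler and lock rather than read them one by one. Below is a minimal Python sketch, assuming one waiter per line in the format quoted above. The regular expression and the name summarize_waiters are illustrative assumptions, not part of any GPFS tooling, and other releases may format waiters differently.

```python
import re
from collections import defaultdict

# Pattern inferred from the waiter lines quoted in this post (an assumption;
# other GPFS releases may print waiters differently).
WAITER_RE = re.compile(
    r"^(?P<addr>0x[0-9A-Fa-f]+)\s*\(\s*(?P<tid>\d+)\)\s+"
    r"waiting\s+(?P<secs>\d+(?:\.\d+)?)\s+seconds,\s*"
    r"(?P<thread>.+?):\s*on\s+(?:ThCond|ThMutex)\s+\S+\s+\(\S+\)\s*"
    r"\((?P<obj>\w+)\)"
)


def summarize_waiters(text):
    """Group waiters by (thread name, mutex/condvar name).

    Returns a dict mapping each pair to its count and longest wait in seconds.
    """
    summary = defaultdict(lambda: {"count": 0, "max_secs": 0.0})
    for line in text.splitlines():
        m = WAITER_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a waiter line
        entry = summary[(m.group("thread"), m.group("obj"))]
        entry["count"] += 1
        entry["max_secs"] = max(entry["max_secs"], float(m.group("secs")))
    return dict(summary)


# Three of the fs manager waiters from above, used as sample input.
SAMPLE = """\
0x7FFF60003540 ( 8269) waiting 0.001151588 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF601C8860 ( 20606) waiting 0.001115712 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
0x7FFF91C10080 ( 14723) waiting 0.000959649 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
"""

if __name__ == "__main__":
    for (thread, obj), stats in sorted(summarize_waiters(SAMPLE).items()):
        print(f"{stats['count']:4d}  max {stats['max_secs']:.6f}s  {thread} on {obj}")
```

Run against repeated snapshots, a tally like this would show whether the same handlers keep queueing on AllocManagerMutex even though no single wait is long.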

I don't know if this data point is useful, but both yesterday and today the metadata NSDs for this filesystem have had a constant aggregate stream of 25MB/s 4kop/s reads during each episode (very low latency though, so I don't believe the storage is a bottleneck here). Writes are only a few hundred ops and didn't strike me as odd.

I have a PMR open for this but I'm curious if folks have seen this in the wild and what it might mean.


-Aaron

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
