Hi Heiner,

I can't really give you insight into how the events' severity levels were decided. Maybe somebody else can answer that here.
As for your question about triggering debug data collection, please have a look at this documentation page:
https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.4/com.ibm.spectrum.scale.v5r04.doc/bl1adv_createscriptforevents.htm

This feature has been in the product since the 5.0.x releases and should be helpful here. It triggers your eventsCallback script when the event is raised. One of the script's arguments is the event name, so you can write a script that checks for the event name longwaiters_found, then runs mmdiag --deadlock and writes the output to a text file. The script call has a hard timeout of 60 seconds so that it does not interfere too much with the mmsysmon internals, but a run time of less than 1 second would be better.

Kind regards

Anna Greim
Software Engineer, Spectrum Scale Development
IBM Systems

IBM Deutschland Research & Development GmbH / Chairman of the Supervisory Board: Gregor Pillen
Management: Dirk Wittkopp
Registered office: Böblingen / Commercial register: Amtsgericht Stuttgart, HRB 243294

From: "Billich Heinrich Rainer (ID SD)" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 16/04/2020 10:36
Subject: [EXTERNAL] [gpfsug-discuss] Mmhealth events longwaiters_found and deadlock_detected
Sent by: [email protected]

Hello,

I'm puzzled about the difference between the two mmhealth events

longwaiters_found    ERROR     Detected Spectrum Scale long-waiters
deadlock_detected    WARNING   The cluster detected a Spectrum Scale filesystem deadlock

In particular, why does the latter only have level WARNING while the former has level ERROR? longwaiters_found is based on the output of "mmdiag --deadlock" and occurs much more often on our clusters, while the latter is probably triggered by an external event rather than by an internal mmsysmon check? Is deadlock detection handled by mmfsd? Whenever a deadlock is detected, some debug data is collected, which is not the case for longwaiters_found.
Hm, so why is no deadlock detected whenever mmdiag --deadlock shows waiting threads? Shouldn't the severities be the other way round?

Finally: can we trigger some debug data collection whenever a longwaiters_found event happens? Just getting the output of "mmdiag --deadlock" on the single node could give some hints. Without it I don't see any real chance to take any action.

Thank you,

Heiner
--
=======================
Heinrich Billich
ETH Zürich
Informatikdienste
Tel.: +41 44 632 72 56
[email protected]
========================

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
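For anyone who wants to try the eventsCallback approach from the reply in this thread, here is a minimal sketch in Python. It is not taken from the IBM documentation. The mmdiag path, the output directory and the helper names are assumptions for illustration. The script does not rely on the position of the event name in the argument list, since the thread does not state it; it simply looks for longwaiters_found among all arguments. Please check the linked documentation page for where the script has to be installed and which arguments it really receives.

```python
#!/usr/bin/env python3
"""Sketch of an eventsCallback script that saves `mmdiag --deadlock` output."""
import subprocess
import sys
import time
from pathlib import Path

EVENT_NAME = "longwaiters_found"
# Assumed install path of mmdiag; adjust if your installation differs.
MMDIAG_CMD = ["/usr/lpp/mmfs/bin/mmdiag", "--deadlock"]
# Hypothetical output directory, not something Spectrum Scale defines.
OUT_DIR = Path("/var/tmp/longwaiters")


def should_collect(args):
    """Return True if any callback argument equals the event name."""
    return EVENT_NAME in args


def start_collection(cmd, out_dir):
    """Start cmd detached and send its stdout/stderr to a timestamped file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / time.strftime("mmdiag_deadlock_%Y%m%d_%H%M%S.txt")
    with open(out_file, "ab") as fh:
        # Popen returns immediately, so the callback itself stays well
        # below the timeout even if the command takes a while.
        subprocess.Popen(
            cmd,
            stdout=fh,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # detach the child from the caller
        )
    return out_file


if __name__ == "__main__":
    if should_collect(sys.argv[1:]):
        start_collection(MMDIAG_CMD, OUT_DIR)
```

The command is started in the background instead of being waited for, so the callback returns almost immediately, which matches the advice to keep the run time short. If events arrive within the same second, their output is appended to the same file.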
