We think our issue was actually down to NUMA settings: mmfsd was allocating GPU memory. That makes sense given the type of error.

Tomer suggested to Simon that we set numactlOption to "0 8", as per:
https://www-01.ibm.com/support/docview.wss?uid=isg1IJ02794

Our tests have not crashed since setting this. We need to roll it out on all nodes to confirm it has fixed all our hangs/reboots.

Cheers,

Luke

--
Luke Sudbery
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday and work from home on Friday.
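For anyone wanting to try the same workaround, it amounts to something like the following (a sketch only: "0 8" is the value from the APAR, presumably the CPU NUMA node list on these POWER9 boxes, so check it against your own topology first; we also assume GPFS has to be restarted on each node before mmfsd picks the binding up):

    # Bind mmfsd memory to the CPU NUMA nodes, per APAR IJ02794.
    # Note mmchconfig applies cluster-wide unless limited with -N.
    mmchconfig numactlOption="0 8"

    # Confirm the setting took.
    mmlsconfig numactlOption

    # Restart GPFS on the node so mmfsd is relaunched under the new binding.
    mmshutdown && mmstartup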
From: [email protected] <[email protected]> On Behalf Of [email protected]
Sent: 19 September 2019 22:35
To: [email protected]
Cc: [email protected]
Subject: Re: [gpfsug-discuss] GPFS and POWER9

Simon,

I have an open support call that required Red Hat to create a kernel patch for RHEL 7.6 because of issues with the Intel X710 network adapter. I can't tell you if it's related to your issue or not, but it would cause the affected GPFS node to reboot if we tried to do almost anything with that Intel adapter.

Regards,

Andrew Beattie
File and Object Storage Technical Specialist - A/NZ
IBM Systems - Storage
Phone: 614-2133-7927
E-mail: [email protected]

----- Original message -----
From: Simon Thompson <[email protected]>
Sent by: [email protected]
To: gpfsug main discussion list <[email protected]>
Cc:
Subject: [EXTERNAL] Re: [gpfsug-discuss] GPFS and POWER9
Date: Fri, Sep 20, 2019 1:18 AM

Hi Andrew,

Yes, but not only. We use the two SFP+ ports from the Broadcom-supplied card + the bifurcated Mellanox card in them.

Simon

From: <[email protected]> on behalf of "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, 19 September 2019 at 11:45
To: "[email protected]" <[email protected]>
Subject: Re: [gpfsug-discuss] GPFS and POWER9

Simon,

Are you using Intel 10Gb network adapters with RHEL 7.6 by any chance?

Regards,

Andrew Beattie
File and Object Storage Technical Specialist - A/NZ
IBM Systems - Storage
Phone: 614-2133-7927
E-mail: [email protected]

----- Original message -----
From: Simon Thompson <[email protected]>
Sent by: [email protected]
To: "[email protected]" <[email protected]>
Cc:
Subject: [EXTERNAL] [gpfsug-discuss] GPFS and POWER9
Date: Thu, Sep 19, 2019 8:42 PM

Recently we've been having some issues with some of our POWER9 systems. They are occasionally hanging or rebooting; in one case, we found we can trigger it by running an MPI IOR workload against GPFS. Every instance that has logged something to syslog has referenced mmfsd, but we don't know if that is a symptom or a cause.
(Sometimes they just hang and we don't see such a message.) We see the following in the kern log:

Sep 18 18:45:14 bear-pg0306u11a kernel: Hypervisor Maintenance interrupt [Recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: Error detail: Malfunction Alert
Sep 18 18:45:14 bear-pg0306u11a kernel: #011HMER: 8040000000000000
Sep 18 18:45:14 bear-pg0306u11a kernel: #011Unknown Malfunction Alert of type 3
Sep 18 18:45:14 bear-pg0306u11a kernel: Hypervisor Maintenance interrupt [Recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: Error detail: Malfunction Alert
Sep 18 18:45:14 bear-pg0306u11a kernel: #011HMER: 8040000000000000
Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: NIP: [00000000115a2478] PID: 141380 Comm: mmfsd
Sep 18 18:45:14 bear-pg0306u11a kernel: Initiator: CPU
Sep 18 18:45:14 bear-pg0306u11a kernel: Error type: UE [Load/Store]
Sep 18 18:45:14 bear-pg0306u11a kernel: Effective address: 000003002a2a8400
Sep 18 18:45:14 bear-pg0306u11a kernel: Physical address: 000003c016590000
Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: NIP: [000000001150b160] PID: 141380 Comm: mmfsd
Sep 18 18:45:14 bear-pg0306u11a kernel: Initiator: CPU
Sep 18 18:45:14 bear-pg0306u11a kernel: Error type: UE [Instruction fetch]
Sep 18 18:45:14 bear-pg0306u11a kernel: Effective address: 000000001150b160
Sep 18 18:45:14 bear-pg0306u11a kernel: Physical address: 000003c01fe80000
Sep 18 18:45:14 bear-pg0306u11a kernel: Severe Machine check interrupt [Not recovered]
Sep 18 18:45:14 bear-pg0306u11a kernel: NIP: [000000001086a7f0] PID: 25926 Comm: mmfsd
Sep 18 18:45:14 bear-pg0306u11a kernel: Initiator: CPU
Sep 18 18:45:14 bear-pg0306u11a kernel: Error type: UE [Instruction fetch]
Sep 18 18:45:14 bear-pg0306u11a kernel: Effective address: 000000001086a7f0
Sep 18 18:45:14 bear-pg0306u11a kernel: Physical address: 000003c00fe70000
Sep 18 18:45:14 bear-pg0306u11a kernel: mmfsd[25926]: unhandled signal 7 at 000000001086a7f0 nip 000000001086a7f0 lr 000000001086a7f0 code 4

I've raised a hardware ticket with IBM, as traditionally a machine check exception would likely be a hardware/firmware issue. Has anyone else seen this sort of behaviour? It's multiple boxes doing this, but they all have the same firmware/RHEL/GPFS stack installed. Asking here as the messages always reference mmfsd PIDs … (but maybe it's a symptom rather than cause)…

Simon
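In case it helps anyone checking whether their own nodes are exposed: the GPU memory on these POWER9 systems typically shows up as extra CPU-less NUMA nodes, which is how mmfsd's allocations can land in GPU memory in the first place. A quick way to inspect a node (a sketch: assumes numactl is installed and mmfsd is running; node numbering varies by machine):

    # List NUMA nodes; CPU nodes list their cpus, GPU memory nodes show none.
    numactl --hardware

    # Illustrative check of which NUMA nodes mmfsd's pages currently sit on.
    grep -E 'N[0-9]+=' /proc/$(pgrep -o mmfsd)/numa_maps | head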
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
