OPA behaves _significantly_ differently from Mellanox IB. OPA uses the host CPU for packet processing, whereas Mellanox IB uses a discrete ASIC on the HBA. As a result, in our experience, OPA is much more sensitive to task placement and interrupts, because host CPU load competes with fabric IO processing.

~jonathon


On 7/19/17, 12:12 PM, "[email protected] on behalf of [email protected]" <[email protected] on behalf of [email protected]> wrote:

We have FDR14 Mellanox fabric, probably similar interrupt load as OPA.

-- ddj
Dave Johnson

On Jul 19, 2017, at 1:52 PM, Jonathon A Anderson <[email protected]> wrote:

>> It might be a problem specific to your system environment or a wrong configuration therefore please get in contact with IBM support to analyze the root cause of the high usage.
>
> I suspect it’s actually a result of frequent IO interrupts causing jitter in conflict with MPI on the shared Intel Omni-Path network, in our case.
>
> We’ve already tried pursuing support on this through our vendor, DDN, and got nowhere. Eventually we were the ones who tried killing mmsysmon, and that fixed our problem.
>
> The official company line of “we don't see significant CPU consumption by mmsysmon on our test systems” isn’t helping. Do you have a test system with OPA?
>
> ~jonathon
>
>
> On 7/19/17, 7:05 AM, "[email protected] on behalf of Mathias Dietz" <[email protected] on behalf of [email protected]> wrote:
>
> thanks for the feedback.
>
> Let me clarify what mmsysmon is doing.
> Since IBM Spectrum Scale 4.2.1 the mmsysmon process is used for the overall health monitoring and CES failover handling.
> Even without CES it is an essential part of the system because it monitors the individual components and provides health state information and error events.
>
> This information is needed by other Spectrum Scale components (mmhealth command, the IBM Spectrum Scale GUI, Support tools, Install Toolkit,..) and therefore disabling mmsysmon will impact them.
>
> >> It’s a huge problem. I don’t understand why it hasn’t been given
> >> much credit by dev or support.
>
> Over the last couple of months, the development team has put a strong focus on this topic.
>
> In order to monitor the health of the individual components, mmsysmon listens for notifications/callback but also has to do some polling.
> We are trying to reduce the polling overhead constantly and replace polling with notifications when possible.
>
> Several improvements have been added to 4.2.3, including the ability to configure the polling frequency to reduce the overhead. (mmhealth config interval)
>
> See https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmhealth.htm
> In addition a new option has been introduced to clock align the monitoring threads in order to reduce CPU jitter.
>
> Nevertheless, we don't see significant CPU consumption by mmsysmon on our test systems.
>
> It might be a problem specific to your system environment or a wrong configuration therefore please get in contact with IBM support to analyze the root cause of the high usage.
>
> Kind regards
>
> Mathias Dietz
>
> IBM Spectrum Scale - Release Lead Architect and RAS Architect
>
>
> [email protected] wrote on 07/18/2017 07:51:21 PM:
>
>> From: Jonathon A Anderson <[email protected]>
>> To: gpfsug main discussion list <[email protected]>
>> Date: 07/18/2017 07:51 PM
>> Subject: Re: [gpfsug-discuss] mmsysmon.py revisited
>> Sent by: [email protected]
>>
>> There’s no official way to cleanly disable it so far as I know yet;
>> but you can de facto disable it by deleting
>> /var/mmfs/mmsysmon/mmsysmonitor.conf.
>>
>> It’s a huge problem. I don’t understand why it hasn’t been given
>> much credit by dev or support.
>>
>> ~jonathon
>>
>>
>> On 7/18/17, 11:21 AM, "[email protected] on behalf of David Johnson" <[email protected] on behalf of [email protected]> wrote:
>>
>> We also noticed a fair amount of CPU time accumulated by mmsysmon.py on
>> our diskless compute nodes. I read the earlier query, where it
>> was answered:
>>
>> ces == Cluster Export Services, mmsysmon.py comes from
>> mmcesmon. It is used for managing export services of GPFS. If it is
>> killed, your nfs/smb etc will be out of work.
>> Their overhead is small and they are very important. Don't
>> attempt to kill them.
>>
>> Our question is this: we don’t run the latest “protocols”, our
>> NFS is CNFS, and our CIFS is clustered CIFS.
>> I can understand it might be needed with Ganesha, but on every node?
>>
>> Why in the world would I be getting this daemon running on all
>> client nodes, when I didn’t install the “protocols” version
>> of the distribution? We have release 4.2.2 at the moment. How
>> can we disable this?
>>
>> Thanks,
>> -- ddj
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
