Re: [gpfsug-discuss] mmsysmon.py revisited

Michael D Harris Wed, 19 Jul 2017 07:41:00 -0700

Hi David,

Re: "The impact we were seeing was some variation in MPI benchmark results
when the nodes were fully loaded."

MPI workloads show the most mmhealth impact. Specifically, the more
sensitive the workload is to jitter, the higher the potential impact.
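
As a rough illustration of what "sensitive to jitter" means, here is a toy
model (not psnap, and not a measurement): every node is interrupted briefly
at a fixed interval, with start times staggered across nodes. The node
count, step count, interval and delay are made-up values, and run_time is a
hypothetical helper.

```python
def run_time(nodes, steps, interval, delay, barrier):
    """Toy model: each step costs 1 time unit; node k is interrupted for
    `delay` units on every step t where (t - k) % interval == 0."""
    def hit(node, step):
        return (step - node) % interval == 0

    if barrier:
        # Every step ends in a barrier, so the step takes as long as the
        # slowest node: one interrupted node delays all of them.
        return sum(1 + (delay if any(hit(k, t) for k in range(nodes)) else 0)
                   for t in range(steps))
    # No synchronisation: each node only pays for its own interruptions,
    # and the job finishes when the slowest node finishes.
    return max(sum(1 + (delay if hit(k, t) else 0) for t in range(steps))
               for k in range(nodes))


# Assumed, illustrative parameters.
loose = run_time(nodes=8, steps=400, interval=4, delay=0.5, barrier=False)
tight = run_time(nodes=8, steps=400, interval=4, delay=0.5, barrier=True)
print(loose, tight)
```

In this model the unsynchronised job takes 450.0 units and the barrier job
600.0, because with staggered interruptions some node is always the slow one
at every barrier.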

The mmhealth config interval, as per Mathias's link, is a scalar applied to
all monitor interval values in the configuration file. As such, it currently
modifies the server-side monitoring and health reporting in addition to
mitigating MPI client impact. So "medium" == 5 is a reasonable value,
whereas the "slow" == 10 scalar may be too infrequent for your server-side
monitoring and reporting (a 30 second update becomes 5 minutes).
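
The arithmetic can be sketched as follows. Only the two scalars quoted above
are used, and the 30 second base interval is the example from this message;
INTERVAL_SCALAR and effective_interval are hypothetical names, not part of
mmhealth.

```python
# Scalars quoted in this thread; other keywords are not covered here.
INTERVAL_SCALAR = {"medium": 5, "slow": 10}


def effective_interval(base_seconds, setting):
    """Return the monitor interval after the scalar is applied."""
    return base_seconds * INTERVAL_SCALAR[setting]


for setting in ("medium", "slow"):
    seconds = effective_interval(30, setting)
    print(f"{setting}: 30 s -> {seconds} s ({seconds / 60:g} min)")
```

So a 30 second monitor runs every 150 seconds with "medium" and every 300
seconds with "slow".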

The clock alignment that Mathias mentioned is a new, investigatory,
undocumented tool for MPI workloads. It removes nearly all mmhealth MPI
jitter while retaining the default monitor intervals. It also naturally
generates a thundering herd, with all clients reporting to the quorum
nodes at the same moment. So while you may mitigate the client MPI jitter,
you may severely impact server throughput at those intervals, if not also
exceed connection and thread limits.
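
A minimal sketch of the herd effect, under an assumed model: the alignment
mechanism itself is undocumented, so this only counts how many client
reports would land in the same second if every timer fired on the same
wall-clock boundary. The client count, interval and horizon are invented,
and peak_reports is a hypothetical helper.

```python
from collections import Counter


def peak_reports(clients, interval, horizon, aligned):
    """Largest number of client reports landing in the same second.

    Toy model: client k first reports at second (k % interval) when its
    timer started at an arbitrary moment, or at second 0 when all timers
    are aligned to the wall clock, then every `interval` seconds.
    """
    per_second = Counter()
    for k in range(clients):
        first = 0 if aligned else k % interval
        for t in range(first, horizon, interval):
            per_second[t] += 1
    return max(per_second.values())


# Assumed, illustrative parameters.
spread = peak_reports(clients=3000, interval=30, horizon=300, aligned=False)
herd = peak_reports(clients=3000, interval=30, horizon=300, aligned=True)
print(spread, herd)
```

With these numbers the unaligned peak is 100 reports in one second, while
the aligned peak is all 3000 at once.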

Configuring "clients" separately from "servers" without resorting to
alignment is another area of investigation.

I'm not familiar with your PMR, but as Mathias mentioned, "mmhealth config
interval medium" would be a good start. In testing that Kums and I have
done, "mmhealth config interval medium" provides mitigation almost as good
as the clock alignment for MPI, for example for a psnap with barrier type
workload.

Regards, Mike Harris

IBM Spectrum Scale - Core Team

From:   [email protected]
To:     [email protected]
Date:   07/19/2017 09:28 AM
Subject:        gpfsug-discuss Digest, Vol 66, Issue 30
Sent by:        [email protected]

Today's Topics:

   1. Re: mmsysmon.py revisited (Mathias Dietz)
   2. Re: mmsysmon.py revisited (David Johnson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 Jul 2017 15:05:49 +0200
From: "Mathias Dietz" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] mmsysmon.py revisited
Message-ID:

<ofca7d9a5e.c7b3505a-onc1258162.00420361-c1258162.0047f...@notes.na.collabserv.com>

Content-Type: text/plain; charset="iso-8859-1"

Thanks for the feedback.

Let me clarify what mmsysmon is doing.
Since IBM Spectrum Scale 4.2.1 the mmsysmon process is used for the
overall health monitoring and CES failover handling.
Even without CES it is an essential part of the system because it monitors
the individual components and provides health state information and error
events.
This information is needed by other Spectrum Scale components (mmhealth
command, the IBM Spectrum Scale GUI, Support tools, Install Toolkit,..)
and therefore disabling mmsysmon will impact them.

> It's a huge problem. I don't understand why it hasn't been given
> much credit by dev or support.

Over the last couple of months, the development team has put a strong focus
on this topic.
In order to monitor the health of the individual components, mmsysmon
listens for notifications/callbacks but also has to do some polling.
We are continually trying to reduce the polling overhead and replace
polling with notifications when possible.

Several improvements have been added to 4.2.3, including the ability to
configure the polling frequency to reduce the overhead (mmhealth config
interval).
See
https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmhealth.htm

In addition a new option has been introduced to clock align the monitoring
threads in order to reduce CPU jitter.

Nevertheless, we don't see significant CPU consumption by mmsysmon on our
test systems.
It might be a problem specific to your system environment or a
misconfiguration, so please get in contact with IBM support to analyze
the root cause of the high usage.

Kind regards

Mathias Dietz

IBM Spectrum Scale - Release Lead Architect and RAS Architect

[email protected] wrote on 07/18/2017 07:51:21 PM:

> From: Jonathon A Anderson <[email protected]>
> To: gpfsug main discussion list <[email protected]>
> Date: 07/18/2017 07:51 PM
> Subject: Re: [gpfsug-discuss] mmsysmon.py revisited
> Sent by: [email protected]
>
> There's no official way to cleanly disable it so far as I know yet;
> but you can de facto disable it by deleting
> /var/mmfs/mmsysmon/mmsysmonitor.conf.
>
> It's a huge problem. I don't understand why it hasn't been given
> much credit by dev or support.
>
> ~jonathon
>
>
> On 7/18/17, 11:21 AM, "[email protected] on
> behalf of David Johnson" <[email protected]
> on behalf of [email protected]> wrote:
>
>     We also noticed a fair amount of CPU time accumulated by mmsysmon.py on
>     our diskless compute nodes. I read the earlier query, where it
>     was answered:
>
>     ces == Cluster Export Services,  mmsysmon.py comes from
>     mmcesmon. It is used for managing export services of GPFS. If it is
>     killed,  your nfs/smb etc will be out of work.
>     Their overhead is small and they are very important. Don't
>     attempt to kill them.
>
>     Our question is this: we don't run the latest "protocols", our
>     NFS is CNFS, and our CIFS is clustered CIFS.
>     I can understand it might be needed with Ganesha, but on every node?
>
>     Why in the world would I be getting this daemon running on all
>     client nodes, when I didn't install the "protocols" version
>     of the distribution?   We have release 4.2.2 at the moment.  How
>     can we disable this?
>
>     Thanks,
>      -- ddj
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

------------------------------

Message: 2
Date: Wed, 19 Jul 2017 09:28:23 -0400
From: David Johnson <[email protected]>
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] mmsysmon.py revisited
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

I have opened a PMR, and the official response reflects what you just
posted.
In addition, it seems there are some performance issues with Python 2 that
will be improved with eventual migration to Python 3.  I was unaware of the
mmhealth functions that the mmsysmon daemon provides. The impact we were
seeing was some variation in MPI benchmark results when the nodes were
fully loaded.
I suppose it would be possible to turn off mmsysmon during the
benchmarking, but I appreciate the effort at streamlining the monitor
service.  Cutting back on fork/exec, better python, less polling, more
notifications: all good.

Thanks for the details,

 -- ddj

------------------------------

End of gpfsug-discuss Digest, Vol 66, Issue 30
**********************************************
