Vova,

Monitoring systems don't work the way you expect: they LOVE incremental
metrics :)

> 1) When did these events appear?

Monitoring polls the metrics every N seconds.

01.00 got 0, 0
02.00 got 1, 20 -> means between 01.00 and 02.00 there was 1 STW pause with duration 20
03.00 got 3, 100 -> means between 02.00 and 03.00 there were 2 STW pauses with total
duration 80

A good monitoring system will record these values and show you a graph
whenever you decide to check them.
So, at 03.01 you'll see "0, 1, 2" for counts and "0, 20, 80" for durations.
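
Just to illustrate the arithmetic, here is a minimal sketch (not Ignite code,
all names are made up) of what the monitoring side does with such cumulative
counters:

public class StwDeltaSampler {
    private long prevCount;
    private long prevDurationMs;

    /** Turns cumulative (count, totalDuration) into per-interval deltas. */
    public long[] onProbe(long totalCount, long totalDurationMs) {
        long[] delta = {totalCount - prevCount, totalDurationMs - prevDurationMs};
        prevCount = totalCount;
        prevDurationMs = totalDurationMs;
        return delta;
    }

    public static void main(String[] args) {
        StwDeltaSampler s = new StwDeltaSampler();
        // Probes from the example above: (0, 0), (1, 20), (3, 100).
        System.out.println(java.util.Arrays.toString(s.onProbe(0, 0)));   // [0, 0]
        System.out.println(java.util.Arrays.toString(s.onProbe(1, 20)));  // [1, 20]
        System.out.println(java.util.Arrays.toString(s.onProbe(3, 100))); // [2, 80]
    }
}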

> 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> pauses of 1 sec and one critical pause of 90s?

So, the previous probe was 0, 0 and we got it a minute ago.
Now we have 10, 100. That means we had 10 STW pauses with a total duration of
100 over the last minute.

But in case we set the interval to 10 seconds we'll get something like:
1, 20
4, 20
0, 0
4, 10
1, 50
0, 0

So, precision depends on the check interval, and it's up to the administration
team to choose that interval (see the sketch below).
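
Something like this check loop, again just a sketch: the two suppliers stand
in for however the cumulative metrics are actually read (e.g. via JMX), and
the interval parameter is exactly the knob that defines the precision.

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public class StwPoller {
    /** Prints per-interval STW stats every intervalSec seconds. */
    public static void start(long intervalSec, LongSupplier totalCount, LongSupplier totalDurationMs) {
        StwDeltaSampler sampler = new StwDeltaSampler(); // from the sketch above
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            long[] d = sampler.onProbe(totalCount.getAsLong(), totalDurationMs.getAsLong());
            System.out.printf("last %d sec: %d long STW pauses, %d ms total%n",
                intervalSec, d[0], d[1]);
        }, intervalSec, intervalSec, TimeUnit.SECONDS);
    }
}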

> May be a kind of sliding window plus min/max values will do better job.

That's the monitoring system's job (e.g. Zabbix).
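
In Zabbix, for example, that's just a trigger with min()/max()/avg() over the
last N values or minutes of the item. And if you had to do it by hand on the
collector side, it's only a window over the per-interval deltas. Illustrative
sketch only:

import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowMax {
    private final int size;
    private final Deque<Long> window = new ArrayDeque<>();

    public SlidingWindowMax(int size) { this.size = size; }

    /** Adds the latest per-interval STW duration, returns the max over the window. */
    public long addAndGetMax(long durationDeltaMs) {
        window.addLast(durationDeltaMs);
        if (window.size() > size)
            window.removeFirst();
        return window.stream().mapToLong(Long::longValue).max().orElse(0L);
    }
}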


On Wed, Nov 22, 2017 at 4:10 PM, Vladimir Ozerov <voze...@gridgain.com>
wrote:

> The question is how administrator should interpret these numbers. Ok, I
> opened JMX console and see that there was 10 long GC events, which took 100
> seconds.
> 1) When did these events appear? Over the last day, which is more or less
> OK, or over the last 10 minutes, so my server is nearly dead?
> 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> pauses of 1 sec and one critical pause of 90s?
>
> May be a kind of sliding window plus min/max values will do better job.
>
> On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <
> avinogra...@gridgain.com>
> wrote:
>
> > Vova,
> >
> > 1) We can gain collections info from GarbageCollectorMXBean
> > But it provides only
> > - collectionCount
> > - collectionTime.
> >
> > This is very interesting metrics, but they tell us nothing about long
> STW.
> > Long STW means we have huge latency during STW and we should find the
> > reason and solve it.
> >
> > 2) So, we're working on new incremental metrics
> > - total amount of STW longer than XXX
> > - total duration of STW longer than XXX
> >
> > which shows us JVM/GC health situation.
> >
> > Is it answer to your question?
> >
> > On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <voze...@gridgain.com>
> > wrote:
> >
> > > Anton,
> > >
> > > The question is why user may need so precise measurement? I share
> > Andrey’s
> > > opinion - cannot understand the value.
> > >
> > > вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <
> avinogra...@gridgain.com
> > >:
> > >
> > > > Andrey,
> > > >
> > > > >  JVM provides sufficient means of detecting a struggling process
> out
> > of
> > > > the box.
> > > >
> > > > Could you point to some articles describing how to detect STW
> exceeding
> > > > some duration using only JVM API?
> > > >
> > > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> > andrewkor...@hotmail.com
> > > >
> > > > wrote:
> > > >
> > > > > My 2 cents. Don’t do it. JVM provides sufficient means of
> detecting a
> > > > > struggling process out of the box. SRE/Operations teams usually
> know
> > > how
> > > > to
> > > > > monitor JVMs and can handle killing of such processes themselves.
> > > > >
> > > > > The feature adds no value, just complexity (and more configuration
> > > > > parameters (!) — as if Ignite didn’t have enough of them already).
> > > > >
> > > > > Regards,
> > > > > Andrey
> > > > > _____________________________
> > > > > From: Denis Magda <dma...@apache.org>
> > > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > > Subject: Re: Facility to detect long STW pauses and other system
> > > response
> > > > > degradations
> > > > > To: <dev@ignite.apache.org>
> > > > >
> > > > >
> > > > > My 2 cents.
> > > > >
> > > > > 1. Totally for a separate native process that will handle the
> > > monitoring
> > > > > of an Ignite process. The watchdog process can simply start a JVM
> > tool
> > > > like
> > > > > jstat and parse its GC logs:
> > > > > https://dzone.com/articles/how-monitor-java-garbage
> > > > >
> > > > > 2. As for the STW handling, I would make a possible reaction more
> > > > generic.
> > > > > Let’s define a policy (enumeration) that will define how to deal
> with
> > > an
> > > > > unstable node. The events might be as follows - kill a node,
> restart
> > a
> > > > > node, trigger a custom script using Runtime.exec or other methods.
> > > > >
> > > > > What’d you think? Specifically on point 2.
> > > > >
> > > > > —
> > > > > Denis
> > > > >
> > > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > > avinogra...@gridgain.com>
> > > > > wrote:
> > > > > >
> > > > > > Yakov,
> > > > > >
> > > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > > >
> > > > > > We split issue to
> > > > > > #1 STW duration metrics
> > > > > > #2 External monitoring allows to stop node during STW
> > > > > >
> > > > > >> Testing GC pause with java thread is
> > > > > >> a bit strange and can give info only after GC pause finishes.
> > > > > >
> > > > > > That's ok since it's #1
> > > > > >
> > > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > > sbt.sorokin....@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> I have tested solution with java-thread and GC logs had contain
> > same
> > > > > pause
> > > > > >> values of thread stopping which was detected by java-thread.
> > > > > >>
> > > > > >>
> > > > > >> My log (contains pauses > 100ms):
> > > > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 507 milliseconds.
> > > > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 5595 milliseconds.
> > > > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 3262 milliseconds.
> > > > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 1737 milliseconds.
> > > > > >>
> > > > > >> GC log:
> > > > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which
> > > application
> > > > > >> threads were stopped: 0,0000845 seconds, Stopping threads took:
> > > > > 0,0000246
> > > > > >> seconds
> > > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which
> > > application
> > > > > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > > > > 0,0000252
> > > > > >> seconds
> > > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which
> > > application
> > > > > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > > > > 0,0000178
> > > > > >> seconds // GOT!
> > > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which
> > > application
> > > > > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > > > > 0,0000229
> > > > > >> seconds // GOT!
> > > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which
> > > application
> > > > > >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> > > > > 0,0000223
> > > > > >> seconds // GOT!
> > > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which
> > > application
> > > > > >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> > > > > 0,0000121
> > > > > >> seconds // GOT!
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.
> com/
> > > > > >>
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
