Denis,

> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs:
> https://dzone.com/articles/how-monitor-java-garbage

Different GCs, and even the same GC on different OSes/JVMs, produce different
logs, so they are not easy to parse.
But since http://gceasy.io can do that, it looks to be possible, somehow :).
Do you know of any libs or solutions that can do this in real time?
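
To show how small the per-line parsing could be for one known format, here is a minimal Java sketch. It assumes HotSpot 8-style logs written with -XX:+PrintGCApplicationStoppedTime, i.e. the "Total time for which application threads were stopped" lines quoted further down this thread. The class and method names are made up for illustration, and other collectors or JVM versions would need their own patterns.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ignite: extracts the stop-the-world
// duration from one HotSpot 8-style "application threads were stopped" line.
final class StoppedTimeParser {
    // Accepts both "." and "," as the decimal separator, because the JVM
    // formats these numbers with the default locale of the process.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: (\\d+)[.,](\\d+) seconds");

    /** Returns the pause in seconds, or empty if the line is not a stopped-time line. */
    static OptionalDouble parseStoppedSeconds(String line) {
        Matcher m = STOPPED.matcher(line);
        if (!m.find())
            return OptionalDouble.empty();
        // Rebuild the number with a "." so that parsing is locale-independent.
        return OptionalDouble.of(Double.parseDouble(m.group(1) + "." + m.group(2)));
    }

    public static void main(String[] args) {
        String line = "2017-11-20T17:33:28.822+0300: 1,330: Total time for which application "
            + "threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178 seconds";
        parseStoppedSeconds(line).ifPresent(s -> System.out.println("STW pause, seconds: " + s));
    }
}
```

A watchdog could feed each new log line through parseStoppedSeconds and compare the result with a threshold. The hard part stays the same: the log line is written only after the pause has ended, and every other log format needs its own pattern.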

> 2. As for the STW handling, I would make a possible reaction more generic.
> Let's define a policy (enumeration) that will define how to deal with an
> unstable node. The events might be as follows - kill a node, restart a
> node, trigger a custom script using Runtime.exec or other methods.

Yes, it should be similar to segmentation policy + custom script execution.

On Tue, Nov 21, 2017 at 2:10 AM, Denis Magda <dma...@apache.org> wrote:

> My 2 cents.
>
> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs:
> https://dzone.com/articles/how-monitor-java-garbage
>
> 2. As for the STW handling, I would make a possible reaction more generic.
> Let's define a policy (enumeration) that will define how to deal with an
> unstable node. The events might be as follows - kill a node, restart a
> node, trigger a custom script using Runtime.exec or other methods.
>
> What'd you think? Specifically on point 2.
>
> --
> Denis
>
> > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <avinogra...@gridgain.com> wrote:
> >
> > Yakov,
> >
> > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> >
> > We split issue to
> > #1 STW duration metrics
> > #2 External monitoring allows to stop node during STW
> >
> >> Testing GC pause with java thread is
> >> a bit strange and can give info only after GC pause finishes.
> >
> > That's ok since it's #1
> >
> > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <sbt.sorokin....@gmail.com> wrote:
> >
> >> I have tested solution with java-thread and GC logs had contain same pause
> >> values of thread stopping which was detected by java-thread.
> >>
> >> My log (contains pauses > 100ms):
> >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW pause: 507 milliseconds.
> >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW pause: 5595 milliseconds.
> >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW pause: 3262 milliseconds.
> >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW pause: 1737 milliseconds.
> >>
> >> GC log:
> >> gridgain@dell-5580-92zc8h2:~$ cat ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246 seconds
> >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252 seconds
> >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178 seconds // GOT!
> >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229 seconds // GOT!
> >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223 seconds // GOT!
> >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121 seconds // GOT!
> >>
> >> --
> >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
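
For reference, the java-thread approach from the quoted test can be sketched roughly as below. This is only an illustration of the idea, not the code that was actually tested: the class name, the 10 ms sleep interval, and the callback are assumptions, and the 100 ms threshold is taken from the "contains pauses > 100ms" note in the quoted log. A thread sleeps for a short fixed interval and treats any large overshoot as a possible STW pause.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

// Hypothetical sketch of an in-process pause detector; not Ignite's implementation.
final class PauseDetector implements Runnable {
    private final long sleepMillis;      // how long each cycle asks to sleep
    private final long thresholdMillis;  // overshoot above this is reported
    private final LongConsumer onPause;  // reaction, e.g. logging or a policy hook

    PauseDetector(long sleepMillis, long thresholdMillis, LongConsumer onPause) {
        this.sleepMillis = sleepMillis;
        this.thresholdMillis = thresholdMillis;
        this.onPause = onPause;
    }

    /** Overshoot of one cycle in ms: elapsed time minus the requested sleep, never negative. */
    static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsed = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsed - sleepMillis);
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop
                return;
            }
            long pause = overshootMillis(start, System.nanoTime(), sleepMillis);
            if (pause > thresholdMillis)
                onPause.accept(pause);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new PauseDetector(10, 100,
            p -> System.out.println("Possible too long STW pause: " + p + " milliseconds.")),
            "stw-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(1000); // let the detector run for a second
        t.interrupt();
    }
}
```

Two caveats from the thread apply to this sketch. The detector thread is itself stopped during the pause, so it can only report after the pause has finished. It also cannot tell a GC pause from OS scheduling delays, which is why the message says "possible".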