[ 
https://issues.apache.org/jira/browse/GEODE-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573947#comment-16573947
 ] 

ASF subversion and git services commented on GEODE-5142:
--------------------------------------------------------

Commit 143419e91d9c714338833b3a6cff598a7dc7f096 in geode's branch 
refs/heads/feature/Micrometer from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=143419e ]

GEODE-5142 new Thread Monitoring Mechanism

One of the most severe issues hitting our real time application is thread
stuck for multiple reasons, such as long lasting locks, deadlocks, threads
which wait for reply forever in case of packet drop issue etc...

Such kind of stuck are under Radar of the existing system health check
methods. In mission critical applications, this will be resulted as an
immediate outage.

Here we introduce thread monitoring mechanism, to detect threads which are
stuck for any reason.

If a thread is deemed stuck you will see a message like this in the
Geode log file at "warn" level:

  Thread <thread name> is stuck , initiating handleExpiry

There is also a new ResourceManager statistic keep track of the number
of stuck threads in a JVM.

Thread monitoring is enabled by default.  To disable it, set this
distributed system property:

thread-monitor-enabled=false

To adjust the monitoring interval from the default 60,000ms set this
distributed system property:

thread-monitor-interval-ms

To adjust how long a thread is unchanging before it is considered stuck
set this distributed system property, which defaults to 30,000ms:

thread-monitor-time-limit-ms

This closes #1868

commit 70b0e0889695ea3efc1ea7e095a90e6feb6a2b7f
Author: yossireg <[email protected]>
Date:   Sat Jul 21 23:23:33 2018 +0300

    Thread Monitoring - fixing integrated and distributed tests

commit 7fe285f03a20a57a1d80341e6381a8f6c4e50fd9
Author: yossireg <[email protected]>
Date:   Wed Jul 18 19:16:09 2018 +0300

    retrigger concourse-ci/UITests after timeout failure - 
org.openqa.selenium.TimeoutException

commit 1c791d3a722c7651f41dd5f54e2f3818787e04f8
Author: yossireg <[email protected]>
Date:   Wed Jul 18 16:52:19 2018 +0300

    Thread Monitor - fixing build error

commit 74aaceec67308791f8c831d9e1650a21a9a7762e
Merge: c076dcb25 897549530
Author: yossireg <[email protected]>
Date:   Wed Jul 18 14:57:38 2018 +0300

    Merge remote-tracking branch 'upstream/develop' into develop

commit c076dcb25bf5cf39859a46f9132c8d185cdf7c0e
Author: yossireg <[email protected]>
Date:   Wed Jul 18 12:48:52 2018 +0300

    ThreadMonitor - resolve complex conflict

commit a0b22cbb0d8664d110ac3b0d8b869969a3923d2e
Author: yossireg <[email protected]>
Date:   Wed Jul 18 10:53:21 2018 +0300

    Thread Monitoring changes - as per Bruce request

commit 9b03e0258ea47c68ea02d719f615e1d4473948d9
Author: yossireg <[email protected]>
Date:   Sun Jul 1 16:41:56 2018 +0300

    ThreadsMonitoring changes - as per Galen's request

commit 84040cdb3c8f3d5194afd727d5b58f503d7b92d8
Author: yossireg <[email protected]>
Date:   Sun Jun 3 14:30:12 2018 +0300

    Thread Monitoring - fixing merge spotlessApply

commit 42b4d2621269caf0450201f59ceca10f7bf84207
Merge: 5225d2910 62665a47a
Author: yossireg <[email protected]>
Date:   Sun Jun 3 09:08:59 2018 +0300

    Merge branch 'develop' into develop

commit 5225d2910ad5405d891df6dc0fc4a32756305fe2
Author: yossireg <[email protected]>
Date:   Tue May 29 21:01:04 2018 +0300

    Thread Monitor

commit 95b7fc9b43872794a953d6599c2a5ca5d28b985f
Merge: 17a631059 83d7f6aae
Author: yossireg <[email protected]>
Date:   Tue May 29 18:33:56 2018 +0300

    Merge branch 'develop' of https://github.com/yossireg/geode into develop

commit 17a6310599f0fd1b4ba7afe23b6a7d2eb2dbdb66
Author: yossireg <[email protected]>
Date:   Tue May 29 18:30:12 2018 +0300

    Thread Monitoring - fixing geode-wan error

commit 83d7f6aae7e7fc68edf995d3f2c2e567967b09c2
Merge: a6d4acaf7 4e7c6f1f3
Author: yossireg <[email protected]>
Date:   Tue May 29 17:12:19 2018 +0300

    Merge branch 'develop' into develop

commit a6d4acaf7be35cf87d01acbf61cf4242b4f0bf98
Author: yossireg <[email protected]>
Date:   Tue May 29 16:57:40 2018 +0300

    Thread Monitoring - removing singleton pattern

commit a622b21627c61723d35a127d23dd1c5d2ae2289e
Author: yossireg <[email protected]>
Date:   Wed May 23 10:25:25 2018 +0300

    ThreadMonitoring - changing object of TM to be final

commit dc4989c85a0aa9f6c46466e9665f7915edc9374f
Author: yossireg <[email protected]>
Date:   Tue May 15 16:05:38 2018 +0300

    Thread Monitoring - changing param to static final

commit 46fe08c08ede2afc3d535c516e63b35bf073fd61
Author: yossireg <[email protected]>
Date:   Thu May 10 12:27:44 2018 +0300

    ThreadMonitoring Sync to last changes in develop branch

commit 18cd7bc95e8b0dd734a03e34cedd4597896e5279
Merge: 1f3f0e3eb de8800047
Author: yossireg <[email protected]>
Date:   Thu May 10 10:50:09 2018 +0300

    Merge branch 'develop' into develop

commit 1f3f0e3eb3df7a5dd965829bb59552cf6a8935bb
Author: yossireg <[email protected]>
Date:   Thu May 10 08:08:57 2018 +0300

    ThreadMonitoring

    retriggering the CI after it failed while it shouldnt have - locally it is 
working fine and the error mentioned there that Symbol does not exists should 
not occure

commit ae0b7639f4bff65c4613fac670043053910b7920
Author: yossireg <[email protected]>
Date:   Wed May 9 14:35:30 2018 +0300

    ThreadMonitoring

    Code changes as per community request

commit a5eeeb97190cc8a3dda9d5a228166a839212602c
Author: yossireg <[email protected]>
Date:   Thu Apr 26 11:04:40 2018 +0300

    Thread Monitoring Mechanism


> new Thread Monitoring Mechanism
> -------------------------------
>
>                 Key: GEODE-5142
>                 URL: https://issues.apache.org/jira/browse/GEODE-5142
>             Project: Geode
>          Issue Type: New Feature
>          Components: management
>            Reporter: yossi reginiano
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> One of the most severe issues hitting our real time application is thread 
> stuck for multiple reasons, such as long lasting locks, deadlocks, threads 
> which wait for reply forever in case of packet drop issue etc...
> Such kind of stuck are under Radar of the existing system health check 
> methods. In mission critical applications, this will be resulted as an 
> immediate outage.
> Here we introduce thread monitoring mechanism, to detect threads which are 
> stuck for any reason.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to