Hello team,
One of the most severe issues hitting our real-time application is stuck 
threads. They have multiple causes, such as long-held locks, deadlocks, and 
threads that wait forever for a reply after a packet drop.
Stuck threads of this kind go under the radar of the existing system health 
check methods.
In mission-critical applications, this results in an immediate outage.

As a short-term solution, we are implementing an internal watchdog mechanism 
for stuck-thread detection:
  - There is a registration object.
  - The function executor has start/end hooks that register/unregister the 
    thread via the registration object.
  - A customized, scheduled monitoring thread is spawned on startup. It wakes 
    up every N seconds, scans the registration map, and detects threads that 
    have stayed registered (that is, have not been unregistered) for longer 
    than a configurable time.
  - Once such threads have been detected, a process stack dump is taken and a 
    thread stack statistic metric is provided.
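
The steps above could be sketched roughly as follows. This is a minimal 
illustration only, not our production code and not Geode API: the class name 
StuckThreadWatchdog, its methods, and the thread name are all hypothetical, 
and the sketch prints only the suspect thread's stack rather than a full 
process dump or a metric.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog; all names here are assumptions made for illustration.
class StuckThreadWatchdog {
    // Registration map: thread -> System.nanoTime() at registration.
    private final Map<Thread, Long> registered = new ConcurrentHashMap<>();
    private final long thresholdNanos;

    StuckThreadWatchdog(long threshold, TimeUnit unit) {
        this.thresholdNanos = unit.toNanos(threshold);
    }

    // Start hook: called by the executing thread before the monitored work.
    void register() {
        registered.put(Thread.currentThread(), System.nanoTime());
    }

    // End hook: called by the executing thread once the work has finished.
    void unregister() {
        registered.remove(Thread.currentThread());
    }

    // One scan: threads that have stayed registered longer than the threshold.
    List<Thread> findStuck() {
        long now = System.nanoTime();
        List<Thread> stuck = new ArrayList<>();
        for (Map.Entry<Thread, Long> entry : registered.entrySet()) {
            if (now - entry.getValue() > thresholdNanos) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }

    // Spawns the scheduled monitor, which scans every periodSeconds seconds.
    ScheduledExecutorService startMonitor(long periodSeconds) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread t = new Thread(runnable, "stuck-thread-monitor");
                t.setDaemon(true); // do not keep the JVM alive for the monitor
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            for (Thread t : findStuck()) {
                // Simplification: log the suspect thread's stack only.
                System.err.println("Possibly stuck thread: " + t.getName());
                for (StackTraceElement frame : t.getStackTrace()) {
                    System.err.println("    at " + frame);
                }
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

In this sketch the hooks would wrap the function execution, with unregister() 
called from a finally block so that a thread that fails with an exception is 
not later reported as stuck.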

This helps us monitor, detect, and quickly decide which action to take. 
Usually the decision is to bounce the member (a consistency issue is possible, 
but in our case that is better than a denial of service).
The above solution does not touch the GEODE core code; it is implemented 
entirely within the boundaries of our customized code.

I would like to propose introducing a long-term, generic thread monitoring 
mechanism to detect threads that are stuck for any reason.
It would maintain a monitoring object with start/end methods, invoked 
similarly to FunctionStats.startFunctionExecution and 
FunctionStats.endFunctionExecution.
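
To make the proposed shape concrete, here is a hedged sketch of what such a 
monitoring object might look like. The class ThreadMonitor and the methods 
startExecution, endExecution, activeCount and monitored are hypothetical names 
chosen for this example; they are not existing Geode APIs and do not reproduce 
the FunctionStats signatures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical monitoring object; names are assumptions, not Geode API.
class ThreadMonitor {
    // Thread -> System.nanoTime() at which it entered the monitored section.
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();

    // Start method: marks the current thread as executing monitored work.
    void startExecution() {
        startTimes.put(Thread.currentThread(), System.nanoTime());
    }

    // End method: clears the mark for the current thread.
    void endExecution() {
        startTimes.remove(Thread.currentThread());
    }

    // Number of threads currently inside a monitored section.
    int activeCount() {
        return startTimes.size();
    }

    // Convenience wrapper: the end method runs even if the work throws.
    <T> T monitored(Supplier<T> work) {
        startExecution();
        try {
            return work.get();
        } finally {
            endExecution();
        }
    }
}
```

Pairing the start and end calls through try/finally is one possible design 
choice: it keeps a thread that exits with an exception from remaining in the 
map and being mistaken for a stuck one.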

Your feedback would be appreciated.

Thank you for your cooperation.
Best regards!

Gregory Vortman

