Igniters,

Pulling this discussion up. Any thoughts?

--
Denis

On Thu, Jun 21, 2018 at 3:52 PM Denis Magda <[email protected]> wrote:

> Igniters,
>
> It's a pleasure to see how our project is evolving in the direction of
> being a self-healing solution:
>
> - Ignite can already handle critical failures such as OOM, File I/O
>   issues, etc. [1]
> - There is an endeavor to fix cluster lock-ins due to partition map
>   exchange issues. [2]
>
> There is one more notorious problem that might affect Ignite deployments:
> long stop-the-world GC pauses.
>
> I know we made some progress in this direction [3] by providing
> particular metrics that help to monitor the pauses. Why don't we keep up
> the pace and teach Ignite to help itself if it sees that a node is
> bringing down overall cluster performance due to a stop-the-world pause
> (STP)?
>
> I would create policies similar to the critical failures policies [4] or
> just add a long STP to the list of critical failures and reuse the
> existing functionality.
>
> Thoughts? Anyone who'd like to implement the feature?
>
> [1] https://apacheignite.readme.io/docs/critical-failures-handling
> [2]
> http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
> [3] https://issues.apache.org/jira/browse/IGNITE-6171
> [4]
> https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling
