Re: Automatic Handling of Long Stop-the-World Pauses

Denis Magda Tue, 10 Jul 2018 11:59:27 -0700

I see, then we need to come up with an external process-based solution for
the sake of the new ticket.
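
A rough sketch of what such an external check could look like, assuming the
node touches a heartbeat file periodically and a separate supervisor process
reads its timestamp. PauseWatchdog, beat and nodeLooksPaused are hypothetical
names for illustration only; none of this is an existing Ignite API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical helper, not an Ignite API: staleness check for a heartbeat file.
class PauseWatchdog {
    // Pure check: true if the last heartbeat is older than the allowed pause.
    static boolean isStale(long lastBeatMillis, long nowMillis, long maxPauseMillis) {
        return nowMillis - lastBeatMillis > maxPauseMillis;
    }

    // Node side (inside the Ignite JVM): stamp the heartbeat file with the
    // current time. Assumes the file already exists.
    static void beat(Path heartbeatFile) throws IOException {
        Files.setLastModifiedTime(heartbeatFile,
            FileTime.fromMillis(System.currentTimeMillis()));
    }

    // Supervisor side (a separate process): this code keeps running while the
    // node JVM is stopped, so it can notice that the heartbeats have ceased.
    static boolean nodeLooksPaused(Path heartbeatFile, long maxPauseMillis)
        throws IOException {
        long lastBeat = Files.getLastModifiedTime(heartbeatFile).toMillis();
        return isStale(lastBeat, System.currentTimeMillis(), maxPauseMillis);
    }
}
```

A stale heartbeat alone cannot tell a stop-the-world pause from a crashed or
hung process, so what the supervisor does next (alert, stop the node) is a
separate policy decision that the sketch leaves out.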


--
Denis

On Tue, Jul 10, 2018 at 6:01 AM Andrey Gura <[email protected]> wrote:

> Denis,
>
> we have LongJVMPauseDetector. But it is a Java thread, so it will be at a
> safepoint during a stop-the-world pause and therefore will not make any
> progress. So only an external process can detect an SW pause.
>
> On Mon, Jul 2, 2018 at 10:34 PM Denis Magda <[email protected]> wrote:
> >
> > Pavel,
> >
> > We can already monitor the state of individual nodes and show it through
> > metrics. Now I'd like to see how to go further and automate the decision
> > on whether a node should be kicked out of the cluster.
> >
> > --
> > Denis
> >
> > On Mon, Jul 2, 2018 at 12:28 PM Pavel Kovalenko <[email protected]> wrote:
> >
> > > Denis,
> > >
> > > I think the JVM can't easily help itself while it's in an SW pause. Most
> > > solutions I've seen for handling such situations either check heartbeats
> > > on other nodes or run a supervisor process in parallel that can detect
> > > that the JVM running Ignite is in an SW pause.
> > >
> > > 2018-07-02 20:54 GMT+03:00 Denis Magda <[email protected]>:
> > >
> > > > Igniters,
> > > >
> > > > Pulling this discussion up. Any thoughts?
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Thu, Jun 21, 2018 at 3:52 PM Denis Magda <[email protected]> wrote:
> > > >
> > > > > Igniters,
> > > > >
> > > > > > It's a pleasure to see how our project is evolving in the direction of
> > > > > > being a self-healing solution:
> > > > >
> > > > > >    - Ignite can already handle critical failures such as OOM, File I/O
> > > > > >    issues, etc. [1]
> > > > > >    - There is an endeavor to fix cluster lock-ins due to partition map
> > > > > >    exchange issues. [2]
> > > > >
> > > > > > There is one more notorious problem that might affect Ignite
> > > > > > deployments, which is long stop-the-world GC pauses.
> > > > >
> > > > > > I know we made a little progress in this direction [3] by providing
> > > > > > particular metrics that help to monitor the pauses. Why don't we keep
> > > > > > up the pace and teach Ignite to help itself if it sees a node that
> > > > > > brings down overall cluster performance due to an STP?
> > > > >
> > > > > > I would create policies similar to the critical failures policies [4] or
> > > > > > just add a long STP to the list of critical failures and reuse existing
> > > > > > functionality.
> > > > >
> > > > > Thoughts? Anyone who'd like to implement the feature?
> > > > >
> > > > > [1] https://apacheignite.readme.io/docs/critical-failures-handling
> > > > > [2]
> > > > > http://apache-ignite-developers.2346864.n4.nabble.
> > > > com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
> > > > > [3] https://issues.apache.org/jira/browse/IGNITE-6171
> > > > > [4]
> > > > > https://apacheignite.readme.io/docs/critical-failures-
> > > > handling#section-failure-handling
> > > > >
> > > >
> > >
>
