Hello Nikolay,

I created one; it is available at [1].

Initially the intention was to develop it under IEP-47 [2], which even has
a separate section for Maintenance Mode.
But this feature looks useful in more cases and deserves its own IEP.

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation

On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov <nizhi...@apache.org>
wrote:

> Hello, Sergey!
>
> Thanks for the proposal.
> Let's have an IEP for this feature.
>
> > On Aug 27, 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
> >
> > Hello Igniters,
> >
> > I want to start a discussion about a new supporting feature that could be
> > very useful in many scenarios where persistent storage is involved:
> > Maintenance Mode.
> >
> > *Summary*
> > Maintenance Mode (MM for short) is a special state of an Ignite node in
> > which the node neither serves user requests nor joins the cluster, but
> > waits for user commands or performs automatic actions for maintenance
> > purposes.
> >
> > *Motivation*
> > There are situations when a node cannot participate in regular operations
> > but at the same time should not be shut down.
> >
> > One example is a ticket [1] where I developed the first draft of
> > Maintenance Mode.
> > Here a node has a potentially corrupted PDS and thus cannot proceed with
> > the restore routine and join the cluster as usual.
> > At the same time the node should neither fail nor be stopped for manual
> > cleanup.
> > Manual cleanup is not always an option (e.g. restricted access to the file
> > system); in managed environments a failed node will be restarted
> > automatically, so the user won't have time to perform the necessary
> > operations.
> > Thus the node needs to function in a special mode that allows the user to
> > connect to it and perform the necessary actions.
> >
> > Another example is described in IEP-47 [2], where defragmentation is being
> > developed. A node defragmenting its PDS should not join the cluster until
> > the process is finished, so it needs to enter Maintenance Mode as well.
> >
> > *Suggested design*
> > I suggest that MM work as follows:
> > 1. A node enters MM if special markers are found on disk. These markers,
> > called Maintenance Records, could be created automatically (e.g. when a
> > storage component detects corrupted storage) or by user request (when the
> > user requests defragmentation of some caches). So entering MM requires a
> > node restart.
> > 2. A node started in MM doesn't join the cluster but finishes its startup
> > routine, so it is able to receive commands and provide metrics to the user.
> > 3. When all necessary maintenance operations are finished, the Maintenance
> > Records for these operations are deleted from disk and the node is
> > restarted again to enter normal service.
> >
> > *Example*
> > To put it into context, let's consider an example of how I see the MM
> > workflow in the case of PDS corruption.
> >
> >   1. A node has failed in the middle of a checkpoint when WAL is disabled
> >   for a particular cache -> data files of the cache are potentially
> >   corrupted.
> >   2. On the next startup the node detects this situation, creates a
> >   Maintenance Record on disk and shuts down.
> >   3. On the next startup the node sees the Maintenance Record, enters
> >   Maintenance Mode and waits for the user to perform specific actions:
> >   clean the potentially corrupted PDS.
> >   4. When the user has done the necessary actions, he/she removes the
> >   Maintenance Record using the Maintenance Mode API exposed via the
> >   control.{sh|bat} script or JMX.
> >   5. On the next startup the node goes back to normal operations, as the
> >   maintenance reason is fixed.
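
The five steps above can be sketched roughly as the following startup
check. This is only an illustration under stated assumptions, not the code
from the PR: the class name, the enum, the marker file name and the
"storageLooksCorrupted" flag are all hypothetical and do not come from the
Ignite code base.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the workflow; not Ignite's actual implementation. */
class MaintenanceModeSketch {
    /** What the node does after inspecting its work directory on startup. */
    public enum StartupDecision { NORMAL, MAINTENANCE, SHUTDOWN_AFTER_REGISTERING_RECORD }

    /** Assumed name of the on-disk marker (the "Maintenance Record"). */
    static final String RECORD_FILE = "maintenance_record.txt";

    /**
     * Decides how to start. An existing record always wins (step 3);
     * otherwise a detected corruption registers a record and asks for
     * a shutdown (step 2); otherwise the node starts normally (step 5).
     */
    public static StartupDecision onStartup(Path workDir, boolean storageLooksCorrupted)
        throws IOException {
        Path record = workDir.resolve(RECORD_FILE);

        if (Files.exists(record))
            return StartupDecision.MAINTENANCE;

        if (storageLooksCorrupted) {
            Files.write(record, "corrupted-pds".getBytes(StandardCharsets.UTF_8));
            return StartupDecision.SHUTDOWN_AFTER_REGISTERING_RECORD;
        }

        return StartupDecision.NORMAL;
    }

    /** Step 4: the user removes the record once the cleanup is done. */
    public static boolean removeRecord(Path workDir) throws IOException {
        return Files.deleteIfExists(workDir.resolve(RECORD_FILE));
    }
}
```

In this sketch the record is checked before the corruption flag, so a node
that already has a record goes straight to Maintenance Mode and does not
re-run the detection that created it.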
> >
> >
> > I prepared a PR [3] for ticket [1] with a draft implementation. It is not
> > ready to be merged to the master branch but is already fully functional
> > and can be reviewed.
> >
> > Hope you'll share your feedback on the feature and/or any thoughts on
> > implementation.
> >
> > Thank you!
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > [3] https://github.com/apache/ignite/pull/8189
>
>
