Re: [DISCUSSION] Maintenance Mode feature

Ivan Pavlukhin Mon, 31 Aug 2020 02:32:38 -0700

Hi Sergey,

Thank you for bringing attention to that important subject!


My note here is about adding one more cluster mode. As far as I know, we
already have 3 modes (inactive, read-only, read-write), and this proposal
adds one more. At first glance it could be hard for a user to understand
and use all the modes properly. Do we really need the whole spectrum?
Could we simplify things somehow?

2020-08-27 15:59 GMT+03:00, Sergey Chugunov <[email protected]>:
> Hello Nikolay,
>
> I created one; it is available at [1].
>
> Initially there was an intention to develop it under IEP-47 [2] and there
> is even a separate section for Maintenance Mode there.
> But it looks like this feature is useful in more cases and deserves its own
> IEP.
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>
> On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov <[email protected]>
> wrote:
>
>> Hello, Sergey!
>>
>> Thanks for the proposal.
>> Let's have an IEP for this feature.
>>
>> > On Aug 27, 2020, at 10:25, Sergey Chugunov <[email protected]> wrote:
>> >
>> > Hello Igniters,
>> >
>> > I want to start a discussion about a new supporting feature that could be
>> > very useful in many scenarios where persistent storage is involved:
>> > Maintenance Mode.
>> >
>> > *Summary*
>> > Maintenance Mode (MM for short) is a special state of an Ignite node in
>> > which the node neither serves user requests nor joins the cluster, but
>> > waits for user commands or performs automatic actions for maintenance
>> > purposes.
>> >
>> > *Motivation*
>> > There are situations when a node cannot participate in regular
>> > operations but at the same time should not be shut down.
>> >
>> > One example is ticket [1], where I developed the first draft of
>> > Maintenance Mode.
>> > Here we get into a situation where a node has a potentially corrupted
>> > PDS and thus cannot proceed with the restore routine and join the
>> > cluster as usual.
>> > At the same time, the node should neither fail nor be stopped for
>> > manual cleanup.
>> > Manual cleanup is not always an option (e.g. restricted access to the
>> > file system); in managed environments a failed node will be restarted
>> > automatically, so the user won't have time to perform the necessary
>> > operations.
>> > Thus the node needs to function in a special mode that allows the user
>> > to connect to it and perform the necessary actions.
>> >
>> > Another example is described in IEP-47 [2], where defragmentation is
>> > being developed. A node defragmenting its PDS should not join the
>> > cluster until the process is finished, so it needs to enter Maintenance
>> > Mode as well.
>> >
>> > *Suggested design*
>> > I suggest that MM work as follows:
>> > 1. A node enters MM if special markers are found on disk. These
>> > markers, called Maintenance Records, could be created automatically
>> > (e.g. when the storage component detects corrupted storage) or by user
>> > request (when the user requests defragmentation of some caches). So
>> > entering MM requires a node restart.
>> > 2. A node started in MM doesn't join the cluster but finishes its
>> > startup routine, so it is able to receive commands and provide metrics
>> > to the user.
>> > 3. When all necessary maintenance operations are finished, the
>> > Maintenance Records for these operations are deleted from disk and the
>> > node is restarted again to enter normal service.
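The startup decision in steps 1-3 above can be sketched as follows. This is only an illustration under stated assumptions, not code from the PR: the marker-file layout, the `.mntc` suffix, and the names `StartupMode`, `pending_records`, `register_record`, `clear_record` and `decide_startup_mode` are all hypothetical.

```python
import enum
import tempfile
from pathlib import Path
from typing import List

# Hypothetical suffix for Maintenance Record marker files (an assumption,
# not the on-disk format used by Ignite).
RECORD_SUFFIX = ".mntc"


class StartupMode(enum.Enum):
    NORMAL = "normal"            # join the cluster and serve requests
    MAINTENANCE = "maintenance"  # stay out of the cluster, accept commands


def pending_records(work_dir: Path) -> List[str]:
    """Return the names of maintenance operations that still have a record."""
    return sorted(p.stem for p in work_dir.glob(f"*{RECORD_SUFFIX}"))


def register_record(work_dir: Path, operation: str) -> None:
    """Create a record; it only takes effect on the next node start."""
    (work_dir / f"{operation}{RECORD_SUFFIX}").touch()


def clear_record(work_dir: Path, operation: str) -> None:
    """Delete the record once the maintenance operation has finished."""
    (work_dir / f"{operation}{RECORD_SUFFIX}").unlink(missing_ok=True)


def decide_startup_mode(work_dir: Path) -> StartupMode:
    """Enter maintenance mode if and only if some record is present on disk."""
    if pending_records(work_dir):
        return StartupMode.MAINTENANCE
    return StartupMode.NORMAL


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        print(decide_startup_mode(work_dir))   # no records yet
        register_record(work_dir, "defragmentation")
        print(decide_startup_mode(work_dir))   # record found on "restart"
        clear_record(work_dir, "defragmentation")
        print(decide_startup_mode(work_dir))   # back to normal service
```

Because the mode is chosen once at startup from what is on disk, both entering and leaving maintenance mode need a restart in this sketch, which matches the restart requirement stated in steps 1 and 3.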
>> >
>> > *Example*
>> > To put it into context, let's consider an example of how I see the MM
>> > workflow in case of PDS corruption.
>> >
>> >   1. A node fails in the middle of a checkpoint while WAL is disabled
>> >   for a particular cache -> the data files of that cache are
>> >   potentially corrupted.
>> >   2. On the next startup the node detects this situation, creates a
>> >   Maintenance Record on disk and shuts down.
>> >   3. On the next startup the node sees the Maintenance Record, enters
>> >   Maintenance Mode and waits for the user to take specific actions:
>> >   clean the potentially corrupted PDS.
>> >   4. When the user has taken the necessary actions, he/she removes the
>> >   Maintenance Record using the Maintenance Mode API exposed via the
>> >   control.{sh|bat} script or JMX.
>> >   5. On the next startup the node resumes normal operations, as the
>> >   reason for maintenance has been fixed.
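The five steps above amount to a small state machine over what survives a restart. Below is a minimal in-memory simulation of that sequence, assuming one Maintenance Record per affected cache; `NodeDisk`, `start_node`, `clean_cache` and `remove_record` are hypothetical names and do not correspond to the actual Ignite API or to the control script commands.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class NodeDisk:
    """State that survives node restarts (all field names are hypothetical)."""
    corrupted_caches: Set[str] = field(default_factory=set)
    maintenance_records: Set[str] = field(default_factory=set)


def start_node(disk: NodeDisk) -> str:
    """Return the outcome of a single startup attempt."""
    if disk.maintenance_records:
        # Step 3: a record exists, so wait for the user instead of joining.
        return "maintenance"
    if disk.corrupted_caches:
        # Step 2: corruption detected, so record it on disk and shut down.
        disk.maintenance_records.update(disk.corrupted_caches)
        return "shutdown"
    # Step 5: nothing left to fix, so join the cluster as usual.
    return "normal"


def clean_cache(disk: NodeDisk, cache: str) -> None:
    """User action in step 3: clean the potentially corrupted cache data."""
    disk.corrupted_caches.discard(cache)


def remove_record(disk: NodeDisk, cache: str) -> None:
    """User action in step 4: remove the record for the cleaned cache."""
    disk.maintenance_records.discard(cache)


if __name__ == "__main__":
    # Step 1: the node crashed and left one cache potentially corrupted.
    disk = NodeDisk(corrupted_caches={"cacheA"})
    print(start_node(disk))   # second startup: detects the problem
    print(start_node(disk))   # third startup: enters maintenance mode
    clean_cache(disk, "cacheA")
    remove_record(disk, "cacheA")
    print(start_node(disk))   # final startup: normal operations
```

One consequence of this particular sketch: if the record is removed without cleaning the cache, the next startup detects the corruption again and recreates the record rather than joining the cluster.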
>> >
>> >
>> > I prepared a PR [3] for ticket [1] with a draft implementation. It is
>> > not ready to be merged into the master branch, but it is already fully
>> > functional and can be reviewed.
>> >
>> > Hope you'll share your feedback on the feature and/or any thoughts on
>> > implementation.
>> >
>> > Thank you!
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
>> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> > [3] https://github.com/apache/ignite/pull/8189
>>
>>
>


-- 

Best regards,
Ivan Pavlukhin
