Hello Nikolay,

I created one; it is available at [1].

Initially the intention was to develop it under IEP-47 [2], and there is even a separate section for Maintenance Mode there. But it looks like this feature is useful in more cases and deserves its own IEP.

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation

On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov <nizhi...@apache.org> wrote:

> Hello, Sergey!
>
> Thanks for the proposal.
> Let’s have an IEP for this feature.
>
> > On Aug 27, 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
> >
> > Hello Igniters,
> >
> > I want to start a discussion about a new supporting feature that could be
> > very useful in many scenarios where persistent storage is involved:
> > Maintenance Mode.
> >
> > *Summary*
> > Maintenance Mode (MM for short) is a special state of an Ignite node in
> > which the node neither serves user requests nor joins the cluster, but
> > waits for user commands or performs automatic actions for maintenance
> > purposes.
> >
> > *Motivation*
> > There are situations when a node cannot participate in regular operations
> > but at the same time should not be shut down.
> >
> > One example is ticket [1], where I developed the first draft of
> > Maintenance Mode.
> > Here we get into a situation where a node has a potentially corrupted PDS
> > and thus cannot proceed with the restore routine and join the cluster as
> > usual.
> > At the same time, the node should neither fail nor be stopped for manual
> > cleanup.
> > Manual cleanup is not always an option (e.g. restricted access to the file
> > system); in managed environments a failed node will be restarted
> > automatically, so the user won't have time to perform the necessary
> > operations.
> > Thus the node needs to function in a special mode that allows the user to
> > connect to it and perform the necessary actions.
> >
> > Another example is described in IEP-47 [2], where defragmentation is being
> > developed. A node defragmenting its PDS should not join the cluster until
> > the process is finished, so it needs to enter Maintenance Mode as well.
> >
> > *Suggested design*
> > I suggest that MM work as follows:
> > 1. A node enters MM if special markers are found on disk. These markers,
> >    called Maintenance Records, can be created automatically (e.g. when a
> >    storage component detects corrupted storage) or by user request (when
> >    the user requests defragmentation of some caches). So entering MM
> >    requires a node restart.
> > 2. A node started in MM doesn't join the cluster but finishes its startup
> >    routine, so it is able to receive commands and provide metrics to the
> >    user.
> > 3. When all necessary maintenance operations are finished, the Maintenance
> >    Records for these operations are deleted from disk and the node is
> >    restarted again to enter normal service.
> >
> > *Example*
> > To put it into context, let's consider an example of how I see the MM
> > workflow in the case of PDS corruption.
> >
> > 1. A node fails in the middle of a checkpoint while WAL is disabled for a
> >    particular cache -> the data files of the cache are potentially
> >    corrupted.
> > 2. On the next startup the node detects this situation, creates a
> >    Maintenance Record on disk and shuts down.
> > 3. On the next startup the node sees the Maintenance Record, enters
> >    Maintenance Mode and waits for the user to take specific actions:
> >    clean the potentially corrupted PDS.
> > 4. When the user has taken the necessary actions, the user removes the
> >    Maintenance Record using the Maintenance Mode API exposed via the
> >    control.{sh|bat} script or JMX.
> > 5. On the next startup the node goes back to normal operations, as the
> >    reason for maintenance is fixed.
> >
> > I prepared a PR [3] for ticket [1] with a draft implementation. It is not
> > ready to be merged to the master branch but is already fully functional
> > and can be reviewed.
> >
> > I hope you'll share your feedback on the feature and/or any thoughts on
> > the implementation.
> >
> > Thank you!
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > [3] https://github.com/apache/ignite/pull/8189
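
As a rough illustration of the Maintenance Record lifecycle from the suggested design (a record is created on disk, the node checks for records at startup, and a record is removed once the maintenance is done), here is a minimal Java sketch. It is neither the code from the PR nor the Ignite API: the class `MaintenanceRecords`, its methods, the `Mode` enum and the marker file name `maintenance-records.txt` are all hypothetical names chosen for this example, and storing one record id per line is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of on-disk Maintenance Records; not the Ignite API. */
final class MaintenanceRecords {
    /** Mode the node would choose at startup. */
    enum Mode { NORMAL, MAINTENANCE }

    /** Marker file holding one record id per line (assumed format). */
    private final Path markerFile;

    MaintenanceRecords(Path workDir) {
        try {
            Files.createDirectories(workDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Hypothetical file name, used only in this example.
        this.markerFile = workDir.resolve("maintenance-records.txt");
    }

    /** Returns pending record ids; a missing marker file means none. */
    List<String> pending() {
        if (!Files.exists(markerFile)) {
            return new ArrayList<>();
        }
        try {
            return new ArrayList<>(Files.readAllLines(markerFile));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Registers a record; it only takes effect on the next startup. */
    void register(String id) {
        List<String> ids = pending();
        if (!ids.contains(id)) {
            ids.add(id);
            write(ids);
        }
    }

    /** Removes a record after the maintenance operation is finished. */
    void remove(String id) {
        List<String> ids = pending();
        if (ids.remove(id)) {
            write(ids);
        }
    }

    /** Startup check: any pending record means Maintenance Mode. */
    Mode startupMode() {
        return pending().isEmpty() ? Mode.NORMAL : Mode.MAINTENANCE;
    }

    /** Persists the ids, deleting the marker file when none remain. */
    private void write(List<String> ids) {
        try {
            if (ids.isEmpty()) {
                Files.deleteIfExists(markerFile);
            } else {
                Files.write(markerFile, ids);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a storage component would call `register(...)` when it detects a problem, the startup routine would call `startupMode()` before deciding whether to join the cluster, and a management command would call `remove(...)` once the user has cleaned up. Because the state lives only in the marker file, each mode change becomes visible on the next restart, which matches the restart-based flow in the proposal.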