Hello Igniters,

I want to start a discussion about a new supporting feature that could be very useful in many scenarios where persistent storage is involved: Maintenance Mode.

*Summary*

Maintenance Mode (MM for short) is a special state of an Ignite node in which the node neither serves user requests nor joins the cluster, but waits for user commands or performs automatic actions for maintenance purposes.

*Motivation*

There are situations when a node cannot participate in regular operations but at the same time should not be shut down.

One example is ticket [1], where I developed the first draft of Maintenance Mode. There, a node has a potentially corrupted PDS and thus cannot proceed with the restore routine and join the cluster as usual. At the same time, the node should neither fail nor be stopped for manual cleanup. Manual cleanup is not always an option (e.g. when access to the file system is restricted), and in managed environments a failed node is restarted automatically, so the user won't have time to perform the necessary operations. Thus the node needs to function in a special mode that allows the user to connect to it and perform the necessary actions.

Another example is described in IEP-47 [2], where defragmentation is being developed. A node defragmenting its PDS should not join the cluster until the process is finished, so it needs to enter Maintenance Mode as well.

*Suggested design*

I suggest MM work as follows:

1. A node enters MM if special markers are found on disk. These markers, called Maintenance Records, can be created automatically (e.g. when the storage component detects corrupted storage) or by user request (when the user requests defragmentation of some caches). So entering MM requires a node restart.
2. A node started in MM doesn't join the cluster but finishes the startup routine, so it is able to receive commands and provide metrics to the user.
3. When all necessary maintenance operations are finished, the Maintenance Records for these operations are deleted from disk and the node is restarted again to enter normal service.

*Example*

To put this into context, let's consider an example of how I see the MM workflow in the case of PDS corruption.

1. A node fails in the middle of a checkpoint while WAL is disabled for a particular cache, so the data files of that cache are potentially corrupted.
2. On the next startup the node detects this situation, creates a Maintenance Record on disk and shuts down.
3. On the following startup the node sees the Maintenance Record, enters Maintenance Mode and waits for the user to perform specific actions: cleaning the potentially corrupted PDS.
4. When the user has done the necessary actions, he/she removes the Maintenance Record using the Maintenance Mode API exposed via the control.{sh|bat} script or JMX.
5. On the next startup the node goes back to normal operations, as the reason for maintenance is fixed.

I prepared a PR [3] for ticket [1] with a draft implementation. It is not ready to be merged to the master branch, but it is already fully functional and can be reviewed.

I hope you'll share your feedback on the feature and/or any thoughts on the implementation.

Thank you!

[1] https://issues.apache.org/jira/browse/IGNITE-13366
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
[3] https://github.com/apache/ignite/pull/8189
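
To illustrate the marker-based flow from the suggested design, here is a minimal sketch in plain Java. It is not taken from the PR and does not use any Ignite API: the class name, the marker file name and all method names are hypothetical, and a single file in the work directory stands in for a Maintenance Record. It only shows the idea that the startup mode is decided by the presence of a record on disk, and that removing the record takes effect on the next restart.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: none of these names come from Ignite or from the PR. */
public class MaintenanceModeSketch {
    /** Hypothetical marker file name standing in for a Maintenance Record. */
    static final String RECORD_FILE = "maintenance_record.txt";

    enum StartupMode { NORMAL, MAINTENANCE }

    /** Design step 1: a record found on disk at startup means the node starts in MM. */
    static StartupMode resolveStartupMode(Path workDir) {
        return Files.exists(workDir.resolve(RECORD_FILE))
            ? StartupMode.MAINTENANCE
            : StartupMode.NORMAL;
    }

    /** Written automatically (e.g. corruption detected) or on user request (e.g. defragmentation). */
    static void registerRecord(Path workDir, String reason) throws IOException {
        Files.write(workDir.resolve(RECORD_FILE), reason.getBytes(StandardCharsets.UTF_8));
    }

    /** Design step 3: called once maintenance is done; returns true if a record was removed. */
    static boolean clearRecord(Path workDir) throws IOException {
        return Files.deleteIfExists(workDir.resolve(RECORD_FILE));
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("mm-sketch");

        // No record yet: the node would start normally.
        System.out.println(resolveStartupMode(workDir));

        // A record is created, so the next startup would enter Maintenance Mode.
        registerRecord(workDir, "potentially corrupted PDS");
        System.out.println(resolveStartupMode(workDir));

        // The user clears the record; the next startup would be normal again.
        clearRecord(workDir);
        System.out.println(resolveStartupMode(workDir));
    }
}
```

In this sketch each call to resolveStartupMode plays the role of one node startup, which matches the point above that entering and leaving MM both require a restart.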