[ 
https://issues.apache.org/jira/browse/IGNITE-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-21140:
-------------------------------------
    Description: 
This epic covers issues that users may face when part of their data becomes 
unavailable for some reason, such as "node is lost" or "part of the storage is 
lost".

The following definitions will be used throughout:

Local partition states. A local property of a replica, storage, state machine, 
etc., associated with the partition (see the enum sketch after this list):
 * _Healthy_
The state machine is running; everything’s fine.
 * _Initializing_
The Ignite node is online, but the corresponding Raft group has not yet 
completed its initialization.
 * _Snapshot installation_
Full state transfer is taking place. Once it’s finished, the partition will 
become _healthy_ or _catching-up_. Until then, data can’t be read, and log 
replication is paused.
 * _Catching-up_
The node is replicating data from the leader, and its data lags slightly 
behind. This state can only be observed from the leader, because only the 
leader knows the latest committed index and the state of every peer.
 * _Broken_
Something is wrong with the state machine. Some data might be unavailable for 
reading, the log can’t be replicated, and the state won’t change without 
manual intervention.
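
A minimal sketch of these local states as a plain Java enum. This is 
illustrative only, not the actual Ignite API:

{code:java}
/** Local state of a single partition replica. Names are illustrative. */
public enum LocalPartitionState {
    /** State machine is running, everything is fine. */
    HEALTHY,

    /** Node is online, but the Raft group has not finished initialization. */
    INITIALIZING,

    /** Full state transfer in progress: no reads, log replication paused. */
    INSTALLING_SNAPSHOT,

    /** Replicating from the leader, data slightly behind. Leader-only view. */
    CATCHING_UP,

    /** State machine is broken; manual intervention is required. */
    BROKEN
}
{code}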

Global partition states. A global property of a partition that specifies its 
apparent functionality from the user’s point of view (see the derivation 
sketch after this list):
 * _Available partition_
Healthy partition that can process read and write requests. This means that the 
majority of peers are healthy at the moment.
 * _Read-only partition_
Partition that can process read requests but can’t process write requests. 
There’s no healthy majority, but there’s at least one alive (healthy or 
catching-up) peer that can serve historical read-only queries.
 * _Unavailable partition_
Partition that can’t process any requests.
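
The mapping from local to global states can be summarized in code. Below is a 
hedged sketch of the rule above (healthy majority means available, at least 
one alive peer means read-only, otherwise unavailable); the names and the 
exact rule are illustrative, not the actual implementation:

{code:java}
import java.util.Collection;

/** Global, user-visible state of a partition. Names are illustrative. */
enum GlobalPartitionState { AVAILABLE, READ_ONLY, UNAVAILABLE }

final class GlobalStateRule {
    /**
     * Derives the user-visible state from the reported local peer states.
     * Offline peers are assumed to report nothing, so the full group size
     * is passed in separately.
     */
    static GlobalPartitionState derive(int groupSize, Collection<LocalPartitionState> reported) {
        long healthy = reported.stream()
                .filter(s -> s == LocalPartitionState.HEALTHY)
                .count();

        long alive = reported.stream()
                .filter(s -> s == LocalPartitionState.HEALTHY || s == LocalPartitionState.CATCHING_UP)
                .count();

        if (healthy > groupSize / 2) {
            // Healthy majority: both reads and writes can be processed.
            return GlobalPartitionState.AVAILABLE;
        }

        if (alive > 0) {
            // No healthy majority, but historical read-only queries still work.
            return GlobalPartitionState.READ_ONLY;
        }

        return GlobalPartitionState.UNAVAILABLE;
    }
}
{code}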

Building blocks are a set of operations that can be executed by Ignite or by 
the user in order to improve the cluster state.

Each building block must be either an automatic action with a configurable 
timeout (if applicable) or a documented API, with mandatory 
diagnostics/metrics that allow users to make decisions about these actions. A 
sketch of a possible API follows the list.
 # An offline Ignite node is brought back online with all of its recent data.
_Not a disaster recovery mechanism, but worth mentioning._
A node with usable data that doesn’t require a full state transfer will become 
a peer and participate in voting and replication, allowing the partition to be 
_available_ if the majority is healthy. This is the best case for the user: 
they simply restart the offline nodes and the cluster continues to be operable.
 # Automatic group scale-down.
Should happen when an Ignite node stays offline for too long.
_Not a disaster recovery mechanism, but worth mentioning._
It only happens when the majority is online, meaning that user data is safe.
 # Manual partition restart.
Should be performed for broken peers.
 # Manual group peers/learners reconfiguration.
Should be performed on a group if its majority is considered permanently lost.
 # Freshly re-entering the group.
Should happen when an Ignite node returns to the group but its partition data 
is missing.
 # Cleaning the partition data.
If, for some reason, we know that a certain partition on a certain node is 
broken, we may ask Ignite to drop its data and re-enter the group empty (as 
described in item 5).
Having a dedicated operation for cleaning the partition is preferable, because:
 ## the partition is stored in several storages
 ## not all of them use a “file per partition” storage format, far from it
 ## there’s also the Raft log that most likely should be cleaned
 ## and maybe the Raft metadata as well
 # Partial truncation of the log’s suffix.
This is a partial cleanup of partition data. It might be useful when we know 
that there’s junk in the log but the storages are not corrupted, so there’s a 
chance to save some data. It can be replaced with “clean partition data”.
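
To make the manual operations above more concrete, here is a hedged sketch of 
what such an API could look like, covering items 3, 4 and 6 as well as the 
state views discussed below. All names and signatures are hypothetical; the 
actual API is to be designed within this epic:

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Hypothetical facade for the manual recovery operations described above. */
public interface DisasterRecovery {
    /** Item 3: restarts a broken partition replica on the given node. */
    CompletableFuture<Void> restartPartition(String nodeName, String zoneName, int partitionId);

    /** Item 4: forcefully reconfigures peers when the majority is permanently lost. */
    CompletableFuture<Void> resetPeers(String zoneName, int partitionId, Set<String> newPeers);

    /**
     * Item 6: drops all partition data (storages, Raft log, Raft meta) so that
     * the node re-enters the group empty.
     */
    CompletableFuture<Void> cleanPartition(String nodeName, String zoneName, int partitionId);

    /** Global states of all partitions of the given zone, keyed by partition ID. */
    Map<Integer, GlobalPartitionState> globalPartitionStates(String zoneName);

    /** Local states of the given zone's partitions on the given node. */
    Map<Integer, LocalPartitionState> localPartitionStates(String zoneName, String nodeName);
}
{code}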

In order for the user to make decisions about manual operations, we must 
expose partition states, both global and local, for all partitions in all 
tables/zones. Global states are more important, because they directly 
correlate with the user experience.
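
Given state views like the ones sketched above, a user’s decision flow could 
look like this (again, purely hypothetical code):

{code:java}
import java.util.Map;
import java.util.Set;

class RecoveryDecisionExample {
    /** Hypothetical flow: find unavailable partitions and force a new peer set. */
    static void recoverUnavailablePartitions(DisasterRecovery recovery) {
        Map<Integer, GlobalPartitionState> states = recovery.globalPartitionStates("myZone");

        states.forEach((partitionId, state) -> {
            if (state == GlobalPartitionState.UNAVAILABLE) {
                // Majority is permanently lost: manual reconfiguration (item 4).
                recovery.resetPeers("myZone", partitionId, Set.of("node-1", "node-2", "node-3"));
            }
        });
    }
}
{code}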

Some states will automatically lead to “available” partitions if the system as 
a whole is healthy and we simply wait for some time: for example, we wait 
until a snapshot installation or a rebalance is complete. This is not 
considered a building block, because it’s a natural artifact of the 
architecture.

The current list is not exhaustive; it consists of basic actions that we could 
implement to cover a wide range of potential issues.
Any further addition to the list of building blocks would simply refine it, 
potentially allowing users to recover faster, with less data lost, or with no 
data loss at all. Such refinements will be considered in later releases.

> Ignite 3 Disaster Recovery
> --------------------------
>
>                 Key: IGNITE-21140
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21140
>             Project: Ignite
>          Issue Type: Epic
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>