[jira] [Updated] (HDDS-10879) Statemachine transaction resiliency for OM

Sumit Agrawal (Jira) Sun, 19 May 2024 22:42:10 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sumit Agrawal updated HDDS-10879:
---------------------------------
    Description: 
h2. Concerns with the current state machine: 
 # OM crashes and cannot provide service when an experimental feature that is 
not yet stable in the release is used
 # The OM state machine crashes because of a bug in a specific feature

 

{*}Impact{*}: Crash of OM / service not available
h3. {*}Solution Points{*}:
h4. *1. Support skipping a transaction through configuration*

Recovering the system often requires skipping a particular transaction. 
Otherwise the system remains inoperable.

When the failure is caused by a code bug in the operation, the user needs to be 
able to decide to skip that transaction and recover the system.
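
As an illustration of point 1, the following is a minimal sketch of a configuration-driven skip list. The class name, the configuration key, and the comma-separated format are all assumptions made for this example; none of them is an existing Ozone property or API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: decides whether a transaction should be skipped on apply. */
final class SkipTransactionFilter {
  // Hypothetical configuration key, used here for illustration only.
  static final String SKIP_KEY = "ozone.om.ratis.skip.transaction.indexes";

  private final Set<Long> skipped;

  /** @param configuredValue comma-separated transaction indexes, e.g. "101,205" */
  SkipTransactionFilter(String configuredValue) {
    Set<Long> indexes = new HashSet<>();
    if (configuredValue != null) {
      for (String part : configuredValue.split(",")) {
        String trimmed = part.trim();
        if (!trimmed.isEmpty()) {
          indexes.add(Long.parseLong(trimmed));
        }
      }
    }
    this.skipped = Collections.unmodifiableSet(indexes);
  }

  /** Returns true if the operator has asked to skip this transaction index. */
  boolean shouldSkip(long transactionIndex) {
    return skipped.contains(transactionIndex);
  }
}
```

The state machine would consult `shouldSkip` before applying a transaction. Whether a skipped transaction should still be recorded as applied is left open here.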

 
h4. *2. Failing an operation gracefully (without terminating) for specific transactions*

Operations can be divided into those that must crash the OM on failure and 
those that can simply fail:

*Critical Operations*
 * create/commit and other critical operations whose failure can leave the 
system inconsistent

*Non-critical Operations*
 * Internal cleanup, experimental features, and other operations that are 
repetitive in nature, have little impact on the system, and do not cause data 
loss or further failures.

Which operations fall into each category should be set in the configuration 
file for easy control.
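
The classification in point 2 could be sketched as below. This assumes the non-critical operation types are given as a comma-separated list of names and that anything not listed is treated as critical. The class, the enum, and the operation names used with it are illustrative, not existing Ozone code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical policy: maps a failed operation type to the action to take. */
final class FailurePolicy {
  enum Action { TERMINATE, FAIL_REQUEST }

  private final Set<String> nonCritical;

  /** @param configuredNonCriticalOps comma-separated operation type names */
  FailurePolicy(String configuredNonCriticalOps) {
    Set<String> names = new HashSet<>();
    if (configuredNonCriticalOps != null) {
      for (String name : configuredNonCriticalOps.split(",")) {
        String trimmed = name.trim();
        if (!trimmed.isEmpty()) {
          names.add(trimmed);
        }
      }
    }
    this.nonCritical = Collections.unmodifiableSet(names);
  }

  /** Unlisted operations default to TERMINATE, the conservative choice. */
  Action onFailure(String operationType) {
    return nonCritical.contains(operationType)
        ? Action.FAIL_REQUEST
        : Action.TERMINATE;
  }
}
```

Defaulting to `TERMINATE` preserves the current crash behavior for any operation the operator has not explicitly marked as safe to fail.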
h4. *3. Failing operation (operation timeout)*

An operation that takes longer than a threshold (for example, 10 minutes) 
should be terminated and marked as failed. Such an operation is likely stuck, 
and/or the system cannot complete it due to a lack of memory / CPU.

These operations should be failed using an interrupt (critical: crash the 
system; non-critical: fail the operation).

Metrics for the time taken by these operations are already captured.

The threshold needs to be configurable.
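
One possible shape for the timeout in point 3, using only the standard `java.util.concurrent` API, is sketched below. The class and enum names are hypothetical. Note that `Future.cancel(true)` only delivers an interrupt, so an operation that never checks its interrupt status or blocks in an uninterruptible call will not stop.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical runner that fails an operation exceeding a configured threshold. */
final class TimedOperationRunner implements AutoCloseable {
  enum Outcome { COMPLETED, TIMED_OUT, FAILED }

  // Daemon worker thread so a stuck operation cannot block JVM shutdown.
  private final ExecutorService executor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "om-timed-operation");
        t.setDaemon(true);
        return t;
      });
  private final long timeoutMillis;

  TimedOperationRunner(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  Outcome run(Runnable operation) {
    Future<?> future = executor.submit(operation);
    try {
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return Outcome.COMPLETED;
    } catch (TimeoutException e) {
      future.cancel(true); // interrupts the worker thread
      return Outcome.TIMED_OUT;
    } catch (ExecutionException e) {
      return Outcome.FAILED; // the operation itself threw
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the caller's interrupt
      return Outcome.FAILED;
    }
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

The caller would then map `TIMED_OUT` to a crash for critical operations or to a failed response for non-critical ones.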

 
h4. *4. Logging the failed operation*

An operation that is terminated abruptly should be logged with its operation 
type and transaction ID, so that it is clear which transaction failed. 
(Currently, it is logged only on normal failure.)
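
A minimal sketch of the logging in point 4 follows. It uses `java.util.logging` only to stay self-contained. The class name, the message format, and the `runAndLog` wrapper are assumptions for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper that logs abrupt failures before propagating them. */
final class FailedTransactionLogger {
  private static final Logger LOG =
      Logger.getLogger(FailedTransactionLogger.class.getName());

  /** Builds the log message; kept separate so the format is easy to check. */
  static String describe(String operationType, long transactionIndex) {
    return "Operation " + operationType
        + " failed abruptly at transaction index " + transactionIndex;
  }

  /** Runs the apply step and logs operation type and index if it throws. */
  static void runAndLog(String operationType, long transactionIndex,
      Runnable apply) {
    try {
      apply.run();
    } catch (RuntimeException | Error t) {
      LOG.log(Level.SEVERE, describe(operationType, transactionIndex), t);
      throw t; // the caller still decides whether to crash or fail
    }
  }
}
```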

 
h4. *5. Alternative approach to crash*

Crashes mostly happen during a Ratis transaction (write operation). So instead 
of crashing, write operations can be disabled and only read operations served.

This needs some way to elect a leader (or to identify a node) that provides 
the read service.
 * 
 ** Does this node need to withdraw from being a leader? If all nodes withdraw 
from being leader, it must be decided which node serves read operations.
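
The read-only fallback in point 5 could start from a simple gate like the sketch below. The class and method names are hypothetical. It covers only the per-node switch and does not address the open leadership question above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate: rejects writes after a fatal write failure, keeps reads. */
final class WriteGate {
  private final AtomicBoolean writesEnabled = new AtomicBoolean(true);

  /** Called instead of terminating when a write transaction fails fatally. */
  void disableWrites() {
    writesEnabled.set(false);
  }

  /** Reads are always allowed; writes only while the gate is open. */
  boolean isAllowed(boolean isWriteRequest) {
    return !isWriteRequest || writesEnabled.get();
  }
}
```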

  was:
h2. Concerns of current state machine: 
 # OM crash and unable to provide service due to use of experimental feature - 
not stable in the version
 # OM state machine crash due to bug in specific feature

 

{*}Impact{*}: Crash of OM / service not available
h3. {*}Solution Points{*}:
h4. *1. Need support skip of transaction - through configuration*

Many times for the recovery of the system, this needs support skip of the 
particular transaction. Otherwise the system becomes in-operable.

For code bugs in the operation, users need to make a decision to skip and 
recover the system.

 
h4. *2. Making operation failure smoothly (without terminating) for specific 
transaction*

It can segregate the type of operation which must crash and which can just fail,

*Critical Operation*
 * create/commit and other critical operation which can create in-consistency 
in system

*Non-critical Operation*
 * Internal cleanup, experimental features and other operation which does not 
create big impact to the system and do not cause data loss and further failure, 
and repetitive in nature.

The operation needs to be configured in the configuration file for easy control.
h4. *3. Failing operation*

Operation taking more time then threshold like, 10 minutes threshold, it should 
be terminated and making it failure. This is like the operation is stuck and/or 
the system is not able to complete due to lack of memory / cpu.

These operations should be failed (critical: causing crash of system, 
non-critical: making it failure) using interrupt.

Already we capture metrics for time taken by these operations.

Configuration of threshold is required.

 
h4. *4. Logging the failed operation*

It should log the failed operation terminated abruptly, with operation and 
transaction Id. This will be useful to know what transaction has failed. 
(currently, it logs only in normal failure).

 
h4. *5. Alternative approach to crash*

Crash mostly happens during ratis transaction (write operation). so instead of 
crashing, write operation can be disabled, and provide only read operation.

This needs some way that the leader is elected (or node is identified providing 
service) to provide read service.

** Does this node need to withdraw from being a leader? If all nodes withdraw 
from being leader, they need to check who will provide read operation.


> Statemachine transaction resiliency for OM
> ------------------------------------------
>
>                 Key: HDDS-10879
>                 URL: https://issues.apache.org/jira/browse/HDDS-10879
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>
> h2. Concerns with the current state machine: 
>  # OM crashes and cannot provide service when an experimental feature that 
> is not yet stable in the release is used
>  # The OM state machine crashes because of a bug in a specific feature
>  
> {*}Impact{*}: Crash of OM / service not available
> h3. {*}Solution Points{*}:
> h4. *1. Support skipping a transaction through configuration*
> Recovering the system often requires skipping a particular transaction. 
> Otherwise the system remains inoperable.
> When the failure is caused by a code bug in the operation, the user needs to 
> be able to decide to skip that transaction and recover the system.
>  
> h4. *2. Failing an operation gracefully (without terminating) for specific transactions*
> Operations can be divided into those that must crash the OM on failure and 
> those that can simply fail:
> *Critical Operations*
>  * create/commit and other critical operations whose failure can leave the 
> system inconsistent
> *Non-critical Operations*
>  * Internal cleanup, experimental features, and other operations that are 
> repetitive in nature, have little impact on the system, and do not cause 
> data loss or further failures.
> Which operations fall into each category should be set in the configuration 
> file for easy control.
> h4. *3. Failing operation (operation timeout)*
> An operation that takes longer than a threshold (for example, 10 minutes) 
> should be terminated and marked as failed. Such an operation is likely 
> stuck, and/or the system cannot complete it due to a lack of memory / CPU.
> These operations should be failed using an interrupt (critical: crash the 
> system; non-critical: fail the operation).
> Metrics for the time taken by these operations are already captured.
> The threshold needs to be configurable.
>  
> h4. *4. Logging the failed operation*
> An operation that is terminated abruptly should be logged with its operation 
> type and transaction ID, so that it is clear which transaction failed. 
> (Currently, it is logged only on normal failure.)
>  
> h4. *5. Alternative approach to crash*
> Crashes mostly happen during a Ratis transaction (write operation). So 
> instead of crashing, write operations can be disabled and only read 
> operations served.
> This needs some way to elect a leader (or to identify a node) that provides 
> the read service.
>  * 
>  ** Does this node need to withdraw from being a leader? If all nodes 
> withdraw from being leader, it must be decided which node serves read 
> operations.


