[ 
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-15501:
-------------------------------
    Description: 
Currently, we only test Ozone using traditional unit tests (UT), integration 
tests (IT), and acceptance tests. We had a MiniOzoneChaosCluster (fault 
injection testing), but it seems to be unused. I propose introducing 
distributed system testing and proofs so that the Ozone spec can serve as our 
shared mental model. Some regressions, such as breaking the majority commit 
contract (HDDS-15052), went undetected because we have no spec to act as the 
source of truth. Additionally, we sometimes rely on intuition alone to guide 
our implementation and fixes, which can cause regressions (for example, many 
ReplicationManager fixes are made only after an issue appears in production).
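
To illustrate what "the spec as the source of truth" could look like, here is a 
minimal sketch in Python. It is a toy model, not Ozone's actual write path and 
not the scenario from HDDS-15052: the replica count, the state shape, and all 
function names are assumptions made up for this example. It exhaustively 
explores every reachable state of a three-replica write and checks one 
invariant: a write may only be marked committed once a majority has 
acknowledged it.

```python
from collections import deque

REPLICAS = 3                      # assumed replica count for the toy model
MAJORITY = REPLICAS // 2 + 1      # 2 of 3


def majority_rule(acks):
    """Commit rule that honours the majority contract."""
    return sum(acks) >= MAJORITY


def any_ack_rule(acks):
    """Deliberately broken rule: commits after a single acknowledgement."""
    return sum(acks) >= 1


def next_states(state, commit_rule):
    """Yield every state reachable in one step from `state`."""
    acks, committed = state
    if committed:
        return  # terminal state in this toy model
    # Any replica that has not acknowledged yet may acknowledge next.
    for i in range(REPLICAS):
        if not acks[i]:
            yield (acks[:i] + (True,) + acks[i + 1:], False)
    # The client may commit whenever its rule allows it.
    if commit_rule(acks):
        yield (acks, True)


def invariant(state):
    """The spec: committed implies a majority of replicas acknowledged."""
    acks, committed = state
    return (not committed) or sum(acks) >= MAJORITY


def check(commit_rule):
    """Breadth-first search over all reachable states.

    Returns the first state that violates the invariant, or None.
    """
    init = ((False,) * REPLICAS, False)
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in next_states(state, commit_rule):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print("majority rule:", check(majority_rule))
    print("any-ack rule: ", check(any_ack_rule))
```

With the correct rule the search finds no violation. With the broken rule it 
returns a counterexample state in which the write is committed after only one 
acknowledgement. Tools such as TLA+ or the P framework do the same kind of 
exploration on much richer models.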

This is a parent task for the effort to introduce distributed system testing 
and proofs to verify the correctness of the Ozone implementation (e.g. partial 
write commit, container state transitions, replication manager, container 
replica management (i.e. how to reconcile eventually consistent heartbeats with 
strongly consistent Ratis in SCM), quasi-closed, the block deletion orphan 
issue, etc.).

Distributed system testing tools:
 - Jepsen, Elle, Maelstrom
 - Fray
 - Hypothesis (Hegel)
 - Antithesis (paid)

Distributed system proofs:
 - TLA+
 - Lean4
 - P framework

Real systems:
 - 3FS ([https://github.com/deepseek-ai/3FS/tree/main/specs]) - uses the P framework
 - AWS S3 
([https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/]
 and [https://p-org.github.io/P/casestudies/#case-studies])
 - etcd robustness test 
([https://github.com/etcd-io/etcd/tree/main/tests/robustness]) - uses 
Antithesis (among other things)

I would prefer to start with the P framework, since some storage systems 
already use it.

Having a real, industry-recognized spec helps instil confidence in Ozone's 
robustness. More importantly, distributed system testing gives us confidence 
that our changes will not introduce critical issues (as long as the affected 
part of the system is covered by the tests).

  was:
Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests. 
We had a MiniOzoneChaosCluster (fault injection testing), but it seems unused. 
I propose to introduce a distributed system testing and proofs system so that 
we can have the Ozone spec as the shared mental model. Some of the regressions 
for issues like breaking majority commit contract (HDDS-15052) is not detected 
since we don't have the spec as the source of truth. Additionally  sometimes 
simply we use our intuitions to guide our implementation and fixes.

This is a parent task for the effort to introduce distributed system testing 
and proofs to test the correctness of Ozone implementation (e.g. partial write 
commit, container state transitions, quasi closed, etc).

Distributed system testing tools:

- Jepsen, Ellen, Maelstorm
- Fray
- Hypothesis (Hegel)
- Antithesis (paid) 

Distributed system proofs:
- TLA+
- Lean4
- P framework

Real systems
- 3FS (https://github.com/deepseek-ai/3FS/tree/main/specs) - uses P framework
- AWS S3 
(https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
 and https://p-org.github.io/P/casestudies/#case-studies)
- etcd robustness test 
(https://github.com/etcd-io/etcd/tree/main/tests/robustness) - Uses antithesis 
(among other things) 

I prefer if we can start with P framework since some storage systems already 
used it.

Having a real industry-recognized spec helps to instil confidence in Ozone 
robustness. More importantly, distributed system testing allows us to have 
confidence that our changes will not introduce critical issues (as long as the 
system is covered by the test).


> Distributed System Testing in Ozone
> -----------------------------------
>
>                 Key: HDDS-15501
>                 URL: https://issues.apache.org/jira/browse/HDDS-15501
>             Project: Apache Ozone
>          Issue Type: Test
>          Components: test
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
