[
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087608#comment-18087608
]
Ivan Andika commented on HDDS-15501:
------------------------------------
Thanks for the reference. TLA+ might work, but seems to be too mathematical. I
have seen it be applied to ZooKeeper, Raft, etc but not sp much on storage
system
> I've not tried myself but I suspect we can leverage AI for TLA+.
Yes, we can try to use https://github.com/shenli/distributed-system-testing ,
but I think we should try to write it by hand first since I suspect I don't
think we can tell between valid and invalid spec.
> Distributed System Testing in Ozone
> -----------------------------------
>
> Key: HDDS-15501
> URL: https://issues.apache.org/jira/browse/HDDS-15501
> Project: Apache Ozone
> Issue Type: Test
> Components: test
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests.
> We had a MiniOzoneChaosCluster (fault injection testing), but it seems
> unused. I propose to introduce a distributed system testing and proofs system
> so that we can have the Ozone spec as the shared mental model. Some of the
> regressions for issues like breaking majority commit contract (HDDS-15052) is
> not detected since we don't have the spec as the source of truth.
> Additionally sometimes simply we use our intuitions to guide our
> implementation and fixes.
> This is a parent task for the effort to introduce distributed system testing
> and proofs to test the correctness of Ozone implementation (e.g. partial
> write commit, container state transitions, quasi closed, etc).
> Distributed system testing tools:
> - Jepsen, Ellen, Maelstorm
> - Fray
> - Hypothesis (Hegel)
> - Antithesis (paid)
> Distributed system proofs:
> - TLA+
> - Lean4
> - P framework
> Real systems
> - 3FS (https://github.com/deepseek-ai/3FS/tree/main/specs) - uses P framework
> - AWS S3
> (https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
> and https://p-org.github.io/P/casestudies/#case-studies)
> - etcd robustness test
> (https://github.com/etcd-io/etcd/tree/main/tests/robustness) - Uses
> antithesis (among other things)
> I prefer if we can start with P framework since some storage systems already
> used it.
> Having a real industry-recognized spec helps to instil confidence in Ozone
> robustness. More importantly, distributed system testing allows us to have
> confidence that our changes will not introduce critical issues (as long as
> the system is covered by the test).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]