Proposal for a mechanism to evaluate whole clusters, or individual classes, with a deterministically pseudorandom ordering of all thread and message events.
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-10%3A+Cluster+and+Code+Simulations

Evaluating the correctness of distributed systems is hard, as I'm sure every developer on this list appreciates. As the project has matured, we have had to grapple more with the guarantees we provide users for features we develop, and the semantics we promise, particularly around edge cases between two mechanisms or systems. This work aims to dramatically reduce the project overhead necessary for delivering a bug-free Cassandra.

The premise is to intercept all relevant events that could be performed in a different order, i.e. primarily message delivery and thread events such as executor submission, signalling of threads, lock acquisition and release, and even volatile reads and writes (to a lesser extent). These events are then scheduled pseudo-randomly (with various restrictions to ensure a valid execution), or in some cases not evaluated at all (to simulate e.g. messages being lost). The result is a repeatable sequential evaluation of a multi-threaded, multi-actor system.

This lets us evaluate a much broader range of cluster behaviours without any additional development work, and implement a broad range of property-based and related randomized acceptance tests without significant developer burden. The work will apply just as readily to multi-threaded single classes as it will to whole clusters, and will come with a linearizability test for LWTs as well as a unit test for an existing multi-threaded bug that is otherwise hard to exhibit.

To achieve this, significant modifications will be required to the codebase, mostly cleaning up existing abstractions. Specifically, we will need to be able to mock executors, any blocking concurrency primitives, time, filesystem access and internode streaming. The work is, in large part, already complete, with JIRA and PRs to follow in the coming weeks.
Of course, the work is subject to the usual community input and review, so this does not preclude changes to the work (even significant ones, if they are warranted). I know a lot of incoming CEPs are likely to be backed by significant off-list development as a result of the focus on a shippable 4.0. Hopefully this is just a temporary growing pain, particularly as we move towards a shippable trunk. I hope this work will be of huge value to the project, particularly as we race to catch up on years of limited feature development. JIRA and PRs will follow, but I wanted to kick off discussion in advance.
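
To make the premise above a little more concrete, here is a toy sketch of seeded, single-threaded event scheduling. It is purely illustrative and is not the proposed implementation: the names SimulationSketch, Scheduler, Event and simulate, and the 20% drop chance, are all made up for this example. Pending events (standing in for message deliveries or executor tasks) are picked using a seeded java.util.Random, occasionally dropped to mimic message loss, and a response only becomes schedulable once its request has run, which is a crude stand-in for the restrictions that keep an execution valid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; none of these names come from the CEP or the Cassandra codebase.
final class SimulationSketch {

    /** A pending event: a labelled action, e.g. a message delivery or an executor task. */
    static final class Event {
        final String label;
        final Runnable action;

        Event(String label, Runnable action) {
            this.label = label;
            this.action = action;
        }
    }

    /** Single-threaded scheduler that chooses the next pending event with a seeded RNG. */
    static final class Scheduler {
        private final Random random;
        private final double dropChance; // assumed knob for simulating lost messages
        private final List<Event> pending = new ArrayList<>();
        private final List<String> trace = new ArrayList<>();

        Scheduler(long seed, double dropChance) {
            this.random = new Random(seed);
            this.dropChance = dropChance;
        }

        void submit(String label, Runnable action) {
            pending.add(new Event(label, action));
        }

        /** Runs events one at a time until none remain, returning the order of decisions. */
        List<String> runToCompletion() {
            while (!pending.isEmpty()) {
                // The only source of nondeterminism is the seeded RNG.
                Event next = pending.remove(random.nextInt(pending.size()));
                if (random.nextDouble() < dropChance) {
                    trace.add("drop " + next.label); // event is never evaluated
                    continue;
                }
                trace.add("run " + next.label);
                next.action.run(); // may submit follow-up events
            }
            return trace;
        }
    }

    /** Simulates three request/response exchanges under the given seed. */
    static List<String> simulate(long seed) {
        Scheduler scheduler = new Scheduler(seed, 0.2);
        for (int node = 1; node <= 3; node++) {
            final int id = node;
            // A response is only submitted once its request has actually run.
            scheduler.submit("request->node" + id,
                             () -> scheduler.submit("response<-node" + id, () -> {}));
        }
        return scheduler.runToCompletion();
    }

    public static void main(String[] args) {
        System.out.println(simulate(42L));
        System.out.println(simulate(42L)); // same seed, so the same trace as the line above
        System.out.println(simulate(43L)); // a different seed may give a different interleaving
    }
}
```

Because every scheduling decision flows from the seed, a failing run can be replayed exactly by re-running with the same seed. The real work has to intercept far more than this (locks, thread signalling, time, filesystem access, streaming), which is where the codebase changes mentioned above come in.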