Proposal for a mechanism to evaluate whole clusters, or individual classes, 
with a deterministically pseudorandom ordering of all thread and message events.

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-10%3A+Cluster+and+Code+Simulations

Evaluating the correctness of distributed systems is hard, as I'm sure every 
developer on this list appreciates. As the project has matured, we have had to 
grapple more with the guarantees we provide users for features we develop, and 
the semantics we promise, particularly around edge cases between two mechanisms 
or systems.

This work aims to dramatically reduce the project overhead necessary for 
delivering a bug-free Cassandra.

The premise is to intercept all relevant events that could be performed in a 
different order, i.e. primarily message delivery and thread events such as 
executor submission, signalling of threads, lock acquisition and release, and 
even volatile reads and writes (to a lesser extent). These events are then 
scheduled pseudo-randomly (with various restrictions to ensure a valid 
execution), or in some cases not evaluated at all (to simulate e.g. messages 
being lost). The result is a repeatable sequential evaluation of a 
multi-threaded, multi-actor system.
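
To make the idea concrete, here is a minimal sketch of a seeded scheduler. It 
is an illustration only, not the code proposed in the CEP: the class 
SimulatedScheduler, its methods and the dropChance parameter are all 
hypothetical names, and it ignores the restrictions a real simulator needs to 
guarantee a valid execution (e.g. lock ordering or happens-before edges). 
Intercepted events are queued, then run one at a time in an order chosen by a 
seeded random number generator, with some events dropped to mimic lost 
messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: not the CEP-10 implementation.
public class SimulatedScheduler {
    private final Random random;                 // seeded, so every choice is repeatable
    private final List<Runnable> pending = new ArrayList<>();
    private final double dropChance;             // assumed knob for simulating lost messages

    public SimulatedScheduler(long seed, double dropChance) {
        this.random = new Random(seed);
        this.dropChance = dropChance;
    }

    // An intercepted event (message delivery, executor task, ...) is queued, not run.
    public void submit(Runnable event) {
        pending.add(event);
    }

    // Runs queued events sequentially, picking the next one pseudo-randomly.
    public void runAll() {
        while (!pending.isEmpty()) {
            Runnable next = pending.remove(random.nextInt(pending.size()));
            if (random.nextDouble() < dropChance)
                continue;                        // event is never evaluated, i.e. "lost"
            next.run();                          // may submit follow-up events
        }
    }

    // Builds a small scenario and returns the order in which events were observed.
    public static List<String> trace(long seed) {
        List<String> log = new ArrayList<>();
        SimulatedScheduler scheduler = new SimulatedScheduler(seed, 0.1);
        for (int i = 0; i < 5; i++) {
            String name = "msg-" + i;
            scheduler.submit(() -> {
                log.add(name);
                scheduler.submit(() -> log.add("ack-" + name));
            });
        }
        scheduler.runAll();
        return log;
    }

    public static void main(String[] args) {
        // The same seed always reproduces the same interleaving.
        System.out.println(trace(42L));
        System.out.println(trace(42L).equals(trace(42L)));
    }
}
```

Because every scheduling decision comes from the seeded generator, a failing 
run can be replayed exactly by reusing its seed, while a different seed 
explores a different interleaving.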

This permits us to evaluate a much broader range of cluster behaviours without 
any additional development work, and so to implement a wide variety of 
property-based and related randomized acceptance tests without significant 
developer burden.

The work will apply just as readily to multi-threaded single classes as it will 
to whole clusters, and will come with a linearizability test for LWTs as well 
as a unit test for an existing multi-threaded bug that is otherwise hard to 
exhibit.

To achieve this, significant modifications will be required to the codebase, 
mostly cleaning up existing abstractions. Specifically, we will need to be able 
to mock executors, any blocking concurrency primitives, time, filesystem access 
and internode streaming.
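
As a rough illustration of what mocking time could look like, the sketch below 
routes clock reads through an interface so that a simulation can substitute 
its own clock. Every name here (MockableTime, TimeSource, SimulatedTime, 
isExpired) is hypothetical and is not taken from the Cassandra codebase or the 
CEP; the same pattern would apply to executors or filesystem access.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: names are illustrative, not Cassandra APIs.
public class MockableTime {
    // Code under test reads time through this interface instead of System.nanoTime().
    public interface TimeSource {
        long nanoTime();
    }

    // Production binding: delegates to the real clock.
    public static final TimeSource SYSTEM = System::nanoTime;

    // Simulation binding: time only moves when the simulator advances it.
    public static final class SimulatedTime implements TimeSource {
        private final AtomicLong now = new AtomicLong();

        @Override
        public long nanoTime() {
            return now.get();
        }

        public void advance(long nanos) {
            now.addAndGet(nanos);
        }
    }

    // Example consumer: a timeout check that works against either binding.
    public static boolean isExpired(TimeSource time, long deadlineNanos) {
        return time.nanoTime() - deadlineNanos >= 0;
    }

    public static void main(String[] args) {
        SimulatedTime time = new SimulatedTime();
        long deadline = time.nanoTime() + 1_000;
        System.out.println(isExpired(time, deadline)); // false
        time.advance(1_000);
        System.out.println(isExpired(time, deadline)); // true
    }
}
```

With time under the simulator's control, timeouts fire at points it chooses 
rather than whenever the wall clock says so, which keeps runs repeatable.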

The work is, in large part, already complete, with JIRA and PRs to follow in 
the coming weeks. Of course, the work is subject to the usual community input 
and review, so this does not preclude changes to the work (even significant 
ones, if they are warranted). I know a lot of incoming CEPs are likely to be 
backed up by significant off-list development as a result of the focus on a 
shippable 4.0. Hopefully this is just a temporary growing pain, particularly as 
we move towards a shippable trunk.

I hope this work will be of huge value to the project, particularly as we race 
to catch up on years of limited feature development.

JIRA and PRs will follow, but I wanted to kick off discussion in advance.
