[DISCUSS] CEP-10: Cluster and Code Simulations

bened...@apache.org Thu, 03 Jun 2021 12:19:28 -0700

Proposal for a mechanism to evaluate whole clusters, or individual classes, 
with a deterministically pseudorandom ordering of all thread and message events.

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-10%3A+Cluster+and+Code+Simulations

Evaluating the correctness of distributed systems is hard, as I’m sure every
developer on this list appreciates. As the project has matured, we have had to
grapple more with the guarantees we provide users for features we develop, and
the semantics we promise, particularly around edge-cases between two mechanisms
or systems.

This work aims to dramatically reduce the project overhead necessary for
delivering a bug-free Cassandra.

The premise is to intercept all relevant events that could be performed in a
different order, i.e. primarily message delivery and thread events such as
executor submission, signalling of threads, lock acquisition and release, and
even volatile reads and writes (to a lesser extent). These events are then
scheduled pseudo-randomly (with various restrictions to ensure a valid
execution), or in some cases not evaluated at all (to simulate e.g. messages
being lost). The result is a repeatable sequential evaluation of a
multi-threaded, multi-actor system.
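
As a rough illustration of the idea (a minimal sketch, not the CEP's
implementation), the following Java toy collects intercepted events in a
pending list and runs them one at a time in an order chosen by a seeded
random number generator, occasionally dropping one to mimic message loss.
The names SimulationSketch, SimulatedScheduler, simulate and dropChance are
hypothetical, and only message-like events are modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SimulationSketch {
    /** Hypothetical scheduler: holds intercepted events and runs them sequentially in a seeded order. */
    static final class SimulatedScheduler {
        private final Random random;
        private final double dropChance;
        private final List<Runnable> pending = new ArrayList<>();

        SimulatedScheduler(long seed, double dropChance) {
            this.random = new Random(seed);
            this.dropChance = dropChance;
        }

        /** Called wherever the real system would deliver a message or submit a task. */
        void submit(Runnable event) {
            pending.add(event);
        }

        /** Picks pending events pseudo-randomly until none remain; some are dropped instead of run. */
        void runUntilIdle() {
            while (!pending.isEmpty()) {
                Runnable next = pending.remove(random.nextInt(pending.size()));
                if (random.nextDouble() < dropChance) {
                    continue; // simulate a lost message: the event is never evaluated
                }
                next.run();
            }
        }
    }

    /** Runs a toy three-node exchange and returns the order in which events were evaluated. */
    static List<String> simulate(long seed) {
        List<String> trace = new ArrayList<>();
        SimulatedScheduler scheduler = new SimulatedScheduler(seed, 0.1);
        for (int node = 1; node <= 3; node++) {
            final int from = node;
            scheduler.submit(() -> {
                trace.add("node" + from + " sends PING");
                // The reply only becomes schedulable after the send has run,
                // which keeps every generated ordering causally valid.
                scheduler.submit(() -> trace.add("node" + from + " receives PONG"));
            });
        }
        scheduler.runUntilIdle();
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(simulate(42L));
        // Re-running with the same seed replays exactly the same interleaving.
        System.out.println(simulate(42L).equals(simulate(42L)));
    }
}
```

Because all nondeterminism comes from the seed, a failing run can be replayed
by reusing that seed, while different seeds explore different interleavings.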

This lets us evaluate a much broader range of cluster behaviours without any
additional development work, and implement a broad range of property-based
and related randomized acceptance tests without significant developer burden.

The work will apply just as readily to multi-threaded single classes as it will
to whole clusters, and will come with a linearizability test for LWTs as well
as a unit test for an existing multi-threaded bug that is otherwise hard to
exhibit.

To achieve this, significant modifications will be required to the codebase,
mostly cleaning up existing abstractions. Specifically, we will need to be able
to mock executors, any blocking concurrency primitives, time, filesystem access
and internode streaming.
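
To make "mockable" concrete, here is a minimal sketch, assuming components
receive their executor and time source by injection rather than calling
System.nanoTime() or creating threads directly. SimulatedClock,
QueueingExecutor and Heartbeat are hypothetical names invented for this
example and are not classes from the Cassandra codebase; only
java.util.concurrent.Executor and java.util.function.LongSupplier are real
JDK interfaces.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;
import java.util.function.LongSupplier;

final class MockableDependenciesSketch {
    /** Time source that only moves when the test (or a simulator) advances it. */
    static final class SimulatedClock implements LongSupplier {
        private long nanos;

        @Override
        public long getAsLong() {
            return nanos;
        }

        void advance(long deltaNanos) {
            nanos += deltaNanos;
        }
    }

    /** Executor that queues tasks instead of handing them to a thread. */
    static final class QueueingExecutor implements Executor {
        private final Queue<Runnable> tasks = new ArrayDeque<>();

        @Override
        public void execute(Runnable task) {
            tasks.add(task);
        }

        /** Runs the oldest queued task; returns false if the queue was empty. */
        boolean runNext() {
            Runnable task = tasks.poll();
            if (task == null) {
                return false;
            }
            task.run();
            return true;
        }
    }

    /** Hypothetical component that depends only on the injected abstractions. */
    static final class Heartbeat {
        private final Executor executor;
        private final LongSupplier clock;
        long lastBeatNanos = -1;

        Heartbeat(Executor executor, LongSupplier clock) {
            this.executor = executor;
            this.clock = clock;
        }

        void schedule() {
            executor.execute(() -> lastBeatNanos = clock.getAsLong());
        }
    }

    /** The test decides when time passes and when the queued task actually runs. */
    static long demo() {
        SimulatedClock clock = new SimulatedClock();
        QueueingExecutor executor = new QueueingExecutor();
        Heartbeat heartbeat = new Heartbeat(executor, clock);
        heartbeat.schedule();   // task is queued, not yet run
        clock.advance(5_000);   // simulated time passes first
        executor.runNext();     // now the task observes the advanced clock
        return heartbeat.lastBeatNanos;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In production the same component would be given a real thread pool and
System::nanoTime; under simulation it gets controllable stand-ins like these.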

The work is – in large part – already complete, with JIRA and PRs to follow in
the coming weeks. Of course, the work is subject to the usual community input
and review, so this does not preclude changes to the work (even significant
ones, if they are warranted). I know a lot of incoming CEPs are likely to be
backed up by significant off-list development as a result of the focus on a
shippable 4.0. Hopefully this is just a temporary growing pain, particularly as
we move towards a shippable trunk.

I hope this work will be of huge value to the project, particularly as we race
to catch up on years of limited feature development.

JIRA and PRs will follow, but I wanted to kick off discussion in advance.
