ifesdjeen commented on code in PR #3005:
URL: https://github.com/apache/cassandra/pull/3005#discussion_r1434529331
##########
test/fuzz/main/README.md:
##########
@@ -0,0 +1,658 @@
+# Harry, a fuzz testing tool for Apache Cassandra
+
+The project aims to generate _reproducible_ workloads that are as close to
real-life as possible, while being able to
+_efficiently_ verify the cluster state against the model without pausing the
workload itself.
+
+## Getting Started in under 5 minutes
+
+Harry can operate as a straightforward read/write "correctness stress tool" that verifies that reads from a cluster are consistent with what it knows it wrote to the cluster. You have a couple of options for this.
+
+### Option 2: Running things manually, lower in the stack
+
+The Makefile has a stress target where you can directly access all available ARGS rather than restricting yourself to the convenience script above. If you're using an external cluster (i.e. `./bin/cassandra -f`, CCM, Docker, Kubernetes, or just a deployed cluster), the mini stress tool can be used directly as follows:
+
+To start a concurrent read/write workload, with 2 read and 2 write threads running for 2 minutes against an in-JVM cluster, you can use the following code:
+
+```
+try (Cluster cluster = builder().withNodes(3)
+                                .start())
+{
+    SchemaSpec schema = new SchemaSpec("harry", "test_table",
+                                       asList(pk("pk1", asciiType), pk("pk2", int64Type)),
+                                       asList(ck("ck1", asciiType), ck("ck2", int64Type)),
+                                       asList(regularColumn("regular1", asciiType), regularColumn("regular2", int64Type)),
+                                       asList(staticColumn("static1", asciiType), staticColumn("static2", int64Type)));
+
+    Configuration config = HarryHelper.defaultConfiguration()
+                                      .setKeyspaceDdl(String.format("CREATE KEYSPACE IF NOT EXISTS %s WITH replication = {'class': 'SimpleStrategy', 'replication_factor': %d};", schema.keyspace, 3))
+                                      .setSUT(() -> new InJvmSut(cluster))
+                                      .build();
+
+    Run run = config.createRun();
+
+    concurrent(run, config,
+               asList(pool("Writer", 2, MutatingVisitor::new),
+                      pool("Reader", 2, RandomPartitionValidator::new)),
+               2, TimeUnit.MINUTES)
+    .run();
+}
+```
+
+# I've found a falsification. What now?
+
+There is no one-size-fits-all solution for debugging a falsification. We did try to create a shrinker, but unfortunately, without the Simulator, the shrinker only works for issues that are non-concurrent in nature, since there is no way to create a stable repro otherwise. That said, there are several things that might get you started and inspire further ideas about how to debug the issue.
+
+First of all, determine whether the issue is likely to be concurrent in nature. If you re-run your test with the same seed but see no falsification, and it fails only sporadically, often at a different logical timestamp, it is likely that the issue is, in fact, concurrent. It is important to note that when you run a concurrent read/write workload, you will get a different interleaving of reads and writes every time. If you have reason to think that you're seeing the falsification because a read has queried a specific partition state, try re-running your test with the sequential runner (`--write-before-read` if you are using ministress).
+
+If you can get a stable repro with a sequential runner, you're in luck. Now all you need to do is add logging everywhere and work out what is causing the issue. But even if you do not have a stable repro, you will likely follow the same steps:
+
+* Inspect the error itself. Do the results returned by Cassandra make sense? Is anything out of order? Are there any
+  duplicates or gaps?
+* Switch to the logging mutating visitor and closely inspect its output. Closely inspect the output of the model. Do
+  the values make sense?
+* Check the output of the data tracker. Does the model or Cassandra have missing columns or rows? Do these outputs
+  contain the latest logical timestamps for each of the operations from the log? How about in-flight operations?
+* Filter out the relevant operation log entries and inspect them closely. Given these operations, does the output of
+  the model, or the output of the database, make the most sense?
+
+Next, you might want to try to narrow down the scope of the problem. Depending
on what the falsification looks like, use
+your Cassandra knowledge to see what might apply in your situation:
+
+* Try changing the schema to use different column types and see whether the issue still reproduces.
+* Try disabling range deletes, regular deletes, or column deletes.
+* Try changing the partition size and see if the issue still reproduces.
+* Try disabling static columns.
+
+Rather than listing every feature in Harry, suffice it to say that you should enable/disable features that make sense in the given context, looking either for a combination that avoids the failure, or for a minimal combination that still reproduces the issue. Your first goal should be to find a _stable repro_, even if it involves modifying Cassandra or Harry, or taking the operations and composing the repro manually. Having a stable repro will make finding the cause much simpler. Sometimes you will find the cause before you have a stable repro, in which case you _still_ have to produce a stable repro, both to make things simpler for the reviewer and to include it in the test suite of your patch.
+
+Lastly, *be patient*. Debugging falsifications is often a multi-hour
endeavour, and things do not always jump out at you,
+so you might have to spend a significant amount of time tracking the problem
down. Once you have found it, it is very
+rewarding.
+
+## Further Reading
+* [Harry: An open-source fuzz testing and verification tool for Apache
Cassandra](https://cassandra.apache.org/_/blog/Harry-an-Open-Source-Fuzz-Testing-and-Verification-Tool-for-Apache-Cassandra.html)
+
+---
+# Technical and Implementation Details
+
+## System Under Test implementations
+
+* `in_jvm/InJvmSut` - simple in-JVM-dtest system under test.
+* `println/PrintlnSut` - system under test that prints to stdout instead of executing queries on the cluster; useful
+  for debugging.
+* `mixed_in_jvm/MixedVersionInJvmSut` - in-JVM-dtest system under test that
works with mixed version clusters.
+* `external/ExternalClusterSut` - system under test that works with CCM, Docker, Kubernetes, or a cluster you may have
+  deployed elsewhere.
+
+Both in-JVM SUTs have fault-injecting functionality available.
+
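The SUT abstraction can be pictured as a small interface. The sketch below is illustrative plain Java, not Harry's actual API; the names `SutSketch` and `PrintlnSutSketch` are invented here. The point it makes is that anything able to execute a statement can serve as a system under test, which is what makes a println-based debugging SUT possible.

```java
import java.util.List;

// Illustrative sketch of the system-under-test abstraction: anything that can
// execute a CQL statement qualifies, including a stdout-only debugging stub.
interface SutSketch
{
    Object[][] execute(String statement, Object... bindings);
}

// Analogue of a println-style SUT: prints statements instead of executing them.
class PrintlnSutSketch implements SutSketch
{
    @Override
    public Object[][] execute(String statement, Object... bindings)
    {
        System.out.println(statement + " " + List.of(bindings));
        return new Object[0][]; // no cluster behind it, so no results
    }
}
```

Swapping the SUT implementation is then a one-line change in the configuration, which is why the same workload can run against in-JVM, external, or stdout-only targets.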
+## Visitors
+
+* `single/SingleValidator` - visitor that runs several different read queries against a single partition associated
+  with the current logical timestamp, and validates their results using the given model.
+* `all_partitions/AllPartitionsValidator` - concurrently validates all
partitions that were visited during this run.
+* `repair_and_validate_local_states/RepairingLocalStateValidator` - similar to
`AllPartitionsValidator`, but performs
+ repair before checking node states.
+* `mutating/MutatingVisitor` - visitor that performs all sorts of mutations.
+* `logging/LoggingVisitor` - similar to `MutatingVisitor`, but also logs all
operations to a file; useful for debug
+ purposes.
+* `corrupting/CorruptingVisitor` - visitor that will deliberately change data
in the partition it visits. Useful for
+ negative tests (i.e. to ensure that your model actually detects data
inconsistencies).
+
+And more.
+
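All of these visitors share one simple shape. The following is an illustrative sketch in plain Java rather than Harry's actual interface (`VisitorSketch` and `LoggingVisitorSketch` are invented names): each visitor is handed a logical timestamp and decides what to do with the corresponding operation, be it mutate, validate, log, or corrupt.

```java
// Illustrative sketch: every visitor is driven by logical timestamps, and a
// runner decides how often and in what order visitors are invoked. A logging
// visitor wraps another visitor and records each visit before delegating.
interface VisitorSketch
{
    void visit(long lts);
}

class LoggingVisitorSketch implements VisitorSketch
{
    private final VisitorSketch delegate;
    private final StringBuilder log = new StringBuilder();

    LoggingVisitorSketch(VisitorSketch delegate) { this.delegate = delegate; }

    @Override
    public void visit(long lts)
    {
        log.append("visiting lts ").append(lts).append('\n'); // record, then delegate
        delegate.visit(lts);
    }

    String log() { return log.toString(); }
}
```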
+## Models
+
+* `querying_no_op/QueryingNoOpValidator` - a model that can be used to
"simply" run random queries.
+* `quiescent_checker/QuiescentChecker` - a model that can be used to verify results of any read that has no writes
+  to the same partition concurrent to it. Should be used in conjunction with the locking data tracker.
+* `quiescent_local_state_checker/QuiescentLocalStateChecker` - a model that can check the local state of each replica
+  that has to own the partition being validated.
+
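The core idea behind quiescent checking can be sketched in a few lines of plain Java (`QuiescentCheckSketch` is an illustrative name, not Harry's actual class): when no writes to the partition are in flight, the rows the model expects must match the rows the database returned, in the same order.

```java
import java.util.List;
import java.util.Objects;

// Illustrative sketch of quiescent checking: with no in-flight writes to the
// partition, the model's expected rows must equal the observed rows exactly.
class QuiescentCheckSketch
{
    static void check(List<String> expected, List<String> observed)
    {
        if (expected.size() != observed.size())
            throw new AssertionError("Row count mismatch: expected " + expected.size()
                                     + " but observed " + observed.size());
        for (int i = 0; i < expected.size(); i++)
            if (!Objects.equals(expected.get(i), observed.get(i)))
                throw new AssertionError("Mismatch at row " + i + ": expected "
                                         + expected.get(i) + " but observed " + observed.get(i));
    }
}
```

This is also why the quiescent checkers need a locking data tracker: the equality check is only sound once concurrent writers to the partition are excluded.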
+## Runners
+
+* `sequential/SequentialRunner` - runs all visitors sequentially, in a loop, for a specified amount of time; useful
+  for simple tests that do not need to exercise the concurrent read/write path.
+* `concurrent/ConcurrentRunner` - runs all visitors concurrently, each visitor in its own thread, looped, for a
+  specified amount of time; useful for things like concurrent read/write workloads.
+* `chain/ChainRunner` - receives other runners as input and runs them one after another, once. Useful for both simple
+  and complex scenarios that involve read/write workloads, validating all partitions, and exercising other node-local
+  or cluster-wide operations.
+* `staged/StagedRunner` - receives other runners (stages) as input, and runs
them one after another in a loop; useful
+ for implementing complex scenarios, such as read/write workloads followed by
some cluster changing operations.
+
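To make the difference between chained and staged execution concrete, here is a minimal sketch in plain Java (`RunnerSketch` is an invented name, not one of Harry's runner classes): a chain runs its delegates once in order, while a staged runner repeats the whole chain until a deadline passes.

```java
import java.util.List;

// Illustrative sketch: a "chain" runs each step once in order; a "staged"
// runner repeats the whole chain until the deadline passes.
class RunnerSketch
{
    static void chain(List<Runnable> steps)
    {
        for (Runnable step : steps)
            step.run();
    }

    static void staged(List<Runnable> stages, long durationMillis)
    {
        long deadline = System.currentTimeMillis() + durationMillis;
        while (System.currentTimeMillis() < deadline)
            chain(stages);
    }
}
```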
+## Clock
+
+* `approximate_monotonic/ApproximateMonotonicClock` - a timestamp supplier implementation that tries to keep as close
+  to real time as possible, while preserving a mapping from real time to logical timestamps.
+* `offset/OffsetClock` - a (monotonic) clock that supplies timestamps that do
not have any relation to real time.
+
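The offset-clock idea can be sketched in plain Java (`OffsetClockSketch` is illustrative, not `OffsetClock`'s actual implementation): logical timestamps come from a monotonically increasing counter, and "real" timestamps are derived from them by a fixed offset, never from wall time, which keeps replays reproducible.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a monotonic logical clock: real timestamps are
// derived from the logical counter by a fixed offset, never from wall time.
class OffsetClockSketch
{
    private final AtomicLong lts = new AtomicLong();
    private final long offset;

    OffsetClockSketch(long offset) { this.offset = offset; }

    long nextLts() { return lts.incrementAndGet(); } // next logical timestamp
    long rts(long lts) { return offset + lts; }      // derived "real" timestamp
}
```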
+# Introduction
+
+Harry has two primary modes of functionality:
+
+* Unit test mode: in which you define specific sequences of
+ operations and let Harry test these operations using different
+ schemas and conditions.
+* Exploratory/fuzz mode: in which you define distributions of events
+  rather than the sequences themselves, and let Harry try out
+  different things.
+
+Usually, in unit-test mode, we apply several write operations to the cluster and then run different read queries and validate their results. To learn more about writing unit tests, refer to the "Writing Unit Tests" section.
+
+In exploratory mode, we continuously apply write operations to the cluster and validate its state, allowing the data size to grow and simulating real-life behaviour. To learn more about implementing test cases using fuzz mode, refer to the "Implementing Tests" section of this guide; it's likely you'll have to read the rest of this document to implement more complex scenarios.
+
+# Writing Unit Tests
+
+To write unit tests with Harry, no special knowledge is required. Usually, unit tests are written by simply hardcoding the schema, writing several modification statements one after the other, and then manually validating the results of a `SELECT` query. This might work for simple scenarios, but there's still a chance that for some other schema, or some combination of values, the tested feature may not work.
+
+To improve the situation, we can express the test in more abstract terms and,
instead of writing it using specific
+statements, we can describe which statement _types_ are to be used:
+
+```
+test(new SchemaGenerators.Builder("harry")
+ .partitionKeySpec(1, 5)
+ .clusteringKeySpec(1, 5)
+ .regularColumnSpec(1, 10)
+ .generator(),
+ historyBuilder -> {
+ historyBuilder.insert();
+ historyBuilder.deletePartition();
+ historyBuilder.deleteRowSlice();
+ });
+```
+
+This spec can be used to generate clusters of different sizes, configured with different schemas, executing the given sequence of actions both in isolation, and combined with other randomly generated ones, with failure injection.
+
+Best of all, this test will _not only_ ensure that such a sequence of actions does not produce an exception, but will also ensure that the cluster responds with correct results to _any_ allowed read query.
+
+To begin specifying operations for a new partition, either start calling methods on the `HistoryBuilder` directly, or, if you would like to specify which partition Harry should visit, call `#visitPartition` or `#beginBatch` to start a visit to a particular partition with a single action or multiple actions.
+
+After that, the actions are self-explanatory: `#insert`, `#update`, `#deleteRow`, `#deleteColumns`, `#deleteRowRange`, `#deleteRowSlice`, and `#deletePartition`.
+
+After the history generated by `HistoryBuilder` is replayed using `ReplayingVisitor` (or by using a `ReplayingHistoryBuilder`, which combines the two for your convenience), you can use any model (`QuiescentChecker` by default) to validate queries. Queries can be provided manually or generated using `QueryGenerator` or `TypedQueryGenerator`.
+
+# Basic Terminology
+
+* Inflate / inflatable: the process of producing a value (for example, a string or a blob) from a `long` descriptor
+  that uniquely identifies the value. See the [data generation](https://github.com/apache/cassandra-harry#data-generation)
+  section of this guide for more details.
+* Deflate / deflatable: the process of producing, during verification, the descriptor a value was inflated from.
+  See the [model](https://github.com/apache/cassandra-harry#model) section of this guide for more details.
+
+For definitions of logical timestamp, descriptor, and other entities used during inflation and deflation, refer to the [formal relationships](https://github.com/apache/cassandra-harry#formal-relations-between-entities) section.
+
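The inflate/deflate relationship can be illustrated with a toy generator in plain Java (`DescriptorSketch` is an invented example; Harry's real generators are considerably more involved): inflation deterministically turns a `long` descriptor into a value, and deflation recovers the descriptor from the value, so `deflate(inflate(d)) == d` always holds.

```java
// Toy illustration of inflate/deflate: an invertible mapping between a long
// descriptor and an ascii-like string value. The invariant real generators
// preserve is the same: deflate(inflate(d)) == d.
class DescriptorSketch
{
    private static final char[] ALPHABET = "abcdefghijklmnop".toCharArray(); // 16 symbols = 4 bits each

    // Inflate: produce a 16-character string encoding all 64 bits of the descriptor.
    static String inflate(long descriptor)
    {
        char[] chars = new char[16];
        for (int i = 0; i < 16; i++)
            chars[i] = ALPHABET[(int) ((descriptor >>> (i * 4)) & 0xF)];
        return new String(chars);
    }

    // Deflate: recover the descriptor the value was inflated from.
    static long deflate(String value)
    {
        long descriptor = 0;
        for (int i = 0; i < 16; i++)
            descriptor |= ((long) (value.charAt(i) - 'a')) << (i * 4);
        return descriptor;
    }
}
```

Because verification works on descriptors rather than on the inflated values, the model never needs to store the generated data itself.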
+# Features
+
+Currently, Harry can exercise the following Cassandra functionality:
+
+* Supported data types: `int8`, `int16`, `int32`, `int64`, `boolean`, `float`,
`double`, `ascii`, `uuid`, `timestamp`.
+ Collections are only _inflatable_.
+* Random schema generation, with an arbitrary number of partition and clustering keys.
+* Schemas with arbitrary `CLUSTERING ORDER BY`.
+* Randomly generated `INSERT` and `UPDATE` queries with all columns or an arbitrary column subset.
+* Randomly generated `DELETE` queries: for a single column, a single row, or a range of rows.
+* Inflating and validating entire partitions (with allowed in-flight queries).
+* Inflating and validating random `SELECT` queries: single row, slices (with a single open end), and ranges (with both
+  ends of clusterings specified).
+
+Inflating partitions is done
+using
[Reconciler](https://github.com/apache/cassandra-harry/blob/master/harry-core/src/harry/reconciler/Reconciler.java).
+Validating partitions and random queries can be done
+using [Quiescent
Checker](https://github.com/apache/cassandra-harry/blob/master/harry-core/src/harry/model/QuiescentChecker.java)
+and [Exhaustive
Checker](https://github.com/apache/cassandra-harry/blob/master/harry-core/src/harry/model/ExhaustiveChecker.java).
Review Comment:
I've removed the navigation references for now. I think it should be
straightforward to search for sections. Maybe it would be good to split the doc
into segments/subdocs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]