ifesdjeen commented on code in PR #3005:
URL: https://github.com/apache/cassandra/pull/3005#discussion_r1434529331
##########
test/fuzz/main/README.md:
##########
@@ -0,0 +1,658 @@
+# Harry, a fuzz testing tool for Apache Cassandra
+
+The project aims to generate _reproducible_ workloads that are as close to
real-life as possible, while being able to
+_efficiently_ verify the cluster state against the model without pausing the
workload itself.
+
+## Getting Started in under 5 minutes
+
+Harry can operate as a straightforward read/write "correctness stress tool" that verifies that reads from a cluster are consistent with what it knows it wrote to the cluster. You have a couple of options for this.
+
+### Option 2: Running things manually, lower in the stack
+
+The Makefile has a stress target where you can directly access all available ARGS rather than restricting yourself to the convenience script above. If you're using an external cluster (i.e. `./bin/cassandra -f`, CCM, Docker, Kubernetes, or just a deployed cluster), the mini stress tool can be used directly as follows:
+
+To start a concurrent read/write workload, with 2 read and 2 write threads running for 2 minutes against an in-JVM cluster, you can use the following code:
+
+```
+try (Cluster cluster = builder().withNodes(3)
+                                .start())
+{
+    SchemaSpec schema = new SchemaSpec("harry", "test_table",
+                                       asList(pk("pk1", asciiType), pk("pk2", int64Type)),
+                                       asList(ck("ck1", asciiType), ck("ck2", int64Type)),
+                                       asList(regularColumn("regular1", asciiType), regularColumn("regular2", int64Type)),
+                                       asList(staticColumn("static1", asciiType), staticColumn("static2", int64Type)));
+
+    Configuration config = HarryHelper.defaultConfiguration()
+                                      .setKeyspaceDdl(String.format("CREATE KEYSPACE IF NOT EXISTS %s WITH replication = {'class': 'SimpleStrategy', 'replication_factor': %d};", schema.keyspace, 3))
+                                      .setSUT(() -> new InJvmSut(cluster))
+                                      .build();
+
+    Run run = config.createRun();
+
+    concurrent(run, config,
+               asList(pool("Writer", 2, MutatingVisitor::new),
+                      pool("Reader", 2, RandomPartitionValidator::new)),
+               2, TimeUnit.MINUTES)
+    .run();
+}
+```
+
+# I've found a falsification. What now?
+
+There is no one-size-fits-all solution for debugging a falsification. We did try to create a shrinker, but unfortunately, without the Simulator, the shrinker only works for issues that are non-concurrent in nature, since there is no way to create a stable repro otherwise. That said, there are several things that might get you started and inspire further ideas about how to debug the issue.
+
+First of all, determine whether the issue is likely to be concurrent in nature. If you re-run your test with the same seed but see no falsification, and it fails only sporadically, often at a different logical timestamp, it is likely that the issue is, in fact, concurrent. It is important to note that when you run a concurrent read/write workload, you will get a different interleaving of reads and writes every time. If you have reason to think that you're seeing the falsification because a read has queried a specific partition state, try re-running your test with the sequential runner (`--write-before-read` if you are using ministress).
+
+If you can get a stable repro with a sequential runner, you're in luck. Now all you need to do is add logging everywhere and work out what is causing the issue. But even if you do not have a stable repro, you will likely follow the same steps:
+
+* Inspect the error itself. Do the results returned by Cassandra make sense? Is anything out of order? Are there any
+  duplicates or gaps?
+* Switch to the logging mutating visitor and closely inspect its output. Closely inspect the output of the model. Do
+  the values make sense?
+* Check the output of the data tracker. Does the model or Cassandra have missing columns or rows? Do these outputs
+  contain the latest logical timestamps for each of the operations from the log? How about in-flight operations?
+* Filter out the relevant operation log entries and inspect them closely. Given these operations, does the output of
+  the model, or the output of the database, make the most sense?
+
+Next, you might want to try to narrow down the scope of the problem. Depending
on what the falsification looks like, use
+your Cassandra knowledge to see what might apply in your situation:
+
+* Try changing the schema to use different column types and see whether the issue still reproduces.
+* Try disabling range deletes, regular deletes, or column deletes.
+* Try changing the partition size and see if the issue still reproduces.
+* Try disabling static columns.
+
+Rather than listing every feature in Harry, suffice it to say that you should enable/disable features that make sense in the given context, looking either for a combination that avoids the failure, or for a minimal combination that still reproduces the issue. Your first goal should be to find a _stable repro_, even if it involves modifying Cassandra or Harry, or taking the operations and composing the repro manually. Having a stable repro will make finding the cause much simpler. Sometimes you will find the cause before you have a stable repro, in which case you _still_ have to produce a stable repro, both to make things simpler for the reviewer and to include it in the test suite of your patch.
+
+Lastly, *be patient*. Debugging falsifications is often a multi-hour
endeavour, and things do not always jump out at you,
+so you might have to spend a significant amount of time tracking the problem
down. Once you have found it, it is very
+rewarding.
+
+## Further Reading
+* [Harry: An open-source fuzz testing and verification tool for Apache
Cassandra](https://cassandra.apache.org/_/blog/Harry-an-Open-Source-Fuzz-Testing-and-Verification-Tool-for-Apache-Cassandra.html)
+
+---
+# Technical and Implementation Details
+
+## System Under Test implementations
+
+* `in_jvm/InJvmSut` - simple in-JVM-dtest system under test.
+* `println/PrintlnSut` - system under test that prints to stdout instead of executing queries on the cluster; useful
+  for debugging.
+* `mixed_in_jvm/MixedVersionInJvmSut` - in-JVM-dtest system under test that
works with mixed version clusters.
+* `external/ExternalClusterSut` - system under test that works with CCM, Docker, Kubernetes, or a cluster you may have
+  deployed elsewhere.
+
+Both in-JVM SUTs have fault-injecting functionality available.
+
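The SUT abstraction can be pictured as a small interface. The sketch below is illustrative plain Java, not Harry's actual API; the names `SutSketch` and `PrintlnSutSketch` are invented here. The point it makes is that anything able to execute a statement can serve as a system under test, which is what makes a println-based debugging SUT possible.

```java
import java.util.List;

// Illustrative sketch of the system-under-test abstraction: anything that can
// execute a CQL statement qualifies, including a stdout-only debugging stub.
interface SutSketch
{
    Object[][] execute(String statement, Object... bindings);
}

// Analogue of a println-style SUT: prints statements instead of executing them.
class PrintlnSutSketch implements SutSketch
{
    @Override
    public Object[][] execute(String statement, Object... bindings)
    {
        System.out.println(statement + " " + List.of(bindings));
        return new Object[0][]; // no cluster behind it, so no results
    }
}
```

Swapping the SUT implementation is then a one-line change in the configuration, which is why the same workload can run against in-JVM, external, or stdout-only targets.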
+## Visitors
+
+* `single/SingleValidator` - visitor that runs several different read queries against a single partition associated
+  with the current logical timestamp, and validates their results using the given model.
+* `all_partitions/AllPartitionsValidator` - concurrently validates all
partitions that were visited during this run.
+* `repair_and_validate_local_states/RepairingLocalStateValidator` - similar to
`AllPartitionsValidator`, but performs
+ repair before checking node states.
+* `mutating/MutatingVisitor` - visitor that performs all sorts of mutations.
+* `logging/LoggingVisitor` - similar to `MutatingVisitor`, but also logs all
operations to a file; useful for debug
+ purposes.
+* `corrupting/CorruptingVisitor` - visitor that will deliberately change data
in the partition it visits. Useful for
+ negative tests (i.e. to ensure that your model actually detects data
inconsistencies).
+
+And more.
+
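All of these visitors share one simple shape. The following is an illustrative sketch in plain Java rather than Harry's actual interface (`VisitorSketch` and `LoggingVisitorSketch` are invented names): each visitor is handed a logical timestamp and decides what to do with the corresponding operation, be it mutate, validate, log, or corrupt.

```java
// Illustrative sketch: every visitor is driven by logical timestamps, and a
// runner decides how often and in what order visitors are invoked. A logging
// visitor wraps another visitor and records each visit before delegating.
interface VisitorSketch
{
    void visit(long lts);
}

class LoggingVisitorSketch implements VisitorSketch
{
    private final VisitorSketch delegate;
    private final StringBuilder log = new StringBuilder();

    LoggingVisitorSketch(VisitorSketch delegate) { this.delegate = delegate; }

    @Override
    public void visit(long lts)
    {
        log.append("visiting lts ").append(lts).append('\n'); // record, then delegate
        delegate.visit(lts);
    }

    String log() { return log.toString(); }
}
```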
+## Models
+
+* `querying_no_op/QueryingNoOpValidator` - a model that can be used to
"simply" run random queries.
+* `quiescent_checker/QuiescentChecker` - a model that can be used to verify results of any read that has no writes
+  to the same partition concurrent to it. Should be used in conjunction with the locking data tracker.
+* `quiescent_local_state_checker/QuiescentLocalStateChecker` - a model that can check the local state of each replica
+  that has to own the partition being validated.
+
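The core idea behind quiescent checking can be sketched in a few lines of plain Java (`QuiescentCheckSketch` is an illustrative name, not Harry's actual class): when no writes to the partition are in flight, the rows the model expects must match the rows the database returned, in the same order.

```java
import java.util.List;
import java.util.Objects;

// Illustrative sketch of quiescent checking: with no in-flight writes to the
// partition, the model's expected rows must equal the observed rows exactly.
class QuiescentCheckSketch
{
    static void check(List<String> expected, List<String> observed)
    {
        if (expected.size() != observed.size())
            throw new AssertionError("Row count mismatch: expected " + expected.size()
                                     + " but observed " + observed.size());
        for (int i = 0; i < expected.size(); i++)
            if (!Objects.equals(expected.get(i), observed.get(i)))
                throw new AssertionError("Mismatch at row " + i + ": expected "
                                         + expected.get(i) + " but observed " + observed.get(i));
    }
}
```

This is also why the quiescent checkers need a locking data tracker: the equality check is only sound once concurrent writers to the partition are excluded.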
+## Runners
+
+* `sequential/SequentialRunner` - runs all visitors sequentially, in a loop, for a specified amount of time; useful
+  for simple tests that do not need to exercise the concurrent read/write path.
+* `concurrent/ConcurrentRunner` - runs all visitors concurrently, each visitor in its own thread, looped, for a
+  specified amount of time; useful for things like concurrent read/write workloads.
+* `chain/ChainRunner` - receives other runners as input and runs them one after another, once. Useful for both simple
+  and complex scenarios that involve read/write workloads, validating all partitions, and exercising other node-local
+  or cluster-wide operations.
+* `staged/StagedRunner` - receives other runners (stages) as input, and runs
them one after another in a loop; useful
+ for implementing complex scenarios, such as read/write workloads followed by
some cluster changing operations.
+
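To make the difference between chained and staged execution concrete, here is a minimal sketch in plain Java (`RunnerSketch` is an invented name, not one of Harry's runner classes): a chain runs its delegates once in order, while a staged runner repeats the whole chain until a deadline passes.

```java
import java.util.List;

// Illustrative sketch: a "chain" runs each step once in order; a "staged"
// runner repeats the whole chain until the deadline passes.
class RunnerSketch
{
    static void chain(List<Runnable> steps)
    {
        for (Runnable step : steps)
            step.run();
    }

    static void staged(List<Runnable> stages, long durationMillis)
    {
        long deadline = System.currentTimeMillis() + durationMillis;
        while (System.currentTimeMillis() < deadline)
            chain(stages);
    }
}
```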
+## Clock
+
+* `approximate_monotonic/ApproximateMonotonicClock` - a timestamp supplier implementation that tries to keep as close
+  to real time as possible, while preserving a mapping from real time to logical timestamps.
+* `offset/OffsetClock` - a (monotonic) clock that supplies timestamps that do
not have any relation to real time.
+
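The offset-clock idea can be sketched in plain Java (`OffsetClockSketch` is illustrative, not `OffsetClock`'s actual implementation): logical timestamps come from a monotonically increasing counter, and "real" timestamps are derived from them by a fixed offset, never from wall time, which keeps replays reproducible.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a monotonic logical clock: real timestamps are
// derived from the logical counter by a fixed offset, never from wall time.
class OffsetClockSketch
{
    private final AtomicLong lts = new AtomicLong();
    private final long offset;

    OffsetClockSketch(long offset) { this.offset = offset; }

    long nextLts() { return lts.incrementAndGet(); } // next logical timestamp
    long rts(long lts) { return offset + lts; }      // derived "real" timestamp
}
```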
+# Introduction
+
+Harry has two primary modes of functionality:
+
+* Unit test mode: in which you define specific sequences of
+ operations and let Harry test these operations using different
+ schemas and conditions.
+* Exploratory/fuzz mode: in which you define distributions of events
+  rather than the sequences themselves, and let Harry try out
+  different things.
+
+Usually, in unit-test mode, we apply several write operations to the cluster and then run different read queries and validate their results. To learn more about writing unit tests, refer to the "Writing Unit Tests" section.
+
+In exploratory mode, we continuously apply write operations to the cluster and validate its state, allowing the data size to grow and simulating real-life behaviour. To learn more about implementing test cases using fuzz mode, refer to the "Implementing Tests" section of this guide; it's likely you'll have to read the rest of this document to implement more complex scenarios.
+
+# Writing Unit Tests
+
+To write unit tests with Harry, no special knowledge is required. Usually, unit tests are written by simply hardcoding the schema, writing several modification statements one after the other, and then manually validating the results of a `SELECT` query. This might work for simple scenarios, but there's still a chance that for some other schema, or some combination of values, the tested feature may not work.
+
+To improve the situation, we can express the test in more abstract terms and,
instead of writing it using specific
+statements, we can describe which statement _types_ are to be used:
+
+```
+test(new SchemaGenerators.Builder("harry")
+ .partitionKeySpec(1, 5)
+ .clusteringKeySpec(1, 5)
+ .regularColumnSpec(1, 10)
+ .generator(),
+ historyBuilder -> {
+ historyBuilder.insert();
+ historyBuilder.deletePartition();
+ historyBuilder.deleteRowSlice();
+ });
+```
+
+This spec can be used to generate clusters of different sizes, configured with different schemas, executing the given sequence of actions both in isolation, and combined with other randomly generated ones, with failure injection.
+
+Best of all, this test will _not only_ ensure that such a sequence of actions does not produce an exception, but will also ensure that the cluster responds with correct results to _any_ allowed read query.
+
+To begin specifying operations for a new partition, either start calling methods on the `HistoryBuilder` directly, or, if you would like to specify which partition Harry should visit, call `#visitPartition` or `#beginBatch` to start a visit to a particular partition with a single action or multiple actions.
+
+After that, the actions are self-explanatory: `#insert`, `#update`, `#deleteRow`, `#deleteColumns`, `#deleteRowRange`, `#deleteRowSlice`, and `#deletePartition`.
+
+After the history generated by `HistoryBuilder` is replayed using `ReplayingVisitor` (or by using a `ReplayingHistoryBuilder`, which combines the two for your convenience), you can use any model (`QuiescentChecker` by default) to validate queries. Queries can be provided manually or generated using `QueryGenerator` or `TypedQueryGenerator`.
+
+# Basic Terminology
+
+* Inflate / inflatable: the process of producing a value (for example, a string or a blob) from a `long` descriptor
+  that uniquely identifies the value. See the [data generation](https://github.com/apache/cassandra-harry#data-generation)
+  section of this guide for more details.
+* Deflate / deflatable: the process of producing, during verification, the descriptor a value was inflated from.
+  See the [model](https://github.com/apache/cassandra-harry#model) section of this guide for more details.
+
+For definitions of logical timestamp, descriptor, and other entities used during inflation and deflation, refer to the [formal relationships](https://github.com/apache/cassandra-harry#formal-relations-between-entities) section.
+
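The inflate/deflate relationship can be illustrated with a toy generator in plain Java (`DescriptorSketch` is an invented example; Harry's real generators are considerably more involved): inflation deterministically turns a `long` descriptor into a value, and deflation recovers the descriptor from the value, so `deflate(inflate(d)) == d` always holds.

```java
// Toy illustration of inflate/deflate: an invertible mapping between a long
// descriptor and an ascii-like string value. The invariant real generators
// preserve is the same: deflate(inflate(d)) == d.
class DescriptorSketch
{
    private static final char[] ALPHABET = "abcdefghijklmnop".toCharArray(); // 16 symbols = 4 bits each

    // Inflate: produce a 16-character string encoding all 64 bits of the descriptor.
    static String inflate(long descriptor)
    {
        char[] chars = new char[16];
        for (int i = 0; i < 16; i++)
            chars[i] = ALPHABET[(int) ((descriptor >>> (i * 4)) & 0xF)];
        return new String(chars);
    }

    // Deflate: recover the descriptor the value was inflated from.
    static long deflate(String value)
    {
        long descriptor = 0;
        for (int i = 0; i < 16; i++)
            descriptor |= ((long) (value.charAt(i) - 'a')) << (i * 4);
        return descriptor;
    }
}
```

Because verification works on descriptors rather than on the inflated values, the model never needs to store the generated data itself.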
+# Features
+
+Currently, Harry can exercise the following Cassandra functionality:
+
+* Supported data types: `int8`, `int16`, `int32`, `int64`, `boolean`, `float`,
`double`, `ascii`, `uuid`, `timestamp`.
+ Collections are only _inflatable_.
+* Random schema generation, with an arbitrary number of partition and clustering keys.
+* Schemas with arbitrary `CLUSTERING ORDER BY`.
+* Randomly generated `INSERT` and `UPDATE` queries with all columns or an arbitrary column subset.
+* Randomly generated `DELETE` queries: for a single column, a single row, or a range of rows.
+* Inflating and validating entire partitions (with allowed in-flight queries).
+* Inflating and validating random `SELECT` queries: single row, slices (with a single open end), and ranges (with both
+  ends of clusterings specified).
+
+Inflating partitions is done
+using
[Reconciler](https://github.com/apache/cassandra-harry/blob/master/harry-core/src/harry/reconciler/Reconciler.java).
+Validating partitions and random queries can be done
+using [Quiescent
Checker](https://github.com/apache/cassandra-harry/blob/master/harry-core/src/harry/model/QuiescentChecker.java)
+and [Exhaustive
Checker](https://github.com/apache/cassandra-harry/blob/master/harry-core/src/harry/model/ExhaustiveChecker.java).
Review Comment:
I've removed the navigation references for now. I think it should be
straightforward to search for sections. Maybe it would be good to split the doc
into segments/subdocs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]