[ 
https://issues.apache.org/jira/browse/CASSANDRA-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict Elliott Smith updated CASSANDRA-15348:
-----------------------------------------------
    Description: 
h2. Description:

This ticket introduces Harry, a component for fuzz testing and verification of 
Apache Cassandra clusters at scale. 

h2. Motivation: 

Current test tooling largely covers common and edge cases, and most of 
the tests use predefined datasets. Property-based tests can help explore a 
broader range of states, but often require either a complex model or a large 
state to test against.

h2. What problems Harry solves:

Harry makes it possible to run tests that validate the state of both dense 
nodes (to test the local read-write path) and large clusters (to test the 
distributed read-write path), and to do so efficiently. Its main goals, which 
also set it apart from other testing tools, are:

 * The state required for verification should remain as compact as possible.
 * The verification process itself should be as performant as possible.
 * Ideally, it should be possible to verify database state while _continuing_ 
to run state-changing queries against it.

h2. What Harry does: 

To achieve this, Harry defines a model that holds the state of the database, 
generators that produce reproducible, pseudo-random schemas, mutations, and 
queries, and a validator that asserts the correctness of the model following 
execution of generated traffic.

h2. Harry consists of multiple reusable components:

 * Generator library: how to create a library of invertible, order-preserving 
generators for simple and composite data types.
 * Model and checker: how to use the properties of generators to validate the 
output of an eventually consistent database in linear time.
 * Runner library: how to create a scheme for reproducible runs, despite the 
concurrent nature of the database and the fuzzer itself.
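
One way to keep runs reproducible despite concurrency is to derive each 
operation's descriptor purely from a seed and the operation's sequence number, 
so the result does not depend on which thread asks first. The sketch below 
assumes that approach. The class and method names are hypothetical, and the 
mixing function is the widely used SplitMix64 finalizer, which is not 
necessarily what Harry itself uses.

```java
final class ReproducibleDescriptors {
    // Odd constant, so distinct sequence numbers map to distinct inputs (mod 2^64).
    private static final long GOLDEN_GAMMA = 0x9E3779B97F4A7C15L;

    // SplitMix64 finalizer: xor-shifts and odd multiplications, a bijection on longs.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Stateless: the descriptor of the n-th operation depends only on (seed, n),
    // never on shared mutable state or thread scheduling.
    static long descriptorAt(long seed, long sequenceNumber) {
        return mix(seed + sequenceNumber * GOLDEN_GAMMA);
    }
}
```

Because descriptorAt is a pure function, any thread can recompute the 
descriptor of any past operation, so a failing run can be replayed from the 
seed alone.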

h2. Short and somewhat approximate description of how Harry achieves this:

Generation and validation define strict mathematical relations between the 
generated values and the pseudorandom numbers they were generated from. Using these 
properties, we can store minimal state and check if these properties hold 
during validation.

Since Cassandra stores data in rows, we should be able to "inflate" the data 
for a row to be inserted into the database from a single number, which we call 
a _descriptor_. Each value in a row read from the database can be "deflated" 
back to the descriptor it was generated from. This way, to precisely verify the 
state of a row, we only need to know the descriptor it was generated from and 
the timestamp at which it was inserted.
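
As a minimal sketch of the inflate/deflate idea, the code below builds a 
two-column row from one descriptor and recovers the descriptor from the values 
read back. Everything here is an assumption made for illustration: the class 
name, the two-column layout, the salts, and the xor-and-rotate mapping, which 
is invertible but far simpler than a production generator would be.

```java
final class DescriptorCodec {
    // Hypothetical per-column salts so each column gets a different value.
    private static final long[] COLUMN_SALTS = { 0x9E3779B97F4A7C15L, 0xC2B2AE3D27D4EB4FL };

    // Invertible: xor with a constant, then rotate.
    static long inflateLong(long descriptor, int column) {
        return Long.rotateLeft(descriptor ^ COLUMN_SALTS[column], 17);
    }

    // Exact inverse of inflateLong: rotate back, then xor with the same constant.
    static long deflateLong(long value, int column) {
        return Long.rotateRight(value, 17) ^ COLUMN_SALTS[column];
    }

    // "Inflate" a whole row (one bigint column, one text column) from one number.
    static Object[] inflateRow(long descriptor) {
        return new Object[] {
            inflateLong(descriptor, 0),
            Long.toHexString(inflateLong(descriptor, 1))
        };
    }

    // "Deflate" every value; all columns must agree on the descriptor.
    static long deflateRow(Object[] row) {
        long fromLong = deflateLong((Long) row[0], 0);
        long fromText = deflateLong(Long.parseUnsignedLong((String) row[1], 16), 1);
        if (fromLong != fromText) {
            throw new IllegalStateException("columns were not generated from one descriptor");
        }
        return fromLong;
    }
}
```

With such a mapping, the checker only needs to remember one long per write: 
deflateRow(inflateRow(d)) returns d for every d, and a row whose columns 
disagree is reported as inconsistent.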

Similarly, keys for the inserted row can be "inflated" from a single 64-bit 
integer, and then "deflated" back to it. To efficiently search for keys, while 
allowing range scans, our generation scheme preserves the order of the original 
64-bit integer. Any two keys generated from two 64-bit integers sort in the 
same order as the integers themselves.
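
The order-preserving property can be illustrated with a single text key. The 
sketch below is an assumption for illustration only (hypothetical class name, 
a fixed-width hex encoding rather than Harry's actual key generators): it flips 
the sign bit so that signed 64-bit order matches the lexicographic order of the 
resulting strings.

```java
final class OrderedKeys {
    // Flipping the sign bit maps signed order onto unsigned order; fixed-width,
    // zero-padded lowercase hex then sorts lexicographically in that same order.
    static String inflate(long keyDescriptor) {
        return String.format("%016x", keyDescriptor ^ Long.MIN_VALUE);
    }

    // Inverse of inflate: parse the unsigned hex value and flip the sign bit back.
    static long deflate(String key) {
        return Long.parseUnsignedLong(key, 16) ^ Long.MIN_VALUE;
    }
}
```

Because comparing two inflated keys gives the same answer as comparing their 
descriptors, a range scan over keys corresponds to a contiguous range of 64-bit 
integers in the model.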

This way, to validate the state of a range of rows queried from the 
database, it is sufficient to "deflate" their key and data values, use the 
deflated 64-bit key representations to find all descriptors these rows were 
generated from, and ensure that the given sequence of descriptors could have 
resulted in the state that the database has responded with.
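
A heavily simplified version of this check might look as follows. It assumes 
whole-row writes, last-write-wins resolution by timestamp, no deletes, and no 
writes in flight during the read; the class and method names are hypothetical, 
and the real model has to handle far more cases.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CompactModel {
    // key descriptor -> (write timestamp -> value descriptor); only longs are stored.
    private final NavigableMap<Long, NavigableMap<Long, Long>> writes = new TreeMap<>();

    void recordWrite(long key, long timestamp, long descriptor) {
        writes.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, descriptor);
    }

    // deflatedRows: {key descriptor, value descriptor} pairs, already deflated,
    // in the key order returned by a range scan over [fromKey, toKey].
    boolean validateRange(long fromKey, long toKey, List<long[]> deflatedRows) {
        NavigableMap<Long, NavigableMap<Long, Long>> expected =
            writes.subMap(fromKey, true, toKey, true);
        if (expected.size() != deflatedRows.size()) {
            return false; // a row is missing or unexpected
        }
        Iterator<long[]> actual = deflatedRows.iterator();
        for (Map.Entry<Long, NavigableMap<Long, Long>> e : expected.entrySet()) {
            long[] row = actual.next();
            // Under last-write-wins, the visible value is the one with the highest timestamp.
            long latest = e.getValue().lastEntry().getValue();
            if (row[0] != e.getKey() || row[1] != latest) {
                return false;
            }
        }
        return true;
    }
}
```

The check is a single pass over the returned rows and the matching slice of 
the sorted map, and the model never holds inflated values.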

Using this scheme, we keep the minimum possible amount of data per row, can 
efficiently generate the data, and can backtrack values to the numbers they were 
generated from. Most of the time, we operate on 64-bit integer values and only 
use "inflated" objects when running queries against database state, minimizing 
the amount of required memory.

h2. Name: 

Harry (verb). 

According to Merriam-Webster: 
  * to torment by or as if by constant attack
  * persistently carry out attacks on (an enemy or an enemy's territory)

  was:
h2. Description:

This ticket introduces Harry, a component for fuzz testing and verification of 
Apache Cassandra clusters at scale. 

h2. Motivation: 

Current test tooling largely covers common and edge cases, and most of 
the tests use predefined datasets. Property-based tests can help explore a 
broader range of states, but often require either a complex model or a large 
state to test against.

h2. What problems Harry solves:

Harry makes it possible to run tests that validate the state of both dense 
nodes (to test the local read-write path) and large clusters (to test the 
distributed read-write path), and to do so efficiently. Its main goals, which 
also set it apart from other testing tools, are:

 * The state required for verification should remain as compact as possible.
 * The verification process itself should be as performant as possible.
 * Ideally, it should be possible to verify database state while _continuing_ 
to run state-changing queries against it.

h2. What Harry does: 

To achieve this, Harry defines a model that holds the state of the database, 
generators that produce reproducible, pseudo-random schemas, mutations, and 
queries, and a validator that asserts the correctness of the model following 
execution of generated traffic.

h2. Harry consists of multiple reusable components:

 * Generator library: how to create a library of invertible, order-preserving 
generators for simple and composite data types.
 * Model and checker: how to use the properties of generators to validate the 
output of an eventually consistent database in linear time.
 * Runner library: how to create a scheme for reproducible runs, despite the 
concurrent nature of the database and the fuzzer itself.

h2. Short and somewhat approximate description of how Harry achieves this:

Generation and validation define strict mathematical relations between the 
generated values and the pseudorandom numbers they were generated from. Using these 
properties, we can store minimal state and check if these properties hold 
during validation.

Since Cassandra stores data in rows, we should be able to "inflate" the data 
for a row to be inserted into the database from a single number, which we call 
a _descriptor_. Each value in a row read from the database can be "deflated" 
back to the descriptor it was generated from. This way, to precisely verify the 
state of a row, we only need to know the descriptor it was generated from and 
the timestamp at which it was inserted.

Similarly, keys for the inserted row can be "inflated" from a single 64-bit 
integer, and then "deflated" back to it. To efficiently search for keys, while 
allowing range scans, our generation scheme preserves the order of the original 
64-bit integer. Any two keys generated from two 64-bit integers sort in the 
same order as the integers themselves.

This way, to validate the state of a range of rows queried from the 
database, it is sufficient to "deflate" their key and data values, use the 
deflated 64-bit key representations to find all descriptors these rows were 
generated from, and ensure that the given sequence of descriptors could have 
resulted in the state that the database has responded with.

Using this scheme, we keep the minimum possible amount of data per row, can 
efficiently generate the data, and can backtrack values to the numbers they were 
generated from. Most of the time, we operate on 64-bit integer values and only 
use "inflated" objects when running queries against database state, minimizing 
the amount of required memory.

h2. Name: 

Harry (verb). 

According to Merriam-Webster: 
  * to torment by or as if by constant attack
  * to make a pillaging or destructive raid on
  * persistently carry out attacks on (an enemy or an enemy's territory)


> Harry: generator library and extensible framework for fuzz testing Apache 
> Cassandra
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15348
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal


