moomman commented on issue #290:
URL: 
https://github.com/apache/shardingsphere-on-cloud/issues/290#issuecomment-1495561560

   
   
   
   
   > # ShardingSphereChaos CRD Design document
   > # 1、Background
   > An automated chaos experiment flow should be introduced into ShardingSphere (ss) to improve its resilience and its ability to recover from failures.
   > 
   > # 2、Problem description
   > Chaos experiments should be automated so that setting up the experimental environment, running the injection flow, and verifying the results do not have to be repeated by hand for every experiment.
   > 
   > ## 2.1 Question 1: How to inject
   > How can specific failure scenarios be introduced into ss?
   > 
   > ## 2.2 Question 2: How to generate pressure
   > How can a large number of specified requests be sent to ss-proxy during a failure to simulate a real production environment?
   > 
   > ## 2.3 Question 3: How to verify the Result
   > During the experiment, how do we collect the relevant information and define the steady state, so that we can prove whether the system stays in it?
   > 
   > # 3、Technical research
   > Chaos Mesh and Litmus each provide many kinds of chaos experiments, covering most usage scenarios. However, they only inject faults; setting up the experimental environment and verifying the effect of faults on the steady state still have to be repeated for every experiment. We therefore need to define our own CRD to automate the experiment process for ss-proxy, using kubebuilder to generate the CRD skeleton code.
   > 
   > Reference links:
   > 
   > * Chaos Mesh API definition: https://github.com/chaos-mesh/chaos-mesh
   > * kubebuilder: https://github.com/kubernetes-sigs/kubebuilder
   > * Litmus chaos: https://litmuschaos.io/
   > 
   > # 4、Scheme design
   > ## 4.1 Program summary
   > ### injection:
   > To inject faults into ss, the commonly used solutions are Chaos Mesh (open-sourced by PingCAP) and Litmus Chaos, both of which provide a variety of common fault types. However, they cannot be used directly to build an automated chaos scenario flow for ss, because their configuration is complex and standalone. Chaos Mesh exposes an API for all of its CRD resource definitions, which makes it possible to simplify the operation: we can abstract our own chaos scenarios and interact with Chaos Mesh to obtain experiment information. For the implementation of this interaction, refer to Chaos Mesh's official Chaos Dashboard.
   > 
   > ### Generating pressure:
   > To configure the environment and generate pressure, use DistSQL to send requests to ss-proxy and inject data into the environment. This data later serves as evidence when verifying the steady state.
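   > 
   > A minimal sketch of how the pressure phase could be driven, assuming a fixed pool of workers. `generatePressure`, `requestFunc`, and `pressureResult` are hypothetical names and not part of ShardingSphere; in a real run the request function would execute a DistSQL or SQL statement against ss-proxy, which is replaced here by a stub that fails every tenth request.
   > 
   > ```go
   > package main
   > 
   > import (
   > 	"context"
   > 	"errors"
   > 	"fmt"
   > 	"sync"
   > 	"sync/atomic"
   > )
   > 
   > // requestFunc stands in for a single request to ss-proxy (hypothetical).
   > type requestFunc func(ctx context.Context, id int) error
   > 
   > // pressureResult counts how many requests succeeded and failed.
   > type pressureResult struct {
   > 	Succeeded int64
   > 	Failed    int64
   > }
   > 
   > // generatePressure sends up to total requests using the given number of
   > // workers and stops feeding new requests once ctx is cancelled.
   > func generatePressure(ctx context.Context, workers, total int, send requestFunc) pressureResult {
   > 	if workers < 1 {
   > 		workers = 1
   > 	}
   > 	var res pressureResult
   > 	var wg sync.WaitGroup
   > 	ids := make(chan int)
   > 
   > 	for w := 0; w < workers; w++ {
   > 		wg.Add(1)
   > 		go func() {
   > 			defer wg.Done()
   > 			for id := range ids {
   > 				if err := send(ctx, id); err != nil {
   > 					atomic.AddInt64(&res.Failed, 1)
   > 				} else {
   > 					atomic.AddInt64(&res.Succeeded, 1)
   > 				}
   > 			}
   > 		}()
   > 	}
   > 
   > feed:
   > 	for id := 0; id < total; id++ {
   > 		select {
   > 		case ids <- id:
   > 		case <-ctx.Done():
   > 			break feed
   > 		}
   > 	}
   > 	close(ids)
   > 	wg.Wait()
   > 	return res
   > }
   > 
   > func main() {
   > 	// Stub request: every tenth request fails, as if a fault were active.
   > 	send := func(ctx context.Context, id int) error {
   > 		if id%10 == 0 {
   > 			return errors.New("injected failure")
   > 		}
   > 		return nil
   > 	}
   > 	res := generatePressure(context.Background(), 4, 100, send)
   > 	fmt.Printf("succeeded=%d failed=%d\n", res.Succeeded, res.Failed)
   > }
   > ```
   > 
   > The success and failure counts recorded here are the kind of evidence the verification phase can later compare against the data actually stored behind ss-proxy.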
   > 
   > ### Verification:
   > To verify the steady state, we can collect the monitoring logs to observe whether CPU and network I/O fluctuate within the steady-state range, and use DistSQL to verify the correctness of the requests sent during the pressure phase.
   > 
   > ## 4.2 Holistic design
   > * The chaos experiment for ss-proxy consists of the following parts:
   >   
   >   * Use DistSQL to configure the proxy-environment and create the specified experimental environment.
   >   * Establish a steady-state hypothesis and declare a specific fault. ssChaos converts the fault into a Chaos Mesh fault, and Chaos Mesh injects it into the environment.
   >   * During the experiment, ssChaos puts the declared traffic-generating fields (`.spec.accountReq`) into jobs, and the jobs send traffic requests to the experimental environment.
   >   * After fault injection and the start of the experiment, ssChaos collects data and metrics from the experimental environment as the criteria for judging whether the system remains in a steady state.
   > * The specific process is as follows:
   > 
   > <img alt="image" width="822" src="https://user-images.githubusercontent.com/85389467/229732327-6fe762ca-12a0-4929-9b9f-519294395bea.png">
   > 
   > | Term | Description |
   > | --- | --- |
   > | ComputeNode | ss-proxy; the object that upstream services interact with, and which in turn interacts with the downstream databases |
   > | StorageNode | A database connected to ss; the node that actually stores the data |
   > | Governance node | Stores status and configuration information of the ComputeNode, such as logical databases and logical tables |
   > | DistSQL | The operating language specific to Apache ShardingSphere. It is used in exactly the same way as standard SQL and provides SQL-level operational capabilities for incremental functionality |
   > | proxy-environment | A fully functional ss-proxy environment |
   > | Chaos APIs | Provide different kinds of chaos experiments and are responsible for the actual injection and execution of faults |
   > | ssChaos Controller | Manages the created ssChaos resources |
   > 
   > ## 4.3 Function design
   > The design is functionally divided into three parts: fault injection, pressure generation, and verification. Users access these functions by defining CR manifest files.
   > 
   > ### 4.3.1 Feature list
   > * Injection chaos
   >   Convert the fault declared by the user into a Chaos Mesh fault type and inject it into the specified experimental environment.
   > * Generating pressure
   >   Inject traffic into the experimental environment.
   > * Verification
   >   Collect key metrics of the experimental target, such as CPU and network I/O, together with the program output, compare them with the steady-state conditions, and verify the correctness of the traffic sent during the pressure phase.
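   > 
   > One way the comparison against steady-state conditions could look is sketched below. The design does not define how the steady state is expressed, so the baseline-plus-relative-tolerance model and the names `Sample`, `SteadyState`, and `Holds` are assumptions made for illustration only.
   > 
   > ```go
   > package main
   > 
   > import (
   > 	"fmt"
   > 	"math"
   > )
   > 
   > // Sample is one observation collected during the experiment (hypothetical shape).
   > type Sample struct {
   > 	CPUPercent  float64
   > 	NetworkMBps float64
   > }
   > 
   > // SteadyState is an assumed definition: a baseline plus a relative tolerance.
   > type SteadyState struct {
   > 	Baseline  Sample
   > 	Tolerance float64 // 0.25 allows a deviation of 25% from the baseline
   > }
   > 
   > // within reports whether value deviates from baseline by at most tolerance.
   > func within(value, baseline, tolerance float64) bool {
   > 	return math.Abs(value-baseline) <= tolerance*math.Abs(baseline)
   > }
   > 
   > // Holds reports whether every sample stays inside the tolerance band and,
   > // if not, the index of the first sample that violates it (-1 otherwise).
   > func (s SteadyState) Holds(samples []Sample) (bool, int) {
   > 	for i, sm := range samples {
   > 		if !within(sm.CPUPercent, s.Baseline.CPUPercent, s.Tolerance) ||
   > 			!within(sm.NetworkMBps, s.Baseline.NetworkMBps, s.Tolerance) {
   > 			return false, i
   > 		}
   > 	}
   > 	return true, -1
   > }
   > 
   > func main() {
   > 	steady := SteadyState{
   > 		Baseline:  Sample{CPUPercent: 40, NetworkMBps: 10},
   > 		Tolerance: 0.25,
   > 	}
   > 	samples := []Sample{
   > 		{CPUPercent: 42, NetworkMBps: 11},   // inside the band
   > 		{CPUPercent: 70, NetworkMBps: 10.5}, // CPU spike outside the band
   > 	}
   > 	ok, idx := steady.Holds(samples)
   > 	fmt.Printf("steady=%v firstViolation=%d\n", ok, idx)
   > }
   > ```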
   > 
   > ## 4.4 CRD design
   > ### 4.4.1 Spec
   > * Injection fault
   >   `.spec.chaosKind` specifies the type of fault to inject. Fields common to all platforms are configured in the spec. When using a fault provided by a specific platform, write the platform type in the annotations, together with any platform-specific fields that the common fault spec does not cover.
   >   
   >   * Common configuration fields
   >     
   >     * Selector
   >       Fault target selector
   > 
   > | Field | Description |
   > | --- | --- |
   > | namespaces | Specify namespaces |
   > | labelSelectors | Specify selection labels |
   > | annotationSelectors | Specify selection annotations |
   > | nodes | Specify nodes |
   > | pods | Specify pods by namespace and pod name |
   > | nodeSelectors | Select nodes by label |
   > 
   > * PodChaosSpec
   > 
   > This part is declared in `.spec.podChaos`. It defines a pod-type fault, and the action field declares the type of fault that is injected into the pod.
   > 
   > | Field | Description |
   > | --- | --- |
   > | action | Specify the pod fault type: podFailure or containerKill |
   > | podFailure.Duration | Specify the effective time of the PodFailureAction |
   > | containerKill.containerNames | Specify the containers to be killed |
   > 
   > * networkSpec
   > 
   > This part is declared in `.spec.networkChaos`. It defines network-type faults.
   > 
   > | Field | Description |
   > | --- | --- |
   > | Action | Define the network chaos type: delay, duplicate, corrupt, partition, or loss |
   > | Duration | Specify the duration of chaos |
   > | Direction | Specify the direction of the network fault: to (-> target), from (target <-), or both (<-> target). Defaults to `to` when not specified |
   > | target | Selector used to select the target object |
   > | Source | Selector used to select the source object |
   > | delay.latency | Indicates the network latency |
   > | delay.correlation | Indicates the correlation between the current latency and the previous one |
   > | delay.jitter | Indicates the range of the network latency |
   > | loss.loss | Indicates the probability of packet loss |
   > | loss.correlation | Indicates the correlation between the probability of current packet loss and the previous time's packet loss |
   > | duplicate.duplicate | Indicates the probability of packet duplicating |
   > | duplicate.correlation | Indicates the correlation between the probability of current packet duplicating and the previous time's packet duplicating |
   > | corrupt.corrupt | Indicates the probability of packet corruption |
   > | corrupt.correlation | Indicates the correlation between the probability of current packet corruption and the previous time's packet corruption |
   > 
   > * Specific configuration spec
   >   This part needs to be declared in annotations or env
   >   
   >   * chaos-mesh
   > 
   > | Chaos kind | Annotation key | Chaos Mesh field |
   > | --- | --- | --- |
   > | podchaos | `spec/mode` | `selector.mode` |
   > | podchaos | `spec/value` | `selector.value` |
   > | podchaos | `spec/pod/action` | `.action` |
   > | podchaos | `spec/pod/gracePeriod` | `.gracePeriod` |
   > | networkchaos | `spec/device` | `.device` |
   > | networkchaos | `spec/targetDevice` | `.targetDevice` |
   > | networkchaos | `spec/target/mode` | `.selector.mode` |
   > | networkchaos | `spec/target/value` | `.value` |
   > | networkchaos | `spec/network/action` | `.action` |
   > | networkchaos | `spec/network/rate` | `.bandwidth.rate` |
   > | networkchaos | `spec/network/limit` | `.bandwidth.limit` |
   > | networkchaos | `spec/network/buffer` | `.bandwidth.buffer` |
   > | networkchaos | `spec/network/peakrate` | `.bandwidth.peakrate` |
   > | networkchaos | `spec/network/minburst` | `.bandwidth.minburst` |
   > 
   >   * Litmus chaos
   > 
   > | Scope | Annotation key | Litmus field |
   > | --- | --- | --- |
   > | podchaos (pod-delete) | `spec/random` | `RANDOMNESS` |
   > | podchaos (container-kill) | `spec/signal` | `SIGNAL` |
   > | podchaos (container-kill) | `spec/chaos_interval` | `CHAOS_INTERVAL` |
   > | networkchaos | | |
   > | Public field | `spec/action` | `.spec.experiments.name` |
   > | Public field | `spec/ramp_time` | `RAMP_TIME` |
   > | Public field | `spec/duration` | `TOTAL_CHAOS_DURATION` |
   > | Public field | `spec/sequence` | `SEQUENCE` |
   > | Public field | `spec/lib_image` | `LIB_IMAGE` |
   > | Public field | `spec/lib` | `LIB` |
   > | Public field | `spec/force` | `FORCE` |
   > 
   > * Generating pressure
   > * Verification
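   > 
   > The annotation-based, platform-specific configuration described in this section could be read roughly as in the sketch below, shown for the Chaos Mesh podchaos keys (`spec/mode`, `spec/value`, `spec/pod/action`, `spec/pod/gracePeriod`). The type `podChaosOptions` and the function `podChaosOptionsFromAnnotations` are hypothetical, and treating `gracePeriod` as an integer number of seconds is an assumption.
   > 
   > ```go
   > package main
   > 
   > import (
   > 	"fmt"
   > 	"strconv"
   > )
   > 
   > // podChaosOptions holds the Chaos Mesh specific PodChaos settings that the
   > // design carries in annotations (hypothetical type for illustration).
   > type podChaosOptions struct {
   > 	Mode        string // from "spec/mode"  -> selector.mode
   > 	Value       string // from "spec/value" -> selector.value
   > 	Action      string // from "spec/pod/action"
   > 	GracePeriod int64  // from "spec/pod/gracePeriod", assumed to be seconds
   > }
   > 
   > // podChaosOptionsFromAnnotations extracts the options from CR annotations.
   > // Missing keys leave the corresponding field at its zero value.
   > func podChaosOptionsFromAnnotations(annotations map[string]string) (podChaosOptions, error) {
   > 	opts := podChaosOptions{
   > 		Mode:   annotations["spec/mode"],
   > 		Value:  annotations["spec/value"],
   > 		Action: annotations["spec/pod/action"],
   > 	}
   > 	if raw, ok := annotations["spec/pod/gracePeriod"]; ok {
   > 		gp, err := strconv.ParseInt(raw, 10, 64)
   > 		if err != nil {
   > 			return podChaosOptions{}, fmt.Errorf("invalid spec/pod/gracePeriod %q: %w", raw, err)
   > 		}
   > 		opts.GracePeriod = gp
   > 	}
   > 	return opts, nil
   > }
   > 
   > func main() {
   > 	annotations := map[string]string{
   > 		"spec/mode":            "all",
   > 		"spec/pod/gracePeriod": "30",
   > 	}
   > 	opts, err := podChaosOptionsFromAnnotations(annotations)
   > 	if err != nil {
   > 		fmt.Println("error:", err)
   > 		return
   > 	}
   > 	fmt.Printf("%+v\n", opts)
   > }
   > ```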
   > 
   > ### 4.4.2 Status
   > * DeploymentCondition
   >   This field records the progress of the chaos injection, which has the following four phases:
   > 
   > | Phase | Description |
   > | --- | --- |
   > | Creating | The chaos is in the creation stage and the injection has not yet completed |
   > | AllRecovered | The environment has recovered from the failure |
   > | Paused | The experiment may be paused because the selected node does not exist. Consider whether there is a problem with the CR definition |
   > | AllInjected | The fault has been successfully injected into the environment |
   > 
   > ### 4.4.3 Controller design
   > * Overall logic of the controller
   > 
   > 1. Convert the ssChaos into the corresponding chaos-mesh fault type and create it.
   >    According to the `.spec.EmbedChaos` declaration of the ssChaos, create the corresponding Chaos Mesh object and set `.status.DeploymentCondition` to the `Creating` state.
   > 2. Status
   > 
   > * Update `.status.DeploymentCondition`
   >   
   >   * For chaos-mesh
   >     Chaos Mesh indicates the progress of the current experiment by updating the status of four condition types. These are used as the basis for changing `.status.DeploymentCondition`.
   >     The change logic is as follows:
   >     Once all the relevant faults in chaos-mesh have entered the AllInjected phase, we change our state from `Creating` to `AllInjected`.
   > 
   > When a fault is `Paused`, we should check whether the selected pods and containers are working properly. When all faults are `Recovered`, we update our status to `AllRecovered`. The conditions described in the chaos-mesh documentation are also used as a basis for updating the status.
   > 
   > <img alt="image" width="822" src="https://user-images.githubusercontent.com/85389467/229732576-ccb42d2b-12ba-444d-9709-378003b98211.png">
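   > 
   > The change logic for `.status.DeploymentCondition` can be sketched as a small aggregation over the observed faults. The `faultState` struct is a simplified, hypothetical view of a Chaos Mesh experiment's conditions, and the precedence used here (paused first, then recovered, then injected) is an assumption rather than something the design fixes.
   > 
   > ```go
   > package main
   > 
   > import "fmt"
   > 
   > // Condition mirrors the four phases of .status.DeploymentCondition.
   > type Condition string
   > 
   > const (
   > 	Creating     Condition = "Creating"
   > 	AllInjected  Condition = "AllInjected"
   > 	AllRecovered Condition = "AllRecovered"
   > 	Paused       Condition = "Paused"
   > )
   > 
   > // faultState is a simplified view of one Chaos Mesh experiment (hypothetical).
   > type faultState struct {
   > 	Injected  bool
   > 	Recovered bool
   > 	Paused    bool
   > }
   > 
   > // nextCondition derives the ssChaos condition from all observed faults.
   > // With no faults, or when the faults disagree, the current condition is kept.
   > func nextCondition(current Condition, faults []faultState) Condition {
   > 	if len(faults) == 0 {
   > 		return current
   > 	}
   > 	allInjected, allRecovered := true, true
   > 	for _, f := range faults {
   > 		if f.Paused {
   > 			return Paused // a single paused fault pauses the experiment
   > 		}
   > 		allInjected = allInjected && f.Injected
   > 		allRecovered = allRecovered && f.Recovered
   > 	}
   > 	switch {
   > 	case allRecovered:
   > 		return AllRecovered
   > 	case allInjected:
   > 		return AllInjected
   > 	default:
   > 		return current
   > 	}
   > }
   > 
   > func main() {
   > 	faults := []faultState{{Injected: true}, {Injected: true}}
   > 	fmt.Println(nextCondition(Creating, faults))
   > }
   > ```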
   > 
   > * Pressure and verification at different stages of `.status.DeploymentCondition`
   > 
   >   1. AllInjected
   >      The fault has been injected at this point, so the pressure operation and data collection should be carried out.
   >   2. AllRecovered
   >      Run verification and check whether the operations performed during the pressure period completed correctly.
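   > 
   > The stage-dependent behaviour above could be dispatched as in the following sketch. The `experimentHooks` struct and `runStage` function are hypothetical; the hooks stand in for the real pressure, collection, and verification logic.
   > 
   > ```go
   > package main
   > 
   > import "fmt"
   > 
   > // Condition mirrors the phases of .status.DeploymentCondition.
   > type Condition string
   > 
   > const (
   > 	Creating     Condition = "Creating"
   > 	AllInjected  Condition = "AllInjected"
   > 	AllRecovered Condition = "AllRecovered"
   > )
   > 
   > // experimentHooks groups the actions the controller triggers (hypothetical).
   > type experimentHooks struct {
   > 	GeneratePressure func() error
   > 	CollectMetrics   func() error
   > 	Verify           func() error
   > }
   > 
   > // runStage runs the actions that belong to the given condition.
   > // Conditions without associated actions are a no-op.
   > func runStage(cond Condition, h experimentHooks) error {
   > 	switch cond {
   > 	case AllInjected:
   > 		if err := h.GeneratePressure(); err != nil {
   > 			return fmt.Errorf("generate pressure: %w", err)
   > 		}
   > 		if err := h.CollectMetrics(); err != nil {
   > 			return fmt.Errorf("collect metrics: %w", err)
   > 		}
   > 	case AllRecovered:
   > 		if err := h.Verify(); err != nil {
   > 			return fmt.Errorf("verify: %w", err)
   > 		}
   > 	}
   > 	return nil
   > }
   > 
   > func main() {
   > 	hooks := experimentHooks{
   > 		GeneratePressure: func() error { fmt.Println("sending traffic"); return nil },
   > 		CollectMetrics:   func() error { fmt.Println("collecting metrics"); return nil },
   > 		Verify:           func() error { fmt.Println("verifying results"); return nil },
   > 	}
   > 	for _, cond := range []Condition{Creating, AllInjected, AllRecovered} {
   > 		if err := runStage(cond, hooks); err != nil {
   > 			fmt.Println("stage failed:", err)
   > 		}
   > 	}
   > }
   > ```
   > 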
   > 3. Extended platform
   >    To support more chaos platform APIs, implement the following interfaces for the pod and network fault types:
   > 
   > The `get/set` interfaces for chaos objects:
   > 
   > ```go
   > // ChaosGetter looks up pod and network chaos objects by namespaced name.
   > type ChaosGetter interface {
   > 	GetPodChaosByNamespacedName(context.Context, types.NamespacedName) (PodChaos, error)
   > 	GetNetworkChaosByNamespacedName(context.Context, types.NamespacedName) (NetworkChaos, error)
   > }
   > 
   > // ChaosSetter has no methods yet in this draft.
   > type ChaosSetter interface {
   > }
   > ```
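   > 
   > To illustrate what a platform extension has to provide, here is a hedged, in-memory implementation of a getter with the same method names. `NamespacedName`, `PodChaos`, and `NetworkChaos` are local stand-ins declared so that the sketch compiles without Kubernetes dependencies; `memoryGetter` and `errNotFound` are hypothetical and useful mainly for unit tests.
   > 
   > ```go
   > package main
   > 
   > import (
   > 	"context"
   > 	"errors"
   > 	"fmt"
   > )
   > 
   > // Local stand-ins for the real types (assumptions for this sketch only).
   > type NamespacedName struct{ Namespace, Name string }
   > type PodChaos interface{}
   > type NetworkChaos interface{}
   > 
   > // ChaosGetter repeats the method set from the design with the stand-in types.
   > type ChaosGetter interface {
   > 	GetPodChaosByNamespacedName(context.Context, NamespacedName) (PodChaos, error)
   > 	GetNetworkChaosByNamespacedName(context.Context, NamespacedName) (NetworkChaos, error)
   > }
   > 
   > var errNotFound = errors.New("chaos object not found")
   > 
   > // memoryGetter serves chaos objects from maps instead of a cluster.
   > type memoryGetter struct {
   > 	pods     map[NamespacedName]PodChaos
   > 	networks map[NamespacedName]NetworkChaos
   > }
   > 
   > func (m memoryGetter) GetPodChaosByNamespacedName(_ context.Context, key NamespacedName) (PodChaos, error) {
   > 	if c, ok := m.pods[key]; ok {
   > 		return c, nil
   > 	}
   > 	return nil, fmt.Errorf("pod chaos %s/%s: %w", key.Namespace, key.Name, errNotFound)
   > }
   > 
   > func (m memoryGetter) GetNetworkChaosByNamespacedName(_ context.Context, key NamespacedName) (NetworkChaos, error) {
   > 	if c, ok := m.networks[key]; ok {
   > 		return c, nil
   > 	}
   > 	return nil, fmt.Errorf("network chaos %s/%s: %w", key.Namespace, key.Name, errNotFound)
   > }
   > 
   > // Compile-time check that memoryGetter satisfies the interface.
   > var _ ChaosGetter = memoryGetter{}
   > 
   > func main() {
   > 	key := NamespacedName{Namespace: "mesh-test", Name: "shardingspherechaos-lala"}
   > 	getter := memoryGetter{pods: map[NamespacedName]PodChaos{key: "pod-failure"}}
   > 
   > 	pod, err := getter.GetPodChaosByNamespacedName(context.Background(), key)
   > 	fmt.Println(pod, err)
   > 
   > 	_, err = getter.GetNetworkChaosByNamespacedName(context.Background(), key)
   > 	fmt.Println(errors.Is(err, errNotFound))
   > }
   > ```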
   > 
   > The `update/create/New` interfaces for chaos objects:
   > 
   > ```go
   > // ChaosHandler builds, creates, and updates platform chaos objects for an ssChaos.
   > type ChaosHandler interface {
   > 	// New* build a platform chaos object from the ssChaos declaration.
   > 	NewPodChaos(ssChao *v1alpha1.ShardingSphereChaos) chaos.PodChaos
   > 	NewNetworkPodChaos(ssChao *v1alpha1.ShardingSphereChaos) chaos.NetworkChaos
   > 	// Update* take the ssChaos together with the current platform object (cur).
   > 	UpdateNetworkChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.NetworkChaos) error
   > 	UpdatePodChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.PodChaos) error
   > 	// Create* submit a newly built chaos object through the client.
   > 	CreatePodChaos(ctx context.Context, r client.Client, podChao chaos.PodChaos) error
   > 	CreateNetworkChaos(ctx context.Context, r client.Client, networkChao chaos.NetworkChaos) error
   > }
   > ```
   > 
   > ## 4.5 Expected
   > ### 4.5.1 Expected effect
   > Create a YAML definition file for the CR:
   > 
   > ```yaml
   > apiVersion: shardingsphere.apache.org/v1alpha1
   > kind: ShardingSphereChaos
   > metadata:
   >   labels:
   >     app.kubernetes.io/name: shardingsphereChaos
   >   name: shardingspherechaos-lala
   >   annotations:
   >     spec/mode: all
   > spec:
   >   chaosKind: podChaos
   >   podChaos:
   >     selector:
   >       labelSelectors:
   >         app.kubernetes.io/component: zookeeper-new
   >       namespaces: [ "mesh-test" ]
   >     podFailure:
   >       duration: "1m"
   >     action: "podFailure"
   > ```
   > 
   > After the manifest is applied, the chaos object is created successfully and the following information is displayed:
   > 
   > ![](static/FWZcbXMwboQYJNxqGifcasrEnZe.png)
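   > 
   > For orientation, the `spec` of the manifest in this section could map onto Go types roughly as follows. The field names are taken from the YAML keys; the Go type names, the use of pointers for optional sections, and the `PodFailureDuration` helper are assumptions, and the sketch decodes the JSON equivalent of the spec so that it only needs the standard library.
   > 
   > ```go
   > package main
   > 
   > import (
   > 	"encoding/json"
   > 	"errors"
   > 	"fmt"
   > 	"time"
   > )
   > 
   > // Selector mirrors the selector keys used in the example manifest.
   > type Selector struct {
   > 	Namespaces     []string          `json:"namespaces,omitempty"`
   > 	LabelSelectors map[string]string `json:"labelSelectors,omitempty"`
   > }
   > 
   > // PodFailure carries the duration of a podFailure action.
   > type PodFailure struct {
   > 	Duration string `json:"duration,omitempty"`
   > }
   > 
   > // PodChaosSpec mirrors .spec.podChaos.
   > type PodChaosSpec struct {
   > 	Selector   Selector    `json:"selector"`
   > 	Action     string      `json:"action"`
   > 	PodFailure *PodFailure `json:"podFailure,omitempty"`
   > }
   > 
   > // ShardingSphereChaosSpec mirrors .spec (only the fields used in the example).
   > type ShardingSphereChaosSpec struct {
   > 	ChaosKind string        `json:"chaosKind"`
   > 	PodChaos  *PodChaosSpec `json:"podChaos,omitempty"`
   > }
   > 
   > // PodFailureDuration parses podFailure.duration, for example "1m".
   > func (s ShardingSphereChaosSpec) PodFailureDuration() (time.Duration, error) {
   > 	if s.PodChaos == nil || s.PodChaos.PodFailure == nil {
   > 		return 0, errors.New("spec has no podFailure section")
   > 	}
   > 	return time.ParseDuration(s.PodChaos.PodFailure.Duration)
   > }
   > 
   > // exampleSpec is the JSON form of the spec section of the example manifest.
   > const exampleSpec = `{
   >   "chaosKind": "podChaos",
   >   "podChaos": {
   >     "selector": {
   >       "labelSelectors": {"app.kubernetes.io/component": "zookeeper-new"},
   >       "namespaces": ["mesh-test"]
   >     },
   >     "podFailure": {"duration": "1m"},
   >     "action": "podFailure"
   >   }
   > }`
   > 
   > func main() {
   > 	var spec ShardingSphereChaosSpec
   > 	if err := json.Unmarshal([]byte(exampleSpec), &spec); err != nil {
   > 		fmt.Println("decode error:", err)
   > 		return
   > 	}
   > 	d, err := spec.PodFailureDuration()
   > 	fmt.Println(spec.ChaosKind, spec.PodChaos.Action, d, err)
   > }
   > ```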
   > 
   > # 5、Demo
   > # 6、References
   > * [Chaos Mesh principle analysis and control plane development](https://cloudnative.to/blog/chaos-engineering-with-kubernetes)
   > * chaos-mesh.org
   > * [ShardingSphere](https://shardingsphere.apache.org/document/)
   
   [issue-272](https://github.com/apache/shardingsphere-on-cloud/issues/272)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
