moomman opened a new issue, #290:
URL: https://github.com/apache/shardingsphere-on-cloud/issues/290
# Chaos CRD Design document
# 1、Background
We need to introduce an automated chaos experiment flow into ShardingSphere (ss) to enhance the resilience and failure recovery ability of ss.
# 2、Problem description
Chaos experiments should be automated so that the work of building the experimental environment, running the injection flow, and verifying the results does not have to be duplicated for every experiment.
## 2.1 Question 1: How to inject
How can specific failure scenarios be injected into ss?
## 2.2 Question 2: How to generate pressure
How can a large number of specified requests be sent to ss-proxy during a failure to simulate a real production environment?
## 2.3 Question 3: How to verify the Result
How do we collect relevant information during the experiment and define the steady state, so as to prove whether the system remains in a steady state?
# 3、Technical research
Chaos Mesh and Litmus provide many kinds of chaos experiments, covering most usage scenarios. However, they only provide the ability to inject faults, while building the experimental environment and verifying the influence of faults on the steady state still have to be repeated in each experiment. Therefore, we need to define our own CRD to implement an automated experiment process for ss-proxy, and use kubebuilder to generate the skeleton code of the CRD.
| technology | address |
| --------------------------- | ------------------------------------------------------------ |
| Chaos Mesh API definition | [https://github.com/chaos-mesh/chaos-mesh](https://github.com/chaos-mesh/chaos-mesh) |
| kubebuilder | [https://github.com/kubernetes-sigs/kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) |
| Litmus chaos | [https://litmuschaos.io/](https://litmuschaos.io/) |
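As a sketch of what the kubebuilder scaffolding yields, the root type of the new CRD could look like the following (the type names anticipate the spec fields defined in section 4.4 and are illustrative, not final):

```go
// Package v1alpha1 contains the API types scaffolded by kubebuilder.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Spec and Status are placeholders here; their fields are designed in
// sections 4.4.1 and 4.4.2.
type ShardingSphereChaosSpec struct{}
type ShardingSphereChaosStatus struct{}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// ShardingSphereChaos is the root object of the proposed CRD.
type ShardingSphereChaos struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ShardingSphereChaosSpec   `json:"spec,omitempty"`
	Status ShardingSphereChaosStatus `json:"status,omitempty"`
}
```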
# 4、Scheme design
## 4.1 Program summary
### Injection:
To solve the problem of injecting faults into ss, the commonly used solutions are PingCAP's open-source Chaos Mesh and Litmus Chaos, which provide a variety of common fault types. However, for building an automated ss chaos scenario flow, they cannot be used directly because of the complexity and standalone nature of their configuration.
Chaos Mesh provides APIs for all of its CRD resource definitions, which makes it possible to simplify the operation: we can abstract our own chaos scenarios and interact with Chaos Mesh to obtain experiment information. For the implementation of this interaction, refer to Chaos Mesh's official Chaos Dashboard.
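For illustration, a minimal sketch of creating a PodFailure fault through the Chaos Mesh Go API, assuming the chaos-mesh v1alpha1 types and a controller-runtime client (the exact field layout differs between Chaos Mesh versions, so treat this as an assumption):

```go
package sschaos

import (
	"context"

	chaosmeshv1alpha1 "github.com/chaos-mesh/chaos-mesh/api/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// injectPodFailure creates a chaos-mesh PodChaos that makes the selected
// pods unavailable for the given duration.
func injectPodFailure(ctx context.Context, c client.Client) error {
	duration := "10s"
	podChaos := &chaosmeshv1alpha1.PodChaos{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "ss-proxy-pod-failure", // illustrative name
			Namespace: "verify-lit",
		},
		Spec: chaosmeshv1alpha1.PodChaosSpec{
			Action:   chaosmeshv1alpha1.PodFailureAction,
			Duration: &duration,
			ContainerSelector: chaosmeshv1alpha1.ContainerSelector{
				PodSelector: chaosmeshv1alpha1.PodSelector{
					Mode: chaosmeshv1alpha1.AllMode,
					Selector: chaosmeshv1alpha1.PodSelectorSpec{
						GenericSelectorSpec: chaosmeshv1alpha1.GenericSelectorSpec{
							Namespaces:     []string{"verify-lit"},
							LabelSelectors: map[string]string{"app.kubernetes.io/component": "zookeeper"},
						},
					},
				},
			},
		},
	}
	return c.Create(ctx, podChaos)
}
```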
### Generating pressure:
For the environment configuration and pressure, we can use DistSQL to send requests to ss-proxy, inject data into the environment, and use that data as the proof for verifying the steady state.
### Verification:
For steady-state verification, we can scrape the monitoring logs to observe whether CPU and network IO stay within the steady-state range, and verify the correctness of the requests made in the pressure phase through DistSQL.
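Since ss-proxy speaks the MySQL protocol, both the pressure requests and the verification queries can go through a standard SQL driver. A minimal sketch (the DSN and table name are placeholders):

```go
package sschaos

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL protocol driver used to reach ss-proxy
)

// countRows re-reads the data written during the pressure phase so it can be
// compared against the expected row count as steady-state proof.
func countRows(dsn, table string) (int, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	var n int
	if err := db.QueryRow(fmt.Sprintf("SELECT COUNT(*) FROM %s", table)).Scan(&n); err != nil {
		return 0, err
	}
	return n, nil
}
```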
## 4.2 Holistic design
- The chaos experiment for ss-proxy has the following parts:
  - Use DistSQL to specify the proxy-environment configuration and create the specified experimental environment.
  - Establish a steady-state hypothesis and declare a specific fault; ssChaos converts the fault into a Chaos Mesh fault, and Chaos Mesh injects the fault into the environment.
  - During the experiment, ssChaos puts the declared traffic-generating fields (`.spec.accountReq`) into jobs, and the jobs send traffic requests to the experimental environment.
  - After fault injection and the start of the experiment, ssChaos collects the data and indicators of the experimental environment as the criterion for judging whether the final state is steady.
- The components involved in this process are as follows:
| ComputeNode | ss-proxy, the object that upstream services interact with; it in turn interacts with the downstream databases |
| -------------------- | ------------------------------------------------------------ |
| StorageNode | The databases that ss connects to; the nodes that actually store the data |
| Governance node | Stores the status and configuration information of the ComputeNode, such as logical databases and logical tables |
| DistSQL | The operating language unique to Apache ShardingSphere. It is used in exactly the same way as standard SQL and provides SQL-level operational capabilities for incremental functionality. |
| proxy-environment | A fully functional ss-proxy environment |
| Chaos APIs | Provide different kinds of chaos experiments and are responsible for the actual injection and execution of faults |
| ssChaos Controller | Responsible for managing the created ssChaos resources |
## 4.3 Function design
It is functionally divided into three parts: fault injection, pressure generation, and verification; users can use these functions by defining CR declaration files.
### 4.3.1 Feature list
- Inject chaos
  Convert the fault declared by the user into the corresponding fault type in Chaos Mesh and inject it into the specified experimental environment.
- Generate pressure
  Inject traffic into the experimental environment.
- Verification
  Collect the CPU, network IO, and other important indicators and the program output of the experimental target, compare them with the steady-state condition, and verify the correctness of the traffic sent in the pressure generation stage.
## 4.4 CRD design
### 4.4.1 Spec
- Generating pressure
  It is used to specify the tool to be used and the configuration of the pressure requests (a Go sketch of these types follows the distSQL table below).
- jobSpec
| .injectJob.Experimental string | Specifies a pressure request that defines the steady state; it is executed before and after chaos injection |
| :------------------------------- | ------------------------------------------------------------ |
| .injectJob.Pressure string | Specifies the data update request that is executed during chaos injection |
* PressureCfg
| reqNum | Number of requests sent per reqTime window |
| --------------------- | --------------------------------------------- |
| concurrentNum | Number of concurrent requests |
| reqTime | Length of one request window |
| duration | Total duration of the pressure requests |
| zkHost | ZooKeeper connection address |
| ssHost | ShardingSphere connection address |
| script (optional) | Custom command script passed in by the user |
| distSQLs []distSQL | The DistSQLs to execute during the pressure phase |
* distSQL
| sql | The SQL to execute; use "?" to represent an argument |
| --------------- | ----------------------------------------------- |
| args []string | The arguments substituted into the SQL |
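Taken together, the pressure part of the spec could be expressed with Go types like the following sketch (field names follow the tables above; the json tags are assumptions):

```go
// DistSQL is one parameterized statement executed during the pressure phase.
type DistSQL struct {
	SQL  string   `json:"sql"`            // use "?" as the argument placeholder
	Args []string `json:"args,omitempty"` // arguments substituted into the SQL
}

// PressureCfg configures how pressure traffic is generated against ss-proxy.
type PressureCfg struct {
	ReqNum        int       `json:"reqNum"`           // requests sent per reqTime window
	ConcurrentNum int       `json:"concurrentNum"`    // number of concurrent senders
	ReqTime       string    `json:"reqTime"`          // length of one request window
	Duration      string    `json:"duration"`         // total duration of the pressure run
	ZkHost        string    `json:"zkHost,omitempty"` // ZooKeeper connection address
	SsHost        string    `json:"ssHost"`           // ShardingSphere connection address
	Script        string    `json:"script,omitempty"` // optional custom command script
	DistSQLs      []DistSQL `json:"distSQLs,omitempty"`
}
```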
- Injection fault
  `.spec.chaosKind` is used to specify the type of the injected fault.
  The common fault fields are configured in the spec. When using a fault provided by a specific platform, the platform type needs to be written in the annotations, and any fields not covered by the fault spec for that platform are also written in the annotations.
- Common configuration field
- Selector
Fault target selector
| namespaces | Specify namespaces |
| --------------------- | ----------------------------------- |
| labelSelectors | Select by label |
| annotationSelectors | Select by annotation |
| nodes | Specify nodes |
| pods | Specified as namespace/pod-name pairs |
| nodeSelectors | Select nodes by label |
- PodChaosSpec
  This part is declared in `spec.podChaos`.
  It defines pod-type faults; the action field declares the type of fault injected into the pod (a Go sketch of these types follows the table below).
| action | Specify the pod fault type, divided into podFailure, containerKill |
| ------------------------------ | ---------------------------------------------------------------------- |
| podFailure.Duration | Specify the effective time of the PodFailureAction |
| containerKill.containerNames | Specify the containers to be killed |
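A sketch of the corresponding Go types (the names mirror the table above and the example CR in section 4.5; they are illustrative):

```go
// PodChaosAction enumerates the pod fault types listed above.
type PodChaosAction string

const (
	PodFailure    PodChaosAction = "PodFailure"
	ContainerKill PodChaosAction = "ContainerKill"
)

// PodChaosSpec declares a pod-type fault under spec.podChaos.
type PodChaosSpec struct {
	Selector Selector       `json:"selector"` // fault target selector from the table above
	Action   PodChaosAction `json:"action"`
	Params   PodChaosParams `json:"params,omitempty"`
}

// PodChaosParams holds the per-action parameters; only the field matching
// Action is expected to be set.
type PodChaosParams struct {
	PodFailure    *PodFailureParams    `json:"podFailure,omitempty"`
	ContainerKill *ContainerKillParams `json:"containerKill,omitempty"`
}

type PodFailureParams struct {
	Duration string `json:"duration,omitempty"` // effective time of the PodFailureAction
}

type ContainerKillParams struct {
	ContainerNames []string `json:"containerNames,omitempty"` // containers to kill
}
```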
- networkSpec
  This part is declared in `.spec.networkChaos`.
  It defines network-type faults.
| Action | Define the network chaos type, divided into delay, duplicate, corrupt, partition, loss |
| ------------------------------------------------------ | ------------------------------------------------------------ |
| Duration | Specify the duration of the chaos |
| Direction | Specifies the direction of the network failure; defaults to `to` when not specified. Divided into to (-> target), from (target <-), and both (<-> target). |
| Target | Selector used to choose the target object |
| Source | Selector used to choose the source object |
| delay.latency<br/>delay.correlation<br/>delay.jitter | latency: the network latency<br/>correlation: the correlation between the current latency and the previous one<br/>jitter: the jitter range of the network latency |
| loss.loss<br/>loss.correlation | loss: the probability of packet loss<br/>correlation: the correlation between the current packet loss probability and the previous one |
| duplicate.duplicate<br/>duplicate.correlation | duplicate: the probability of packet duplication<br/>correlation: the correlation between the current duplication probability and the previous one |
| corrupt.corrupt<br/>corrupt.correlation | corrupt: the probability of packet corruption<br/>correlation: the correlation between the current corruption probability and the previous one |
- Specific configuration spec
This part needs to be declared in annotations or env (a sketch of reading these annotation overrides follows the platform tables below).
- chaos-mesh
| Configuration field of podchaos | spec/mode <-----> selector.mode<br/>spec/value <-----> selector.value<br/>spec/pod/action <-----> specify .action<br/>spec/pod/gracePeriod <-----> specify .gracePeriod |
| ------------------------------------- | ------------------------------------------------------------ |
| Configuration field of networkchaos | spec/device <-----> .device<br/>spec/targetDevice <-----> .targetDevice<br/>spec/target/mode <-----> .selector.mode<br/>spec/target/value <-----> .value<br/>spec/network/action <-----> specify .action<br/>spec/network/rate <-----> .bandwidth.rate<br/>spec/network/limit <-----> .bandwidth.limit<br/>spec/network/buffer <-----> .bandwidth.buffer<br/>spec/network/peakrate <-----> .bandwidth.peakrate<br/>spec/network/minburst <-----> .bandwidth.minburst |
- Litmus chaos
| Configuration field of podchaos | - pod-delete<br/>spec/random <-----> RANDOMNESS<br/>- container-kill<br/>spec/signal <-----> SIGNAL<br/>spec/chaos_interval <-----> CHAOS_INTERVAL |
| ------------------------------------- | ------------------------------------------------------------ |
| Configuration field of networkchaos | |
| Public field | spec/action <-----> .spec.experiments.name<br/>spec/ramp_time <-----> RAMP_TIME<br/>spec/duration <-----> TOTAL_CHAOS_DURATION<br/>spec/sequence <-----> SEQUENCE<br/>spec/lib_image <-----> LIB_IMAGE<br/>spec/lib <-----> LIB<br/>spec/force <-----> FORCE |
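A sketch of how those platform-specific fields could be collected at reconcile time, assuming the `spec/...` paths in the tables above are carried (in some encoded form) as annotation keys; the helper name is hypothetical:

```go
package sschaos

import "strings"

// platformOverrides extracts the platform-specific settings written into
// annotations, e.g. a "spec/pod/gracePeriod" entry for chaos-mesh, so that
// the converter for that platform can apply them on top of the common spec.
func platformOverrides(annotations map[string]string, prefix string) map[string]string {
	overrides := make(map[string]string)
	for key, value := range annotations {
		if strings.HasPrefix(key, prefix) {
			overrides[strings.TrimPrefix(key, prefix)] = value
		}
	}
	return overrides
}
```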
- Verification
  Collect logs and indicators based on the point in time when the fault is injected, and collect indicators in both the steady state and the fault state to determine whether the test passes.
  The verification is implemented as a controlled experiment, divided into a steady-state experimental group and a fault experimental group.
  Ideally, the only variable between the two groups is whether there is a fault in the experimental environment.
  Whether the results meet the expectations is judged by the steady-state fluctuation range we set and by the execution results of the pressure job.
<img width="1032" alt="image"
src="https://github.com/apache/shardingsphere-on-cloud/assets/85389467/768851f3-d5cf-4e4d-941c-42c9f18d5290">
As shown in the picture above, the specific process is as follows:
Steady state:
1. Create a pressure job.
2. Collect the contents of interest in the metrics logs and record them, waiting for the comparison with the fault metrics.
Failure:
3. Create a chaos fault.
4. Collect the metrics logs, compare them with the steady state, and record the results in the status.
One job is performed during the steady state and one during the fault.
After the chaos recovers, verify the execution result of the pressure job that ran while the fault was active and record it in the status.
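A minimal sketch of the steady-state comparison (the metric fields and tolerance are assumptions; in practice the indicators come from the collected monitoring logs):

```go
package sschaos

import "math"

// MetricsSnapshot holds the indicators collected over one observation window.
type MetricsSnapshot struct {
	CPUUsage  float64 // average CPU usage
	NetworkIO float64 // average network IO
}

// withinSteadyState reports whether the fault-phase metrics stay within the
// tolerated relative fluctuation around the steady-state baseline.
func withinSteadyState(steady, fault MetricsSnapshot, tolerance float64) bool {
	return relDiff(steady.CPUUsage, fault.CPUUsage) <= tolerance &&
		relDiff(steady.NetworkIO, fault.NetworkIO) <= tolerance
}

func relDiff(base, cur float64) float64 {
	if base == 0 {
		return math.Abs(cur)
	}
	return math.Abs(cur-base) / math.Abs(base)
}
```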
### 4.4.2 Status
- ChaosCondition
  This field records the progress of the chaos injection, which has the following five phases:
| Creating | Chaos is in the creation stage and the injection has not yet completed. |
| -------------- | ------------------------------------------------------------ |
| AllRecovered | The environment has recovered from the failure. |
| Paused | The experiment may be paused because the selected node does not exist; consider whether there is a problem with the CRD definition. |
| AllInjected | The fault has been successfully injected into the environment. |
| Unknown | Unknown status. |
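Expressed in Go, the condition can simply be a string enum (a sketch; the literal values are assumptions):

```go
// ChaosCondition tracks the injection progress of the declared chaos.
type ChaosCondition string

const (
	Creating     ChaosCondition = "Creating"
	AllInjected  ChaosCondition = "AllInjected"
	Paused       ChaosCondition = "Paused"
	AllRecovered ChaosCondition = "AllRecovered"
	Unknown      ChaosCondition = "Unknown"
)
```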
### 4.4.3 Controller design
- Overall logic of the controller
1. Convert the ssChaos object into the corresponding fault type in chaos-mesh and create it.
2. status
```Go
ChaosCondition ChaosCondition `json:"chaosCondition"`
Phase Phase `json:"phase"`
Result []Result `json:"result"`
```
* update `.status.ChaosCondition`: Chaos Mesh reports the progress of the current experiment by updating four condition types in its status; these are used as the basis for changes to `.status.ChaosCondition`.
The change logic is as follows:
Only after all the faults we currently care about in chaos-mesh have entered the AllInjected phase do we change our state from Creating to AllInjected.
For Paused, we should check whether the pods and containers we selected are running properly while the fault is paused.
When all faults are Recovered, we update our status to AllRecovered.
As described in the chaos-mesh documentation, the following also serves as the evaluation basis for updating the status (a sketch of this aggregation follows the figure):
<img width="1021" alt="image"
src="https://github.com/apache/shardingsphere-on-cloud/assets/85389467/16ececc0-8193-4764-bf76-237d9dd67f17">
* Pressure and verification at the different stages of `.status.ChaosCondition`
Data collection for the steady-state requests is performed before the fault is injected, and the specified requests are sent to the environment to collect data after injection (in the AllInjected state).
* update phase
<img width="1028" alt="image"
src="https://github.com/apache/shardingsphere-on-cloud/assets/85389467/6c8a5f99-2fb4-438b-b7fa-261b29b58d55">
BeforeReq -> AfterReq: the initial stage, in which the experiment job is created and pressure requests are injected into the environment. Logs, indicators, and the steady state are collected in this stage.
AfterReq -> Injected: this phase is entered after the log collection and the job have been executed successfully in the previous phase; fault injection is carried out here.
Injected -> Recovered: when the chaosCondition is AllInjected and the phase is AfterReq, the phase moves to Injected; the pressure job and experiment job are executed, logs and indicators are collected and compared with the steady state, and the comparison results are written back into the result.
Recovered: when the chaosCondition is AllRecovered and the phase is Injected, this stage is entered and the environment has recovered from the fault; verify the job execution, obtain the pod log of the job to check whether the pressure job succeeded, and write the result back into the result (a sketch of the phase enum follows).
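The phase itself can again be a small string enum (a sketch with assumed literals):

```go
// Phase tracks where the experiment is in the flow described above.
type Phase string

const (
	PhaseBeforeReq Phase = "BeforeReq"
	PhaseAfterReq  Phase = "AfterReq"
	PhaseInjected  Phase = "Injected"
	PhaseRecovered Phase = "Recovered"
)
```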
* Result
Two experimental results are recorded:

| Steady Msg | Steady-state phase result |
| ------------- | --------------------- |
| Chaos Msg | Chaos phase result |

* Msg

| Metrics (TODO) | Shows the metrics of interest |
| ----------------- | -------------------------------------------------- |
| Result string | The execution result of the pressure request |
| Duration string | The total execution time of the pressure request |
3. Platform extension
When you need to integrate more chaos platform APIs, the interfaces that need to be implemented for the pod and network types are as follows:
The `get/set` interfaces of chaos:
```go
type ChaosGetter interface {
	GetPodChaosByNamespacedName(context.Context, types.NamespacedName) (PodChaos, error)
	GetNetworkChaosByNamespacedName(context.Context, types.NamespacedName) (NetworkChaos, error)
}

type ChaosSetter interface {
}
```
The `update/create/new` interfaces of chaos:
```go
type ChaosHandler interface {
	NewPodChaos(ssChao *v1alpha1.ShardingSphereChaos) chaos.PodChaos
	NewNetworkPodChaos(ssChao *v1alpha1.ShardingSphereChaos) chaos.NetworkChaos
	UpdateNetworkChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.NetworkChaos) error
	UpdatePodChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.PodChaos) error
	CreatePodChaos(ctx context.Context, r client.Client, podChao chaos.PodChaos) error
	CreateNetworkChaos(ctx context.Context, r client.Client, networkChao chaos.NetworkChaos) error
}
```
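A sketch of how a new platform could be plugged in behind these interfaces: implement ChaosHandler for it and select the implementation by the platform name declared in the annotations (the annotation key and handler types here are hypothetical):

```go
// Hypothetical registry of ChaosHandler implementations, keyed by platform.
var handlers = map[string]ChaosHandler{
	"chaos-mesh": &chaosMeshHandler{}, // hypothetical chaos-mesh implementation
	"litmus":     &litmusHandler{},    // hypothetical Litmus implementation
}

// handlerFor picks the handler matching the platform named in the annotations.
func handlerFor(ssChaos *v1alpha1.ShardingSphereChaos) (ChaosHandler, bool) {
	h, ok := handlers[ssChaos.Annotations["chaos.shardingsphere.apache.org/platform"]]
	return h, ok
}
```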
## 4.5 Expected
### 4.5.1 Expected effect
Create a CR definition YAML file:
```yaml
apiVersion: shardingsphere.apache.org/v1alpha1
kind: ShardingSphereChaos
metadata:
  labels:
    app.kubernetes.io/name: shardingsphereChaos
  name: shardingspherechaos-lala
  namespace: verify-lit
  annotations:
    selector.chaos-mesh.org/mode: all
spec:
  podChaos:
    selector:
      labelSelectors:
        app.kubernetes.io/component: zookeeper
      namespaces: [ "verify-lit" ]
    action: PodFailure
    params:
      podFailure:
        duration: 10s
  pressureCfg:
    ssHost: root:14686Ban@tcp(127.0.0.1:3306)/ds_0
    duration: 10s
    reqTime: 5s
    distSQLs:
      - sql: select * from car;
    concurrentNum: 1
    reqNum: 2
```
After applying it, the chaos object is created successfully, and its status information can be observed in the cluster.
# 5、Demo
# 6、References
- [Chaos Mesh principle analysis and control-plane development](https://cloudnative.to/blog/chaos-engineering-with-kubernetes)
- [Chaos Mesh](https://chaos-mesh.org)
- [ShardingSphere](https://shardingsphere.apache.org/document/)
#272