Hi Community,

Currently we have unit tests, integration tests, and e2e tests to ensure the fault tolerance of APISIX. But some problems, such as network delay and CPU stress, are not covered by these tests. So it would be a good idea to introduce chaos engineering to simulate different types of faults and test how APISIX performs in these circumstances.
To deploy chaos engineering, ChaosMesh[1] could be a good choice for us. It has several advantages over other chaos engineering tools:

1. ChaosMesh is a CNCF sandbox project with quite an active community, which suggests the project will keep improving and that we can get help when needed.
2. ChaosMesh supports GitHub Actions, so once we set up the workflow for this integration, it will be easy to run the tests as part of our daily work.
3. ChaosMesh already supports most types of chaos and keeps adding more. Although we might not need that many for now, this will be useful when we decide to test more with it.

BTW, the chaos types ChaosMesh supports[2] for now (Nov. 16, 2020) include pod chaos, network chaos, stress chaos, IO chaos, time chaos, kernel chaos, HTTP chaos, and DNS chaos.

Following the principles of chaos engineering, there are two main things we need to care about:

1. what we should test, and
2. how to prove correctness after chaos injection.

For now, the problems we have encountered and need to simulate are:

1. the connection with etcd is unstable
2. etcd failure
3. problems when CPU/memory/disk is under stress

And the methods to test correctness include:

1. the error logs of Nginx and APISIX
2. whether the CPU/memory usage of APISIX is abnormally high
3. whether wrk benchmarking fails

You are welcome to suggest other problems or correctness checks that you think would be useful here~

[1] https://chaos-mesh.org/
[2] https://chaos-mesh.org/docs/chaos_experiments

Thanks,
Shuyang Wu
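
P.S. To make the first correctness check above (the error logs of Nginx and APISIX) a bit more concrete, here is a minimal sketch in Python. It assumes the standard Nginx error log line format ("YYYY/MM/DD HH:MM:SS [level] ..."). The function names and the choice of which levels count as severe are only hypothetical illustrations, not part of APISIX or ChaosMesh.

```python
import re

# Standard Nginx error-log prefix, e.g.
# "2020/11/16 12:00:00 [error] 1234#0: *1 message"
LEVEL_RE = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[(\w+)\]")

# Assumption: levels at "error" or above are treated as failures.
SEVERE = {"error", "crit", "alert", "emerg"}


def count_severe(lines):
    """Count log lines whose level is error or worse."""
    count = 0
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) in SEVERE:
            count += 1
    return count


def new_severe_entries(before, after):
    """Severe entries that appeared between two snapshots of the log."""
    return count_severe(after) - count_severe(before)
```

The idea would be to snapshot the log before chaos injection, snapshot it again afterwards, and fail the run if new_severe_entries() is above whatever threshold we agree on.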
