Implementing a demo script is useful, agreed. +1

>   By 1.20 (Wed): finish writing the demo script, and present the metrics
>   of APISIX with Grafana

I have some doubts: why do we need to use Grafana here? If this is done in CI,
it seems easier to access Prometheus directly.

On Mon, Jan 18, 2021 at 11:45 PM Shuyang Wu <[email protected]> wrote:

> Hi Community,
>
> It's a bit of a shame and awkward to resume this feature this late ;( But
> gladly I have some new thoughts about it:
>
> After some more investigation of how people make use of chaos engineering,
> to see how things are going after certain chaos takes effect, it would be
> better to use Prometheus/Grafana to plot the metrics of APISIX performance,
> rather than only focusing on nginx logs. Also, since chaos is more about
> mocking problems faced in production, directly using monitoring tools
> could let us see what users are facing.
>
> To use Prometheus, we need a demo to run basic functions of APISIX, like a
> certain amount of traffic, and new rules set at a certain time interval. It
> seems we do not have that kind of demo, so I plan to write a simple
> script to implement these features.
>
> With monitoring tools and the demo, we could then easily run different
> kinds of chaos, and see how things go. When we find something
> interesting and useful, we could then standardize it, write a test case for
> the scenario, and put it into CI. Given the earlier experiments, verifying a
> certain case is not that hard, so what we should focus more on is finding
> those interesting scenarios.
>
> A rough time plan would be:
>   By 1.20 (Wed): finish writing the demo script, and present the metrics
>   of APISIX with Grafana
>   By 1.22 (Fri): apply network chaos and see how APISIX works without
>   etcd. Better to test with different chaos cases
>   By 1.24 (Sun): write a test case about the network chaos, and run it
>   on CI
>   Future: more chaos cases!
>
> The most uncertain part for me is the demo: I'm unsure whether we already
> have that kind of demo, and if we don't, about some details of writing the
> script (like what is normal traffic for APISIX).
> Any suggestions are welcome!!
>
> Best,
> Shuyang
>
> Shuyang Wu <[email protected]> wrote on Mon, Nov 16, 2020 at 12:44 PM:
>
> > Hi Community,
> >
> > Nowadays, we have unit tests, integration tests, and e2e tests, to ensure
> > the fault tolerance of APISIX. But there are still some problems, like
> > network delay and CPU stress, that have not been covered by the above
> > tests. Thus, it would be a good idea to introduce chaos engineering, to
> > simulate different types of faults, and test the performance of APISIX in
> > these circumstances.
> >
> > To deploy chaos engineering, ChaosMesh[1] could be a good choice for us.
> > It has several benefits over other chaos engineering tools:
> >
> >    1. ChaosMesh is a CNCF sandbox project and has quite an active
> >    community, which suggests the project will keep improving and that we
> >    could get help when needed.
> >    2. ChaosMesh supports GitHub Actions, so once we set up the workflow for
> >    this integration, it would be easy to run the tests in our daily work.
> >    3. ChaosMesh already supports most types of chaos and is adding more.
> >    Although we might not need that many for now, it is a good point when
> >    we decide to test more with it.
> >    BTW, the chaos types ChaosMesh supports[2] for now (Nov. 16, 2020)
> >    include pod chaos, network chaos, stress chaos, IO chaos, time chaos,
> >    kernel chaos, HTTP chaos, and DNS chaos.
> >
> > Following the principles of chaos engineering, there are two main parts we
> > need to care about: 1. what we should test and 2. how to prove
> > correctness after chaos injection.
> >
> > As for what we have so far, the current problems we encounter and need to
> > simulate are:
> >
> >    1. the connection with etcd is unstable
> >    2. etcd failure
> >    3. problems when CPU/memory/disk is stressed out
> >
> > And the methods to test correctness include:
> >
> >    1. error logs of Nginx and APISIX
> >    2. whether CPU/memory use of APISIX is abnormally high
> >    3. whether wrk benchmarking would fail
> >
> > You are welcome to suggest other problems or correctness checks that you
> > might find useful for this~
> >
> > [1] https://chaos-mesh.org/
> > [2] https://chaos-mesh.org/docs/chaos_experiments
> >
> > Thanks,
> > Shuyang Wu

--
*MembPhis*
My GitHub: https://github.com/membphis
Apache APISIX: https://github.com/apache/apisix
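
For reference, a rough sketch of what "access Prometheus directly" from a CI job might look like: parse the text exposition format and assert on a few values, with no Grafana involved. The metric names (`apisix_etcd_reachable`, `apisix_http_status`), the labels, the sample numbers, and the 5% error threshold are all assumptions for illustration, not captured APISIX output or an agreed check.

```python
import re
from typing import Dict, List, Tuple

# Hand-written sample in the Prometheus text exposition format.
# Metric names, labels, and values are assumptions for illustration only.
SAMPLE = """\
# TYPE apisix_etcd_reachable gauge
apisix_etcd_reachable 1
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="1"} 120
apisix_http_status{code="502",route="1"} 3
"""

# name, optional {labels}, then the sample value (a trailing timestamp is ignored)
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_metrics(text: str) -> List[Sample]:
    """Turn exposition-format text into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples: List[Sample], metric: str, **wanted: str) -> float:
    """Sum all samples of `metric` whose labels match `wanted`."""
    return sum(
        value
        for name, labels, value in samples
        if name == metric and all(labels.get(k) == v for k, v in wanted.items())
    )


def check(samples: List[Sample], max_error_ratio: float = 0.05) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems: List[str] = []
    if total(samples, "apisix_etcd_reachable") != 1:
        problems.append("etcd is not reachable")
    requests = total(samples, "apisix_http_status")
    errors = sum(
        value
        for name, labels, value in samples
        if name == "apisix_http_status" and labels.get("code", "").startswith("5")
    )
    if requests and errors / requests > max_error_ratio:
        problems.append("too many 5xx responses")
    return problems


if __name__ == "__main__":
    print(check(parse_metrics(SAMPLE)))
```

In a real job, `SAMPLE` would be replaced by the body fetched from wherever the metrics are exposed (for example with `urllib.request.urlopen`), and the job would fail when `check` returns a non-empty list.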
