I think Prometheus is a good idea: we can get the distribution of HTTP response codes, the number of requests, etc. But there is a question that needs to be considered: what if the Prometheus service itself crashes under Chaos Mesh?
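One possible way to reduce that dependency, offered only as a sketch and not something this thread has agreed on, is for the test to read the Prometheus-format metrics text from APISIX itself, so the check still works while the Prometheus server is down. The metric name `apisix_http_status` and its `code` label are assumptions about what the APISIX prometheus plugin exports, and `status_code_distribution` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# One sample line of the Prometheus text exposition format looks like:
#   metric_name{label="value",...} 42
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def status_code_distribution(metrics_text, metric_name="apisix_http_status"):
    """Sum the samples of `metric_name` per HTTP status code.

    `metrics_text` is raw exposition-format text. The default metric name
    and the `code` label are assumptions, not confirmed by this thread.
    """
    counts = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue  # not a sample of the metric we care about
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        counts[labels.get("code", "unknown")] += float(match.group("value"))
    return dict(counts)


# Hypothetical metrics text, made up for illustration only.
SAMPLE = """\
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="1"} 40
apisix_http_status{code="200",route="2"} 2
apisix_http_status{code="502",route="1"} 3
"""

print(status_code_distribution(SAMPLE))
```

A chaos test could fetch this text before and after injecting a fault and compare the two distributions, with the Prometheus server kept only for plotting.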

Thanks,
Ming Wen, Apache APISIX PMC Chair
Twitter: _WenMing

YuanSheng Wang <[email protected]> wrote on Tue, Jan 19, 2021 at 9:42 AM:

> Implementing a demo script is useful, agree +1
>
> > By 1.20 (Wed): finish writing the demo script, and present the metrics
> > of APISIX with Grafana
>
> I have some doubts: why do we need to use Grafana here?
> If it is done in CI, it seems easier to access Prometheus directly.
>
> On Mon, Jan 18, 2021 at 11:45 PM Shuyang Wu <[email protected]> wrote:
>
> > Hi Community,
> >
> > It's a bit of a shame and awkward to resume this feature this late ;(
> > But gladly I have some new thoughts about it:
> >
> > After some more investigation of how people make use of chaos
> > engineering, to see how things are going after certain chaos takes
> > effect, it would be better to use Prometheus/Grafana to plot the
> > metrics of APISIX performance, rather than only focusing on nginx
> > logs. Also, since chaos is more about mocking problems faced in
> > production, directly using monitoring tools could let us see what
> > users are facing.
> >
> > To use Prometheus, we need a demo that runs the basic functions of
> > APISIX, like a certain amount of traffic, and new rules set at a
> > certain time interval. It seems we do not have that kind of demo, so
> > I plan to write a simple script to implement these features.
> >
> > With monitoring tools and the demo, we could then easily run
> > different kinds of chaos and see how things go. When we find
> > something interesting and useful, we could then standardize it, write
> > a test case for the scenario, and put it into CI. With the
> > experiments done before, verifying a certain case is not that hard,
> > so what we should focus more on is finding those interesting
> > scenarios.
> >
> > A rough time plan would be:
> >   By 1.20 (Wed): finish writing the demo script, and present the
> >     metrics of APISIX with Grafana
> >   By 1.22 (Fri): apply network chaos and see how APISIX works without
> >     etcd. Better to test with different chaos cases
> >   By 1.24 (Sun): write a test case about the network chaos, and run
> >     it on CI
> >   Future: more chaos cases!
> >
> > The most uncertain part for me is the demo: I'm unsure whether we
> > already have that kind of demo and, if we don't, about some details
> > of writing the script (like what normal traffic for APISIX is).
> > Any suggestions are welcome!!
> >
> > Best,
> > Shuyang
> >
> > Shuyang Wu <[email protected]> wrote on Mon, Nov 16, 2020 at 12:44 PM:
> >
> > > Hi Community,
> > >
> > > Nowadays, we have unit tests, integration tests, and e2e tests to
> > > ensure the fault tolerance of APISIX. But there are still some
> > > problems, like network delay and CPU stress, that are not covered
> > > by the above tests. Thus, it would be a good idea to introduce
> > > chaos engineering, to simulate different types of faults and test
> > > the performance of APISIX in these circumstances.
> > >
> > > To deploy chaos engineering, ChaosMesh[1] could be a good choice
> > > for us. It has several benefits over other chaos engineering tools:
> > >
> > >    1. ChaosMesh is a CNCF sandbox project and has quite an active
> > >    community, which ensures the project will keep getting better
> > >    and that we can get help when needed.
> > >    2. ChaosMesh supports GitHub Actions, so once we set up the
> > >    workflow for this integration, it would be easy to run the tests
> > >    in our daily work.
> > >    3. ChaosMesh currently supports most types of chaos and is
> > >    adding more. Although we might not need that much for now, it is
> > >    a good point when we decide to test more with it.
> > >    BTW, the chaos types ChaosMesh supports[2] for now (Nov. 16,
> > >    2020) include pod chaos, network chaos, stress chaos, io chaos,
> > >    time chaos, kernel chaos, HTTP chaos, and DNS chaos.
> > >
> > > Following the principles of chaos engineering, there are two main
> > > parts we need to care about: 1. what we should test and 2. how to
> > > prove correctness after chaos injection.
> > >
> > > As for what we have now, the current problems we encounter and
> > > need to simulate are:
> > >
> > >    1. the connection with etcd is unstable
> > >    2. etcd failure
> > >    3. problems when cpu/memory/disk are stressed out
> > >
> > > And the methods to test correctness include:
> > >
> > >    1. error log of Nginx and APISIX
> > >    2. whether cpu/memory use of APISIX is abnormally high
> > >    3. whether wrk benchmarking would fail
> > >
> > > You are welcome to provide other problems or correctness checks
> > > that you might find useful for this~
> > >
> > >
> > > [1] https://chaos-mesh.org/
> > >
> > > [2] https://chaos-mesh.org/docs/chaos_experiments
> > >
> > >
> > > Thanks,
> > >
> > > Shuyang Wu
> > >
> >
>
> --
> *MembPhis*
> My GitHub: https://github.com/membphis
> Apache APISIX: https://github.com/apache/apisix
