Re: Proposal: deploy ChaosMesh on APISIX, to simulate more faults

Ming Wen Mon, 18 Jan 2021 22:53:14 -0800

I think Prometheus is a good idea: we can get the distribution of HTTP
response codes, the number of requests, and so on.
But there is one question to consider: what if the Prometheus service itself
crashes while Chaos Mesh is injecting faults?
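
To make the first point concrete, here is a minimal sketch of how the response
code distribution could be read from Prometheus. It assumes the standard
Prometheus HTTP API (`/api/v1/query`) and that the APISIX prometheus plugin
exposes a counter named `apisix_http_status` with a `code` label. The base URL
and the helper names are hypothetical, so treat this as an illustration rather
than the final script.

```python
from urllib.parse import urlencode

# Assumption: metric and label names as exposed by the APISIX prometheus plugin.
QUERY = "sum by (code) (apisix_http_status)"


def build_query_url(base_url: str, query: str = QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def status_distribution(payload: dict) -> dict:
    """Turn an instant-query JSON response into {status code: share of requests}."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    counts = {}
    for series in payload["data"]["result"]:
        code = series["metric"].get("code", "unknown")
        # Each sample is [timestamp, "value"]; the value arrives as a string.
        counts[code] = counts.get(code, 0.0) + float(series["value"][1])
    total = sum(counts.values())
    return {code: (n / total if total else 0.0) for code, n in counts.items()}
```

The parsing is kept separate from the HTTP call so it can be checked offline.
If Prometheus itself is down, the request simply fails, and the caller has to
decide whether that counts as a test failure.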


Thanks,
Ming Wen, Apache APISIX PMC Chair
Twitter: _WenMing


On Tue, Jan 19, 2021 at 9:42 AM YuanSheng Wang <[email protected]> wrote:

> Implementing a demo script is useful. Agreed, +1.
>
> >  By 1.20 (Wed):  finish writing the demo script, and present the metrics
> > of APISIX with Grafana
>
> I have one doubt: why do we need Grafana here?
> If this runs in CI, it seems easier to query Prometheus directly.
>
>
> On Mon, Jan 18, 2021 at 11:45 PM Shuyang Wu <[email protected]> wrote:
>
> > Hi Community,
> >
> > It's a bit embarrassing to resume this feature so late ;( But happily
> > I have some new thoughts about it:
> >
> > After investigating further how people use chaos engineering, I think
> > that, to see how things behave after a given chaos takes effect, it would
> > be better to use Prometheus/Grafana to plot APISIX performance metrics
> > rather than focusing only on nginx logs. Also, since chaos is about
> > reproducing problems faced in production, using monitoring tools directly
> > lets us see what users would see.
> >
> > To use Prometheus, we need a demo that exercises the basic functions of
> > APISIX, such as a certain amount of traffic and new rules set at a fixed
> > time interval. It seems we do not have that kind of demo, so I plan to
> > write a simple script to implement these features.
> >
> > With monitoring tools and the demo, we could then easily run different
> > kinds of chaos and see how things go. When we find something interesting
> > and useful, we could standardize it, write a test case for the scenario,
> > and put it into CI. With the experiments done beforehand, verifying a
> > given case is not that hard, so what we should focus on more is finding
> > those interesting scenarios.
> >
> > A rough time plan would be:
> >     By 1.20 (Wed):  finish writing the demo script, and present the
> >                     metrics of APISIX with Grafana
> >     By 1.22 (Fri):  apply network chaos and see how APISIX works without
> >                     etcd. Better to test with different chaos cases
> >     By 1.24 (Sun):  write a test case for the network chaos and run it
> >                     on CI
> >     Future:         more chaos cases!
> >
> > The most uncertain part for me is the demo: I'm unsure whether we already
> > have that kind of demo and, if we don't, about some details of writing the
> > script (like what normal traffic for APISIX looks like).
> > Any suggestions are welcome!!
> >
> > Best,
> > Shuyang
> >
> > On Mon, Nov 16, 2020 at 12:44 PM Shuyang Wu <[email protected]> wrote:
> >
> > > Hi Community,
> > >
> > > Nowadays, we have unit tests, integration tests, and e2e tests to
> > > ensure the fault tolerance of APISIX. But some problems, like network
> > > delay and CPU stress, are still not covered by these tests. Thus, it
> > > would be a good idea to introduce chaos engineering to simulate
> > > different types of faults and test the performance of APISIX under
> > > these circumstances.
> > >
> > > To deploy chaos engineering, ChaosMesh[1] could be a good choice for
> > > us. It has several advantages over other chaos engineering tools:
> > >
> > >    1. ChaosMesh is a CNCF sandbox project and has quite an active
> > >    community, which suggests the project will keep improving and that
> > >    we can get help when needed.
> > >    2. ChaosMesh supports GitHub Actions, so once we set up the workflow
> > >    for this integration, it would be easy to run the tests in our daily
> > >    work.
> > >    3. ChaosMesh already supports most types of chaos and is adding
> > >    more. Although we might not need that many for now, it is a plus
> > >    when we decide to test more with it.
> > >    BTW, the chaos types ChaosMesh supports[2] for now (Nov. 16, 2020)
> > >    include pod chaos, network chaos, stress chaos, io chaos, time chaos,
> > >    kernel chaos, HTTP chaos, and DNS chaos.
> > >
> > > Following the principles of chaos engineering, there are two main
> > > parts we need to care about: 1. what we should test and 2. how to prove
> > > correctness after chaos injection.
> > >
> > > From what we have so far, the problems we currently encounter and
> > > need to simulate are:
> > >
> > >    1. the connection with etcd is unstable
> > >    2. etcd failure
> > >    3. problems when CPU/memory/disk is under stress
> > >
> > > And the methods to test correctness include:
> > >
> > >    1. error log of Nginx and APISIX
> > >    2. whether cpu/memory use of APISIX is abnormally high
> > >    3. whether wrk benchmarking would fail
> > >
> > > You are welcome to suggest other problems or correctness checks that
> > > you might find useful for this~
> > >
> > >
> > > [1] https://chaos-mesh.org/
> > >
> > > [2] https://chaos-mesh.org/docs/chaos_experiments
> > >
> > >
> > > Thanks,
> > >
> > > Shuyang Wu
> > >
> >
>
>
> --
>
> *MembPhis*
> My GitHub: https://github.com/membphis
> Apache APISIX: https://github.com/apache/apisix
>
