Re: Proposal: deploy ChaosMesh on APISIX, to simulate more faults

Wo Soyoung Mon, 18 Jan 2021 23:05:02 -0800

Hi Ming,

I think that's not a problem for us. The targets of Chaos Mesh are all
virtual machines, and we normally limit the chaos scope to a certain
Kubernetes node or pod, so it won't affect components outside that scope.
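
As a rough illustration of limiting the scope, here is a minimal sketch that builds a NetworkChaos manifest whose selector only matches pods with a given label in a given namespace. The namespace `chaos-demo`, the label `app: etcd`, and the resource name are hypothetical, and the field names follow my understanding of the Chaos Mesh `v1alpha1` CRD, so please check them against the installed version before using it.

```python
import json


def scoped_network_delay(namespace, pod_labels, latency="100ms", duration="30s"):
    """Build a NetworkChaos manifest that only targets pods matching the selector.

    Field names are assumed from the Chaos Mesh v1alpha1 CRD; verify them
    against the version you actually deploy.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        # Hypothetical resource name, for illustration only.
        "metadata": {"name": "apisix-etcd-delay", "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            # The selector is what keeps the chaos inside the chosen scope.
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": pod_labels,
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


# Hypothetical scope: only pods labeled app=etcd in the chaos-demo namespace.
manifest = scoped_network_delay("chaos-demo", {"app": "etcd"})
print(json.dumps(manifest, indent=2))
```

With a selector like this, a Prometheus pod without the matching label (or in another namespace) would not be picked as a target.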


Ming Wen <[email protected]> wrote on Tue, Jan 19, 2021 at 2:53 PM:

> I think Prometheus is a good idea: we can get the distribution of HTTP
> response codes, the number of requests, etc.
> But there is a question that needs to be considered: what if the Prometheus
> service itself crashes under Chaos Mesh?
>
> Thanks,
> Ming Wen, Apache APISIX PMC Chair
> Twitter: _WenMing
>
>
> YuanSheng Wang <[email protected]> wrote on Tue, Jan 19, 2021 at 9:42 AM:
>
> > Implementing a demo script is useful, agree +1
> >
> > >  By 1.20 (Wed):  finish writing the demo script, and present the
> > >  metrics of APISIX with Grafana
> >
> > I have a question: why do we need to use Grafana here?
> > If it is done in CI, it seems easier to access Prometheus directly.
> >
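
To make "access Prometheus directly" concrete, here is a minimal sketch of how a CI step might read the response-code distribution through the Prometheus HTTP API (`/api/v1/query`). The metric name `apisix_http_status` with a `code` label is what I understand the APISIX prometheus plugin to export, and the base URL is an assumption; both should be checked against the real setup.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def status_code_counts(payload):
    """Map each `code` label to its sample value in an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed")
    counts = {}
    for sample in payload["data"]["result"]:
        code = sample["metric"].get("code", "unknown")
        # Each instant-vector sample is [timestamp, "value as string"].
        counts[code] = counts.get(code, 0.0) + float(sample["value"][1])
    return counts


def fetch_status_code_counts(base_url,
                             promql="sum by (code) (apisix_http_status)",
                             timeout=5):
    """Query Prometheus and return {status code: count}.

    The default PromQL assumes the metric name exported by the APISIX
    prometheus plugin; adjust it if your deployment differs.
    """
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return status_code_counts(json.load(resp))
```

In CI this could be called as `fetch_status_code_counts("http://127.0.0.1:9090")` (assumed address). If Prometheus itself is down, `urlopen` raises an error, which the test can report separately from an APISIX failure.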
> >
> > On Mon, Jan 18, 2021 at 11:45 PM Shuyang Wu <[email protected]> wrote:
> >
> > > Hi Community,
> > >
> > > It's a bit embarrassing to resume this feature so late ;( But happily,
> > > I have some new thoughts about it:
> > >
> > > After some more investigation into how people use chaos engineering,
> > > I think that to see how things go after a given chaos takes effect, it
> > > would be better to use Prometheus/Grafana to plot APISIX performance
> > > metrics, rather than focusing only on nginx logs. Also, since chaos is
> > > mostly about mocking problems faced in production, directly using
> > > monitoring tools lets us see what users would be facing.
> > >
> > > To use Prometheus, we need a demo that exercises the basic functions of
> > > APISIX, such as a certain amount of traffic and new rules set at a
> > > certain time interval. It seems we do not have that kind of demo, so I
> > > plan to write a simple script to implement these features.
> > >
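
One possible shape for the traffic half of such a demo script is sketched below. It sends requests at a fixed rate and tallies the status codes it sees. The target URL, the request count, and the rate are all assumptions (9080 is the default APISIX proxy port); setting new rules on an interval is left out.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def send_request(url, timeout=2.0):
    """Send one GET request and return the HTTP status (0 on connection failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code we want to count.
        return err.code
    except OSError:
        # Covers URLError, timeouts, and refused connections.
        return 0


def run_traffic(url, total_requests, rate_per_second,
                send=send_request, sleep=time.sleep):
    """Send `total_requests` requests at a steady rate and count status codes."""
    interval = 1.0 / rate_per_second
    codes = Counter()
    for _ in range(total_requests):
        codes[send(url)] += 1
        sleep(interval)
    return codes


if __name__ == "__main__":
    # Hypothetical route on a local APISIX instance.
    print(run_traffic("http://127.0.0.1:9080/get", total_requests=100,
                      rate_per_second=20))
```

The `send` and `sleep` parameters are injectable so the loop can be tested without a running gateway.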
> > > With monitoring tools and the demo, we could then easily run different
> > > kinds of chaos and see how things go. When we find something
> > > interesting and useful, we could standardize it, write a test case for
> > > the scenario, and put it into CI. Given the earlier experiments,
> > > verifying a certain case is not that hard, so what we should focus on
> > > more is finding those interesting scenarios.
> > >
> > > A rough time plan would be:
> > >     By 1.20 (Wed):  finish writing the demo script, and present the
> > >                     metrics of APISIX with Grafana
> > >     By 1.22 (Fri):  apply network chaos and see how APISIX works without
> > >                     etcd. Better to test with different chaos cases
> > >     By 1.24 (Sun):  write a test case for the network chaos, and run it
> > >                     on CI
> > >     Future:         more chaos cases!
> > >
> > > The most uncertain part for me is the demo: I'm unsure whether we
> > > already have that kind of demo and, if we don't, about some details of
> > > writing the script (like what normal traffic for APISIX looks like).
> > > Any suggestions are welcome!!
> > >
> > > Best,
> > > Shuyang
> > >
> > > Shuyang Wu <[email protected]> wrote on Mon, Nov 16, 2020 at 12:44 PM:
> > >
> > > > Hi Community,
> > > >
> > > > Nowadays, we have unit tests, integration tests, and e2e tests to
> > > > ensure the fault tolerance of APISIX. But there are still some
> > > > problems, like network delay and CPU stress, that are not covered by
> > > > the above tests. Thus, it would be a good idea to introduce chaos
> > > > engineering to simulate different types of faults and test the
> > > > performance of APISIX in these circumstances.
> > > >
> > > > To deploy chaos engineering, ChaosMesh[1] could be a good choice for
> > > > us. It has several benefits over other chaos engineering tools:
> > > >
> > > >    1. ChaosMesh is a CNCF sandbox project and has quite an active
> > > >    community, which suggests the project will keep improving and we
> > > >    could get help when needed.
> > > >    2. ChaosMesh supports GitHub Actions, so once we set up the
> > > >    workflow for this integration, it would be easy to run the tests
> > > >    in our daily work.
> > > >    3. ChaosMesh already supports most types of chaos and is adding
> > > >    more. Although we might not need that many for now, it is a good
> > > >    point when we decide to test more with it.
> > > >    BTW, the chaos types ChaosMesh supports[2] for now (Nov. 16, 2020)
> > > >    include pod chaos, network chaos, stress chaos, IO chaos, time
> > > >    chaos, kernel chaos, HTTP chaos, and DNS chaos.
> > > >
> > > > Following the principles of chaos engineering, there are two main
> > > > parts we need to care about: 1. what we should test and 2. how to
> > > > prove correctness after chaos injection.
> > > >
> > > > As for what we have so far, the current problems we encounter and
> > > > need to simulate are:
> > > >
> > > >    1. the connection with etcd is unstable
> > > >    2. etcd failure
> > > >    3. problems when cpu/memory/disk stressed out
> > > >
> > > > And the methods to test correctness include:
> > > >
> > > >    1. error log of Nginx and APISIX
> > > >    2. whether cpu/memory use of APISIX is abnormally high
> > > >    3. whether wrk benchmarking would fail
> > > >
> > > > You are welcome to suggest other problems or correctness checks that
> > > > you find useful for this~
> > > >
> > > >
> > > > [1] https://chaos-mesh.org/
> > > >
> > > > [2] https://chaos-mesh.org/docs/chaos_experiments
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Shuyang Wu
> > > >
> > >
> >
> >
> > --
> >
> > *MembPhis*
> > My GitHub: https://github.com/membphis
> > Apache APISIX: https://github.com/apache/apisix
> >
>
