Hi Dinesh

The problem with the current dtests is, ironically, that when you run them on
a machine that is too powerful, as in my case, some tests generate so much
load via the cassandra-stress tool that the nodes become unresponsive and get
killed, so the test cannot proceed.

I was testing this on a c5.9xlarge, which has 36 cores and 64 GB of memory,
and I ran this test in particular (1).

The test flow is rather simple: it creates 2 nodes, one in each DC, and
generates 2M inserts with 50 threads. Once that is done, it starts a third
node in dc2 without waiting for it to start fully, and the test verifies that
the cluster can cope with the load while that node is bootstrapping.
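
In ccm/dtest terms the flow roughly corresponds to the sketch below. It is
only an illustration of the shape of the test, not the actual code from
bootstrap_test.py, and helper names such as new_node and the exact keyword
arguments are assumptions on my side:

from tools.misc import new_node  # dtest helper for adding a node (assumed)

def test_bootstrap_under_load(self):
    cluster = self.cluster
    cluster.populate([1, 1]).start()   # one node in dc1, one in dc2
    node1 = cluster.nodelist()[0]

    # 2M inserts with 50 threads via cassandra-stress
    node1.stress(['user', 'profile=config.yaml', 'n=2M', 'no-warmup',
                  'ops(insert=1)', '-rate', 'threads=50'])

    # third node in dc2, started without waiting for it to come up fully
    node3 = new_node(cluster, data_center='dc2')
    node3.start(wait_for_binary_proto=False)
    # ... the test then verifies the cluster copes while node3 bootstraps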

Your nodes have -Xmx and -Xms set to 512M, which is a very, very low figure.
To simulate this behaviour and match the specs of your CircleCI containers, I
additionally set -Dcassandra.available_processors=8 per node. Then I ran the
same command the test itself runs:

cassandra-stress user profile=config.yaml n=2M no-warmup ops\(insert=1\)
-rate threads=50 -node 172.19.0.5
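
(In case you want to reproduce it exactly: the processor flag simply has to
end up on each node's JVM command line. Driving the nodes through ccm, that
would look roughly like the following; jvm_args is my assumption about the
parameter name, so treat it as a sketch:)

# 'cluster' is the ccm cluster object behind the two nodes
cluster.start(wait_for_binary_proto=True,
              jvm_args=['-Dcassandra.available_processors=8'])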

After some time node1 was killed and the stress tool timed out, which matches
exactly what I observed when running the dtest itself.

Strangely enough, by mistake I initially loaded data only into dc1, and in
that case the nodes survived even with 512M, but once keyspace1.user was
replicated to both dc1 and dc2, it exploded.

Increasing Xmx and Xms to 1G seems to do the job and the nodes survive, but I
am still seeing occasional timeouts.

So in general I see three fixes:

1) Increase the memory per node to something sensible, at least 1G; more is
better
2) Fix the test so that it does not time out even with 512M per node (one
possible way is sketched right after this list)
3) Run the tests on a machine with 8 cores and 16 GB as you do, but that
seems like a stupid idea in general
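
For 2), the first knob I would try is capping the request rate so that
cassandra-stress backs off instead of burying the 512M heaps. A sketch of
what the stress invocation inside the test could look like (the throttle
value is a guess and would need tuning):

# same stress call, but with the rate throttled; 5000/s is an arbitrary
# starting point, not a tested value
node1.stress(['user', 'profile=config.yaml', 'n=2M', 'no-warmup',
              'ops(insert=1)', '-rate', 'threads=50', 'throttle=5000/s'])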

I can imagine this will be an issue for a lot of tests, and by increasing the
memory per node we could get rid of a lot of problems. I briefly checked where
the per-node memory is set in dtests, but without success; I think that is
ccm's job. Is there a simple way to increase the memory per node in dtests?
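
In case it helps the discussion, my best guess, assuming ccm forwards the
environment of the test process to the node processes and cassandra-env.sh
then honours MAX_HEAP_SIZE/HEAP_NEWSIZE as usual, would be something like the
following. I have not verified it, hence the question:

import os

# guesswork: relies on cassandra-env.sh picking these up from the environment
# and on ccm passing that environment through to the nodes
os.environ['MAX_HEAP_SIZE'] = '1G'
os.environ['HEAP_NEWSIZE'] = '256M'

cluster.populate([1, 1]).start(wait_for_binary_proto=True)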

Regards

(1)
https://github.com/apache/cassandra-dtest/blob/master/bootstrap_test.py#L419-L470

On Tue, 19 Mar 2019 at 17:22, Dinesh Joshi <djos...@icloud.com.invalid>
wrote:

> Hi Stefan,
>
> The dtests have been typically flaky but are more or less stable in the
> recent past. We are working towards stabilizing them. For the dev workflow
> locally, I typically end up running a subset of the dtests via the pytest
> runner. I am not sure how others run it.
>
> I believe CircleCI results are the most consistent and accurate so far.
> You can refer to this[1] recent sample run. All tests passed. The CircleCI
> workflow has changed recently so it'll look different now but the point
> being that the tests are more or less stable. I would caution you if you're
> running on a free tier, it'll take a lot of time and the test results are
> unreliable as the free tier does not have enough resources. To compare your
> setup, we typically run the dtests in 100 CircleCI containers concurrently.
> Each container has 8 VCPUs and 16GB RAM. The run takes 20-30 minutes
> depending on the resource availability.
>
> Thanks,
>
> Dinesh
>
>
> [1] https://circleci.com/workflow-run/80804bb2-dafb-445a-acca-53401ca02806
>
>
> > On Mar 18, 2019, at 5:46 PM, Stefan Miklosovic <
> > stefan.mikloso...@instaclustr.com> wrote:
> >
> > Hi,
> >
> > I am running large and "simple" dtests (executed via
> > cassandra-builds/build-scripts/cassandra-dtest-pytest.sh) and I find
> > myself
> > quite frustrated as I do not know if there are errors because tests are
> > flaky or there are legit issues which produced them.
> >
> > It is "simple" to check it one by one when tests are stable and there is
> > couple of them but when there are hundreds of tests, whole test run takes
> > ~7 hours and it is not stable, it is like finding a needle in a haystack.
> > Sometimes 15 tests fail, sometimes just 10 ... Sometimes there are
> > timeouts, sometimes not.
> >
> > For basic dtests I am getting a stable three errors out of 900, I think,
> > which is quite good. I supplied one patch here (1) so only two of them
> > are failing
> > now consistently (it is not merged yet).
> >
> > Can you point me to your builds and what results you are getting there?
> > Maybe something is wrong with my setup or these dtests are "expected" to
> > be
> > flaky from time to time?
> >
> > What stability are you getting with official builds when it comes to
> > dtests? How often they are run? As part of every pull request / change?
> > Do
> > you commit only on "0 dtests failed"?
> >
> > Are there some recommendations as on what setup and machine these tests
> > should run? I am running them on c5.9xlarge (36 cores with 64 GB of
> > memory)
> > on fairly recent Ubuntu with latest Java 8. I am trying to supply all
> > needed parameters and libs in order to start Cassandra smoothly without
> > any
> > warnings / errors (there are these checks which check if your environment
> > is all fine).
> >
> > I am testing current trunk.
> >
> > Thanks for any input how to make them more stable if there are some tips
> > and tricks.
> >
> > (1) https://github.com/apache/cassandra-dtest/pull/47
> >
> > Stefan Miklosovic
>
>
