Re: Failed Dtest will block cutting releases

2016-12-04 Thread Benjamin Roth
Hi Michael,

Thanks for this update. As a newbie, it helped me understand the
organization and processes a little bit better.

I don't know how many CS devs know this, but I love this rule (actually
the whole book):
http://programmer.97things.oreilly.com/wiki/index.php/The_Boy_Scout_Rule

To be honest, I'm personally not the kind of guy who walks through lists
looking for issues that could be picked up and done, but if I encounter
anything (a test, some weird code, a design, whatever) that deserves to
be improved, analyzed or fixed and I have a little time left, I try to
improve or fix it.

At the moment I am still quite new around here and in the process of
understanding the whole picture of Cassandra's behaviour, code, processes
and organization. I hope you can forgive me if I don't get the point
perfectly right every time - but I am eager to learn and improve.

Thanks for your patience!


Re: Failed Dtest will block cutting releases

2016-12-04 Thread Michael Shuler
Thanks for your thoughts on testing Apache Cassandra; I share them.

I just wanted to note that the known_failure() annotations were recently
removed from cassandra-dtest [0]: the annotations were not being removed
when bugs were fixed, and the internal webapp that we were using to parse
them has been broken for quite some time, with no fix in sight. The
webapp was removed and we dropped all the known_failure() annotations.
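
For anyone who hasn't seen them: these were per-test markers pointing at
the JIRA ticket tracking the failure. A minimal, hypothetical sketch of
the idea (not the actual cassandra-dtest decorator, whose signature
differed) would look something like this:

    import functools

    # Hypothetical sketch: tag a test with the JIRA ticket tracking its
    # known failure so a report parser can group the results.
    def known_failure(jira_url, flaky=False):
        def wrap(test_func):
            @functools.wraps(test_func)
            def wrapper(*args, **kwargs):
                return test_func(*args, **kwargs)
            wrapper.known_failure = {'jira_url': jira_url, 'flaky': flaky}
            return wrapper
        return wrap

    # Placeholder ticket number, for illustration only.
    @known_failure(jira_url='https://issues.apache.org/jira/browse/CASSANDRA-XXXXX',
                   flaky=True)
    def test_example():
        pass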

The test-failure JIRA label [1] is what we've been using during test run
triage. The tickets assigned to 'DS Test Eng' still need to be looked at
to figure out whether each one is a test problem or a Cassandra problem.
Typically, the Unassigned tickets were determined to be possibly a
Cassandra issue. If you enjoy test analysis and fixing tests, please jump
in and analyze/fix them!

[0] https://github.com/riptano/cassandra-dtest/pull/1399
[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRA%20AND%20labels%20%3D%20test-failure%20AND%20resolution%20%3D%20unresolved
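
For reference, the JQL behind [1] decodes to:

    project = CASSANDRA AND labels = test-failure AND resolution = unresolved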

-- 
Kind regards,
Michael Shuler


Re: Failed Dtest will block cutting releases

2016-12-04 Thread Benjamin Roth
Sorry for jumping in so boldly before.

TL;DR:

   - I didn't mean to delete every flaky test just like that
   - To improve quality, each failing test has to be analyzed individually
   for release

More thoughts on that:

I had a closer look at some of the tests tagged as flaky and realized
that the situation here is more complex than I thought. Of course I
didn't mean to delete all the flaky tests just like that. Maybe I should
rephrase it a bit: "If a (flaky) test can't really prove something, then
it is better not to have it". Whether a test does prove something depends
on its intention, its implementation, on how flaky it really is and,
first of all, on why it is flaky.

These dtests are maybe a blessing and a curse at the same time. On the
one hand there are things you cannot test with a unit test, so you need
dtests for certain cases. On the other hand, dtests do not only test the
desired case.

   - They test the test environment (ccm, server hiccups) and, more or
   less, all components of the CS daemon that are somehow involved as well.
   - This exposes the test to many more error sources than the bare test
   case, which of course creates a lot of "unreliability" in general and
   causes flaky results.
   - It makes it hard to pin down a failure to a certain cause, such as:
      - Flaky test implementation
      - Flaky bugs in the SUT
      - Unreliable test environment
   - Analyzing every failure is a pain. But a simple "retry and skip over"
   _may_ mask a real problem (see the sketch below for what I mean).

=> Difficult situation!
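
To make the "retry and skip over" point concrete, here is a minimal,
generic sketch (my own illustration, not actual cassandra-dtest code) of
the kind of retry wrapper I mean. It turns a test green as long as one
out of N runs passes - which is exactly how a real, hard-to-reproduce bug
can get masked:

    import functools

    def retry_flaky(max_runs=3):
        # Re-run a test up to max_runs times and report success as soon
        # as one run passes. Convenient, but it hides every failure that
        # does not reproduce on retry.
        def wrap(test_func):
            @functools.wraps(test_func)
            def wrapper(*args, **kwargs):
                last_error = None
                for _ in range(max_runs):
                    try:
                        return test_func(*args, **kwargs)
                    except Exception as error:
                        last_error = error
                raise last_error
            return wrapper
        return wrap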

From my own projects and non-CS experience I can tell: flaky tests give
me a bad feeling and always leave a certain smell. I have also just
skipped them with the reasoning "Yes, I know it's flaky, I don't really
care about it". But it simply does not feel right.

A real-life example from another project:
Some weeks ago I wrote functional tests to test the integration of
SeaweedFS as a blob store backend in an image upload process. The test
case was roughly: upload an image, check that it exists on both the old
and the new image storage, delete it, check again. The test had existed
for years. I simply added some assertions to check the existence of the
uploaded files on the new storage. Funnily enough, I must have hit some
corner case with that, and from that moment on the test was flaky.
Simple URL checks started to time out from time to time. That made me
really curious. To cut a long story short: after having checked a whole
lot of things, it turned out that neither the test nor the shiny new
storage was flaky - it was the LVS load balancer. The load balancer
dropped connections reproducibly, and that happened more often with
increasing concurrency. Finally we removed LVS completely and replaced
it with DNS-RR + VRRP, which completely solved the problem, and the
tests ran happily ever after.

Usually there is no pure black and white.

   - Sometimes testing whole systems reveals problems you'd never
   have found without them
   - Sometimes they cause false alerts
   - Sometimes, skipping them masks real problems
   - Sometimes it sucks if a false alert blocks your release

If you want to be really safe, you have to analyze every single failure
and decide what kind of failure it is or could be, and whether a retry
would prove anything or not - at least when you are at a release gate. I
think this should be worth it.

There's a reason for this thread, and there's a reason why people ask
every few days which CS version is production stable. Things have to
improve over time. This applies to test implementations, test
environments, release processes, and so on. One way to do this is to
become a little bit stricter (and a bit better) with every release.
Making all tests pass at least once before a release should be rather
low-hanging fruit. Reducing the total number of flaky tests or the
"flaky fail rate" may be another future goal.

Btw, the fact of the day:
I grepped through the dtests and found that roughly 11% of all tests are
flagged with "known_failure" and roughly 8% of all tests are flagged as
"flaky". Quite impressive.


2016-12-03 15:52 GMT+01:00 Edward Capriolo :

> I think it is fair to run a flaky test again. If it is determined that it
> flaked out due to a conflict with another test or something ephemeral in a
> long-running process, it is not worth blocking a release.
>
> Just deleting it is probably not a good path.
>
> I actually enjoy writing, fixing and tweaking tests, so ping me offline or
> whatever.
>
> On Saturday, December 3, 2016, Benjamin Roth wrote:
>
> > Excuse me if I jump into an old thread, but from my experience I have a
> > very clear opinion about situations like that, as I have encountered
> > them before:
> >
> > Tests are there to give *certainty*.
> > *Would you like to pass a crossing with a green light if you cannot be
> > sure if green really means green?*
> > Do you want to rely on tests that are green, red, green, red? What if a
> > red is a real red and you missed it because you simply ignored it,
> > because it's flaky?
> >
> > IMHO t