[jira] [Comment Edited] (SOLR-10032) Create report to assess Solr test quality at a commit point.

2017-02-08 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858889#comment-15858889 ]

Mark Miller edited comment on SOLR-10032 at 2/9/17 2:26 AM:


For this next report I have switched from 16-core machines to 8-core machines. It looks like that may have made some of the more resource/environment sensitive tests pop out a little more. The first report was created on a single machine, so I went with 16 cores just to try and generate it as fast as possible. 16 cores were not strictly needed; I run 10 tests at a time on my 6-core machine with similar results. It may even be a little too much CPU for our use case, even when running 10 instances of a test in parallel.

I have moved on from just using one machine, though. It took roughly 2-3 days to generate the first report, as I was still working out some speed issues. The first run had around 2 minutes and 40 seconds of 'build' overhead per test run for most of the report, and just barely enough RAM to handle 10 tests at a time - for a few test fails on heavy tests (e.g. hdfs) there was not enough RAM, as there is also no swap space on those machines. Anyway, beasting ~900 tests is time consuming even in the best case.

Two tests also hung, and that slowed things down a bit. Now I am more on the lookout for that - I've @BadAppled a test method involved in producing one of the hangs, and for this report I locally @BadAppled the other. They both look like legit bugs to me. I should have used @Ignore for the second hang, since the test report runs @BadApple and @AwaitsFix tests. Losing one machine for a long time when you are using 10 costs you a lot in report creation time. Now I at least know to pay attention to my email while running reports. Luckily, the instances I'm using will auto-pause after 30 minutes of no real activity and I get an email, so now I can be a bit more vigilant while creating the report. It also helps that I've gotten down to about 4 hours to create the report.

I used 5 16-core machines for the second report. I can't recall exactly how long that took, but it was still in the realm of an all-night job.

For this third report I am using 10 8-core machines.

I think we should be using those annotations like this (a rough sketch in test code follows the list):

* @AwaitsFix - we basically know something key is broken and it's fairly clear what the issue is - we are waiting for someone to fix it. You don't expect these tests to be run regularly, but you can pass a system property to run them.
* @BadApple - the test is too flakey and fails too much for unknown or varied reasons. You do expect that some test runs would or could still include these tests and give some useful coverage information - flakiness in many integration-type tests can be the result of unrelated issues and clear up over time. Or get worse.
* @Ignore - the test is never run; it can hang, OOM, or do something negative to other tests.
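
As a rough sketch of what that usage looks like in test code (not taken from any actual test - the class name, method names, and issue links are placeholders, and it assumes the Lucene test framework's @AwaitsFix/@BadApple annotations with their bugUrl attribute plus the tests.awaitsfix / tests.badapples system properties):

{code:java}
import org.apache.lucene.util.LuceneTestCase;
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
import org.apache.lucene.util.LuceneTestCase.BadApple;
import org.junit.Ignore;

public class HypotheticalFlakeyTest extends LuceneTestCase {

  // Something key is known to be broken and the issue is fairly clear;
  // skipped by default, only run when -Dtests.awaitsfix=true is passed.
  @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-NNNNN")
  public void testKnownBrokenFeature() throws Exception {
    // ...
  }

  // Too flakey, fails too much for unknown or varied reasons; can still be
  // included in runs (toggled with -Dtests.badapples) so it keeps giving
  // some useful coverage.
  @BadApple(bugUrl = "https://issues.apache.org/jira/browse/SOLR-NNNNN")
  public void testSometimesFlakey() throws Exception {
    // ...
  }

  // Never run: it can hang, OOM, or do something negative to other tests.
  @Ignore("hangs; see SOLR-NNNNN")
  public void testThatHangs() throws Exception {
    // ...
  }
}
{code}

For the beasting reports above, @BadApple and @AwaitsFix tests are still run via those system properties, while @Ignore'd tests are never run.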

I'll put up another report soon. I probably won't do another one until I have tackled the above flakey-rating issues; I'm hoping that's just a couple to a few weeks at most, but that may be wishful thinking.



[jira] [Comment Edited] (SOLR-10032) Create report to assess Solr test quality at a commit point.

2017-02-03 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851620#comment-15851620 ]

Mark Miller edited comment on SOLR-10032 at 2/3/17 3:36 PM:


This report is not made to reproduce all fails.

Some tests will fail for a variety of reasons: resources are too low, the Java/OS version, which tests they happen to run against, etc.

So this will not exhaustively produce all flakey tests, nor is it trying to. In fact, I've tried to make sure there are *plenty* of resources to run the tests reasonably. My goal is to find flakey tests that pop out easily, not ones that fail only under very specific conditions. This should target and find obvious problems and then help clamp down on minor flakey tests in time. Jenkins and individual devs will still play an important role in catching outliers and other, hopefully much less common, fails.

That said, in my experience most things still end up popping out if you beast long enough. Beasting for 100 runs would probably surface even more flakey tests. Producing this report with 30 runs is already quite time-expensive though ;) I'll eventually do some longer reports as we whittle down the obvious issues. It's really a judgment call of time vs. coverage, and in these early reports 30 runs seemed like a reasonable bar to pass.

The other tests are not all cleared, but here is a very reasonable list of tests we should focus on - tests that appear to fail too much even in a good, clean environment.

I will also focus beasting of 100 runs or more on the tests that this report surfaces as flakey, and likely some tests will enter and drop off the report from one to the next. Those tests will end up needing more extensive individual beasting to pass as 100% clean.

'rock-solid' is not really a definitive judgment, just the rating for no fails. If you did a single run and it passed, it would be rock-solid. I can change that to something a little less confusing.

If you do have a specific test that seems to fail for you, I'm happy to beast it more extensively and let you know if fails pop out. I'll try ShardSplitTest. It may be that it has more severe resource problems when it ends up running alongside other intensive tests in 'ant test'.

Looking at ShardSplitTest, this does give us some more info - we know it fails fairly often for you and for Jenkins, but that in a clean, resource-friendly environment it can pass for 30 runs, 10 at a time. That gives some clues for hardening that test.



[jira] [Comment Edited] (SOLR-10032) Create report to assess Solr test quality at a commit point.

2017-02-02 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851079#comment-15851079 ]

Mark Miller edited comment on SOLR-10032 at 2/3/17 5:18 AM:


Here is a second test report for a commit from 2/1.

A couple fails in the first run had to do with RAM issues, so for the second 
report I used a lot more RAM and did 10 at a time instead of 12.

I've been making other small iterative improvements as well.

https://docs.google.com/spreadsheets/d/1FndoyHmihaOVL2o_Zns5alpNdAJlNsEwQVoJ4XDWj3c/edit?usp=sharing



> Create report to assess Solr test quality at a commit point.
> 
>
> Key: SOLR-10032
> URL: https://issues.apache.org/jira/browse/SOLR-10032
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mark Miller
>Assignee: Mark Miller
> Attachments: Lucene-Solr Master Test Beast Results 
> 01-24-2017-9899cbd031dc3fc37a384b1f9e2b379e90a9a3a6 Level Medium- Running 30 
> iterations, 12 at a time .pdf, Lucene-Solr Master Test Beasults 
> 02-01-2017-bbc455de195c83d9f807980b510fa46018f33b1b Level Medium- Running 30 
> iterations, 10 at a time.pdf
>
>
> We have many Jenkins instances blasting tests - some official, some Policeman, and I and others have or had our own - and the email trail proves the power of the Jenkins cluster to find test fails.
> However, I still have a very hard time with some basic questions:
> What tests are flakey right now? Which test fails actually affect devs most? Did I break it? Was that test already flakey? Is that test still flakey? What are our worst tests right now? Is that test getting better or worse?
> We really need a way to see exactly which tests are the problem, not because of OS or environmental issues, but due to more basic test quality issues: which tests are flakey and how flakey they are at any point in time.






[jira] [Comment Edited] (SOLR-10032) Create report to assess Solr test quality at a commit point.

2017-01-25 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838489#comment-15838489 ]

Mark Miller edited comment on SOLR-10032 at 1/25/17 8:12 PM:

I think we would create too much of a test coverage problem if we take that approach.

I'd instead like to push gradually, though perhaps quickly in 'Apache time'.

First I will create critical issues for the worst offenders; if they cannot be fixed pretty much right away, I will badapple or awaitsfix them.

I'll also create critical issues for other fails above a certain threshold and ping the appropriate JIRA issues to try and bring attention to them. Over time we can ignore these as well if they are not addressed and no one finds them important enough to keep for coverage.

We can then tighten this net down to a certain level. 

I think if we commit to following through on some progress, we can take an 
iterative approach that gives people ample time to fix important tests and us 
time to evaluate loss of important test coverage (even flakey test coverage is 
very valuable info to us right now, and some flakey tests pass 90%+ of the time 
- we want to harden them, but they provide critical coverage in many cases).

I'll also ping the dev list with a summary occasionally to bring attention to 
this and the current state.





