[ https://issues.apache.org/jira/browse/SOLR-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858889#comment-15858889 ]

Mark Miller edited comment on SOLR-10032 at 2/9/17 2:26 AM:
------------------------------------------------------------

For this next report I have switched to an 8-core machine from a 16-core 
machine. It looks like that may have made some of the more 
resource/environment-sensitive tests pop out a little more. The first report 
was created on a single machine, so I went with 16 cores just to try and 
generate it as fast as possible. 16 cores was not strictly needed; I run 10 
tests at a time on my 6-core machine with similar results. It may even be a 
little too much CPU for our use case, even when running 10 instances of a test 
in parallel.

I have moved on from using just one machine, though. It took about 2-3 days to 
generate the first report, as I was still working out some speed issues. The 
first run had around 2 minutes and 40 seconds of 'build' overhead per test run 
for most of the report, and just barely enough RAM to handle 10 tests at a 
time - for a few of the heavier tests (e.g. HDFS) there was not enough RAM, 
which caused some test failures, since there is also no swap space on those 
machines. Anyway, beasting ~900 tests is time-consuming even in the best case.

Two tests also hung, and that slowed things down a bit. Now I am more on the 
lookout for that - I've @BadAppled a test method involved in producing one of 
the hangs, and for this report I locally @BadAppled the other. They both look 
like legit bugs to me. I should have used @Ignore for the second hang, since 
the test report runs @BadApple and @AwaitsFix tests. Losing one machine for a 
long time when you are using 10 costs you a lot in report creation time. Now I 
at least know to pay attention to my email while running reports. Luckily, the 
instances I'm using will auto-pause after 30 minutes of no real activity and I 
get an email, so now I can be a bit more vigilant while creating the report. 
It also helps that I've gotten report creation down to about 4 hours.

I used 5 16-core machines for the second report. I can't recall exactly how 
long that took, but it was still in the realm of an all-night job.

For this third report I am using 10 8-core machines.

I think we should be using these annotations like this (there's a short sketch 
after the list):

* @AwaitsFix - we basically know something key is broken and it's fairly clear 
what the issue is - we are waiting for someone to fix it - you don't expect 
these tests to be run regularly, but you can pass a system property to run them.
* @BadApple - the test is too flaky and fails too much for unknown or varied 
reasons - you do expect that some test runs would or could still include these 
tests and give some useful coverage information - flakiness in many of the 
more integration-type tests can be the result of unrelated issues and clear up 
over time. Or get worse.
* @Ignore - the test is never run; it can hang, OOM, or do something negative 
to other tests.
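
To make that concrete, here is a minimal sketch of what the three annotations 
look like on test methods. The class name, method names, and bug URLs are made 
up for illustration, and the system property names in the comments are my best 
recollection of the test framework's knobs:

{code:java}
// Minimal sketch only - hypothetical class, methods, and bug URLs.
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
import org.apache.lucene.util.LuceneTestCase.BadApple;
import org.apache.solr.SolrTestCaseJ4;
import org.junit.Ignore;

public class TestAnnotationExamples extends SolrTestCaseJ4 {

  // Known, understood breakage - skipped by default and only run when someone
  // opts in (e.g. -Dtests.awaitsfix=true) to work on the fix.
  @AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/SOLR-NNNNN")
  public void testKnownBrokenFeature() throws Exception {
    // ...
  }

  // Too flaky for unknown or varied reasons - still included in some runs
  // (controlled by -Dtests.badapples) so it keeps giving coverage signal and
  // we can see whether it clears up or gets worse.
  @BadApple(bugUrl = "https://issues.apache.org/jira/browse/SOLR-NNNNN")
  public void testFlakyDistributedBehavior() throws Exception {
    // ...
  }

  // Never run - it can hang, OOM, or do something negative to other tests.
  @Ignore("can hang and tie up a whole beasting machine")
  public void testThatHangs() throws Exception {
    // ...
  }
}
{code}

The beasting report runs the @BadApple and @AwaitsFix tests as well, which is 
why a true never-run case like the second hang really wants @Ignore.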

I'll put up another report soon. I probably won't do one after that until I 
have tackled the flaky-rating issues above; I'm hoping that's just a couple to 
a few weeks at most, but that may be wishful thinking.



> Create report to assess Solr test quality at a commit point.
> ------------------------------------------------------------
>
>                 Key: SOLR-10032
>                 URL: https://issues.apache.org/jira/browse/SOLR-10032
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Tests
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>         Attachments: Lucene-Solr Master Test Beast Results 
> 01-24-2017-9899cbd031dc3fc37a384b1f9e2b379e90a9a3a6 Level Medium- Running 30 
> iterations, 12 at a time .pdf, Lucene-Solr Master Test Beasults 
> 02-01-2017-bbc455de195c83d9f807980b510fa46018f33b1b Level Medium- Running 30 
> iterations, 10 at a time.pdf
>
>
> We have many Jenkins instances blasting tests - some official, some 
> Policeman, and I and others have or have had our own - and the email trail 
> proves the power of the Jenkins cluster to find test failures.
> However, I still have a very hard time with some basic questions:
> Which tests are flaky right now? Which test failures actually affect devs 
> most? Did I break it? Was that test already flaky? Is that test still flaky? 
> What are our worst tests right now? Is that test getting better or worse?
> We really need a way to see exactly which tests are the problem - not because 
> of OS or environmental issues, but because of more basic test quality issues: 
> which tests are flaky and how flaky they are at any point in time.



