On 07/03/2014 01:27 PM, Kevin Benton wrote:
>> This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration.
>
> I think the only thing this really reflects, though, is how much the third-party CI system is mirroring Jenkins.
> A system that frequently diverges may be functioning perfectly fine and just has a vastly different code path that it is integration testing, so it is legitimately detecting failures the OpenStack CI cannot.

Great.
How do we measure the degree to which it is legitimately detecting failures?
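As a rough illustration (and only that), one starting point might be to split divergence into its two directions and count how often each occurs. The data shape below is hypothetical -- it assumes we already have, per patch set, the Jenkins vote and the third-party CI vote collected from Gerrit:

    # divergence_sketch.py -- toy divergence breakdown; votes are +1, -1 or
    # None (None meaning the third-party CI never reported on that patch set).
    from collections import Counter

    def divergence_breakdown(votes):
        """votes: iterable of (jenkins_vote, ci_vote) pairs, one per patch set."""
        counts = Counter()
        for jenkins, ci in votes:
            if ci is None:
                counts["ci_missed"] += 1        # CI did not report at all
            elif jenkins == ci:
                counts["agreed"] += 1
            elif jenkins == 1 and ci == -1:
                counts["ci_only_failure"] += 1  # candidate "legitimate catch" or noise
            else:
                counts["ci_only_success"] += 1  # CI passed something Jenkins failed
        return counts

    sample = [(1, 1), (1, -1), (-1, -1), (1, None)]
    print(divergence_breakdown(sample))
    # Counter({'agreed': 2, 'ci_only_failure': 1, 'ci_missed': 1})

The "ci_only_failure" bucket is exactly the ambiguous one: telling a legitimate catch from an unstable CI would still need a second signal, for example whether the patch was subsequently revised in the area that CI exercises.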
Thanks Kevin,
Anita.

>
> --
> Kevin Benton
>
>
> On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno <[email protected]> wrote:
>
>> On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
>>> Apologies for quoting again the top post of the thread.
>>>
>>> Comments inline (mostly thinking aloud)
>>> Salvatore
>>>
>>> On 30 June 2014 22:22, Jay Pipes <[email protected]> wrote:
>>>
>>>> Hi Stackers,
>>>>
>>>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [3] represented the results of an external CI system as "successful" or not.
>>>>
>>>> First, I want to say that Ilya and all those involved in the Stackalytics program simply want to provide the most accurate information to developers in a format that is easily consumed. While there need to be some changes in how data is shown (and the wording of things like "Tests Succeeded"), I hope that the community knows there isn't any ill intent on the part of Mirantis or anyone who works on Stackalytics. OK, so let's keep the conversation civil -- we're all working towards the same goals of transparency and accuracy. :)
>>>>
>>>> Alright, now, Anita and Kurt Taylor were asking a very poignant question:
>>>>
>>>> "But what does CI tested really mean? Just running tests? Or tested to pass some level of requirements?"
>>>>
>>>> In this nascent world of external CI systems, we have a set of issues that we need to resolve:
>>>>
>>>> 1) All of the CI systems are different.
>>>>
>>>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain.
>>>>
>>>> As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say "all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase."
>>>
>>> I think such a system already exists, and it's documented here: http://ci.openstack.org/
>>> Still, understanding it is quite a learning curve, and running it is not exactly straightforward. But I guess that's pretty much understandable given the complexity of the system, isn't it?
>>>
>>>> If the goal of the external CI systems is to produce reliable, consistent results, I feel the answer to the above is "yes", but I'm interested to hear what others think. Frankly, in the world of benchmarks, it would be unthinkable to say "go ahead and everyone run your own benchmark suite", because you would get wildly different results. A similar problem has emerged here.
>>>
>>> I don't think the particular infrastructure, which might range from an openstack-ci clone to a 100-line bash script, would have an impact on the "reliability" of the quality assessment regarding a particular driver or plugin. This is determined, in my opinion, by the quantity and nature of the tests one runs on a specific driver. In Neutron, for instance, there is a wide range of choices -- from a few test cases in tempest.api.network to the full smoketest job. As long as there is no minimal standard here, it will be difficult to assess the quality of the evaluation from a CI system, unless we explicitly take coverage into account in the evaluation.
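One minimal way to take coverage into account, sketched with made-up test names and an assumed reference list (the set of tests the upstream gate runs), would be to report a CI's vote together with the fraction of that reference set it actually executed:

    # coverage_sketch.py -- annotate a CI vote with how much of a reference
    # test list it actually ran; test names and the list are illustrative only.
    REFERENCE_TESTS = {
        "tempest.api.network.test_networks",
        "tempest.api.network.test_ports",
        "tempest.scenario.test_network_basic_ops",
    }

    def coverage_annotated_result(vote, tests_run):
        """vote: +1 or -1 as reported; tests_run: set of test names the CI executed."""
        executed = REFERENCE_TESTS & set(tests_run)
        coverage = len(executed) / len(REFERENCE_TESTS)
        return {
            "vote": vote,
            "coverage_of_reference": round(coverage, 2),
            "missing": sorted(REFERENCE_TESTS - executed),
        }

    print(coverage_annotated_result(+1, {"tempest.api.network.test_networks",
                                         "tempest.api.network.test_ports"}))
    # The "missing" list shows exactly which parts of the reference suite a
    # plain "Success" would otherwise hide.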
>>> On the other hand, different CI infrastructures will have different levels in terms of % of patches tested and % of infrastructure failures. I think it might not be a terrible idea to use these parameters to evaluate how good a CI is from an infra standpoint. However, there are still open questions. For instance, a CI might have a low patch % score because it only needs to test patches affecting a given driver.
>>>
>>>> 2) There is no mediation or verification that the external CI system is actually testing anything at all
>>>>
>>>> As a community, we need to decide whether the current system of self-policing should continue. If it should, then language on reports like [3] should be very clear that any numbers derived from such systems should be taken with a grain of salt. Use of the word "Success" should be avoided, as it has connotations (in English, at least) that the result has been verified, which is simply not the case as long as no verification or mediation occurs for any external CI system.
>>>
>>>> 3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what "success" is
>>>>
>>>> I think we can all agree that a test has three possible outcomes: pass, fail, and skip. The results of a test suite run are therefore nothing more than the aggregation of which tests passed, which failed, and which were skipped.
>>>>
>>>> As a community, we must document, for each project, the expected set of tests that must be run for each patch merged into the project's source tree. This documentation should be discoverable so that reports like [3] can be crystal-clear on what the data shown actually means. The report is simply displaying the data it receives from Gerrit. The community needs to be proactive in saying "this is what is expected to be tested." This alone would allow the report to give information such as "External CI system ABC performed the expected tests. X tests passed. Y tests failed. Z tests were skipped." Likewise, it would also make it possible for the report to give information such as "External CI system DEF did not perform the expected tests.", which is excellent information in and of itself.
>>>
>>> Agreed. In Neutron we have enforced CIs but not yet agreed on the minimum set of tests we expect them to run. I reckon this will be fixed soon.
>>>
>>> I'll try to look at what "SUCCESS" is from a naive standpoint: a CI says "SUCCESS" if the test suite it ran passed; then one should have the means to understand whether a CI might blatantly lie or tell "half truths". For instance, saying it passes tempest.api.network while tempest.scenario.test_network_basic_ops has not been executed is a half truth, in my opinion.
>>> Stackalytics can help here, I think. One could create "CI classes" according to how close they are to the level of the upstream gate, and then parse the posted results to classify CIs. Now, before cursing me, I totally understand that this won't be easy at all to implement! Furthermore, I don't know how this should be reflected in gerrit.
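To make the "ABC performed the expected tests; X passed, Y failed, Z skipped" idea concrete, here is a small sketch. The expected-test list and the result format (test name mapped to "pass"/"fail"/"skip") are assumptions for illustration; a real system would more likely parse the subunit or testr output a CI publishes with its logs:

    # expected_tests_sketch.py -- summarize a CI run against a documented
    # expected-test list; data shapes here are illustrative, not a real format.
    EXPECTED = {
        "tempest.api.network.test_networks",
        "tempest.api.network.test_ports",
        "tempest.scenario.test_network_basic_ops",
    }

    def summarize(ci_name, results):
        """results: dict mapping test name to 'pass', 'fail' or 'skip'."""
        if not EXPECTED <= set(results):
            return "External CI system %s did not perform the expected tests." % ci_name
        passed = sum(1 for t in EXPECTED if results[t] == "pass")
        failed = sum(1 for t in EXPECTED if results[t] == "fail")
        skipped = sum(1 for t in EXPECTED if results[t] == "skip")
        return ("External CI system %s performed the expected tests. "
                "%d tests passed. %d tests failed. %d tests were skipped."
                % (ci_name, passed, failed, skipped))

    print(summarize("ABC", {
        "tempest.api.network.test_networks": "pass",
        "tempest.api.network.test_ports": "pass",
        "tempest.scenario.test_network_basic_ops": "skip",
    }))
    print(summarize("DEF", {"tempest.api.network.test_networks": "pass"}))

The "CI classes" idea could then be little more than buckets on how much of the upstream gate's test list a system covers, which the same comparison already yields.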
>>>
>>>> ===
>>>>
>>>> In thinking about the likely answers to the above questions, I believe it would be prudent to change the Stackalytics report in question [3] in the following ways:
>>>>
>>>> a. Change the "Success %" column header to "% Reported +1 Votes"
>>>> b. Change the phrase "Green cell - tests ran successfully, red cell - tests failed" to "Green cell - System voted +1, red cell - System voted -1"
>>>
>>> That makes sense to me.
>>>
>>>> and then, when we have more and better data (for example, # tests passed, failed, skipped, etc.), we can provide more detailed information than just "reported +1" or not.
>>>
>>> I think it should not be too hard to start adding minimal measures such as "% of voted patches".
>>>
>>>> Thoughts?
>>>>
>>>> Best,
>>>> -jay
>>>>
>>>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>>>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>>>> [3] http://stackalytics.com/report/ci/neutron/7
>>>
>> Thanks for sharing your thoughts, Salvatore.
>>
>> Some additional things to look at:
>>
>> Sean Dague has created a tool in stackforge, gerrit-dash-creator:
>> http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst
>> which has the ability to make interesting queries on gerrit results. One such example can be found here: http://paste.openstack.org/show/85416/
>> (Note: when this URL was created there was a bug in the syntax, so it works in Chrome but not Firefox. Sean tells me the Firefox bug has been addressed, though this URL hasn't been updated to the new syntax yet.)
>>
>> This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration.
>>
>> Also worth looking at is Mikal Still's GUI for Neutron CI health:
>> http://www.rcbops.com/gerrit/reports/neutron-cireport.html
>> and Nova CI health:
>> http://www.rcbops.com/gerrit/reports/nova-cireport.html
>>
>> I don't know the details of how the graphs are calculated in these pages, but being able to view passed/failed/missed and compare them to Jenkins is an interesting approach and, I feel, has some merit.
>>
>> Thanks, I think we are getting some good information out in this thread, and I look forward to hearing more thoughts.
>>
>> Thank you,
>> Anita.
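On Jay's proposed relabelling above: "% Reported +1 Votes" and Salvatore's "% of voted patches" are both straightforward to compute once you know, per candidate patch set, whether the CI voted and how. A sketch, again with an assumed pre-collected vote list rather than any real Gerrit or Stackalytics API:

    # vote_metrics_sketch.py -- "% reported +1 votes" and "% of voted patches"
    # from per-patch-set vote records; the input format is illustrative only.
    def vote_metrics(ci_votes):
        """ci_votes: list of +1, -1 or None (None = CI never reported)."""
        if not ci_votes:
            return {"pct_voted": 0.0, "pct_reported_plus_one": 0.0}
        voted = [v for v in ci_votes if v is not None]
        pct_voted = 100.0 * len(voted) / len(ci_votes)
        pct_plus_one = 100.0 * voted.count(1) / len(voted) if voted else 0.0
        return {"pct_voted": round(pct_voted, 1),
                "pct_reported_plus_one": round(pct_plus_one, 1)}

    print(vote_metrics([1, 1, -1, None, 1]))
    # {'pct_voted': 80.0, 'pct_reported_plus_one': 75.0}

Neither number says anything about what was actually tested, which is exactly why the column label matters: "reported +1" is a claim made by the system, not a verified result.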
