Oops... I forgot to mention that, in agreement with sdague, we won't enable this job before Thursday, June 26th anyway, in order to give the trusty update a few days to settle down.
Salvatore

On 24 June 2014 14:14, Salvatore Orlando <sorla...@nicira.com> wrote:
> There is a long-standing patch [1] for enabling the neutron full job.
> Shortly before the Icehouse release date, when we first pushed this, the
> neutron full job had a failure rate of less than 10%. However, since time
> has passed and perceived failure rates were higher, we ran this analysis
> again.
>
> Here are the findings in a nutshell.
> 1) If we were to enable the job today, we might expect about a 3-fold
> increase in neutron job failures compared with the smoke test. This is
> unfortunately not acceptable, and we therefore need to identify and fix
> the issues causing the additional failure rate.
> 2) However, this also puts us in a position where, if we wait until the
> failure rate drops under a given threshold, we might end up chasing a
> moving target, as new issues might be introduced at any time since the
> job is not voting.
> 3) When it comes to evaluating failure rates for a non-voting job, taking
> the raw numbers does not mean anything, as that will take into account
> patches 'in progress' which end up failing the tests because of problems
> in the patches themselves.
>
> Well, that was pretty much a lot for a "nutshell"; however, if you're not
> yet bored to death, please go on reading.
>
> The data in this post are a bit skewed because of a rise in neutron job
> failures in the past 36 hours. However, this rise affects both the full
> and the smoke job, so it does not invalidate what we say here. The
> results shown below are representative of the gate status 12 hours ago.
>
> - Neutron smoke job failure rates (all queues):
>   24 hours: 22.4%   48 hours: 19.3%   7 days: 8.96%
> - Neutron smoke job failure rates (gate queue only):
>   24 hours: 10.41%   48 hours: 10.20%   7 days: 3.53%
> - Neutron full job failure rate (check queue only, as it's non-voting):
>   24 hours: 31.54%   48 hours: 28.87%   7 days: 25.73%
>
> Check/Gate ratio between neutron smoke failures:
>   24 hours: 2.15   48 hours: 1.89   7 days: 2.53
>
> Estimated job failure rate for the neutron full job if it were to run in
> the gate:
>   24 hours: 14.67%   48 hours: 15.27%   7 days: 10.16%
>
> The numbers are therefore not terrible, but definitely not good enough;
> looking at the last 7 days, the full job would have a failure rate about
> 3 times higher than the smoke job.
>
> We then took, as is usual for us when we do this kind of evaluation, a
> window with a reasonable number of failures (41 in our case) and analysed
> them in detail.
>
> Of these 41 failures, 17 were excluded because of infra problems, patches
> 'in progress', or other transient failures; considering that over the
> same period of time 160 full job runs succeeded, this would leave us with
> 24 failures out of 184 runs, and therefore a failure rate of 13.04%,
> which is not far from the estimate.
>
> Let's now consider these 24 'real' failures:
> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
> runs). This specific failure is being analyzed to see if a specific
> fingerprint can be found.
> B) 2 (8.33% of failures, 1.08% of total full job runs) were for a failure
> in test load balancer basic, which is actually a test design issue and is
> already being addressed [2].
> C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue
> while resizing a server, which has already been spotted and has a bug in
> progress [3].
> D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
> failure in test_server_address; however, the actual root cause was being
> masked by [4].
> A bug has been filed [5]; this is the most worrying one in my opinion,
> as there are many cases where the fault happens but does not trigger a
> failure because of the way tempest tests are designed.
> E) 6 are because of our friend, the lock wait timeout. This was
> initially filed as [6], but we have since closed it to file more detailed
> bug reports, as the lock wait timeout can manifest in various places;
> Eugene is leading the effort on this problem with Kevin B.
>
>
> Summarizing, the only failure modes specific to the full job seem to be
> C & D. If we were able to fix those, we should reasonably expect a
> failure rate of about 6.5%. That's still almost twice that of the smoke
> job, but I deem it acceptable for two reasons:
> 1- by voting, we will prevent new bugs affecting the full job from being
> introduced. It is worth reminding people that any bug affecting the full
> job is likely to affect production environments.
> 2- patches failing in the gate will spur neutron developers to quickly
> find a fix. Patches failing a non-voting job will cause some neutron core
> team members to write long and boring posts to the mailing list.
>
> Salvatore
>
>
> [1] https://review.openstack.org/#/c/88289/
> [2] https://review.openstack.org/#/c/98065/
> [3] https://bugs.launchpad.net/nova/+bug/1329546
> [4] https://bugs.launchpad.net/tempest/+bug/1332414
> [5] https://bugs.launchpad.net/nova/+bug/1333654
> [6] https://bugs.launchpad.net/nova/+bug/1283522
>
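For anyone who wants to re-derive the figures quoted above, here is a minimal sketch in Python. It assumes that the gate estimate is the full job's check-queue failure rate divided by the smoke job's all-queues/gate ratio, which is what the quoted numbers appear to imply but is not spelled out in the post. The function names are purely illustrative. Because the sketch uses the unrounded ratio, its estimates can differ from the quoted ones by a few hundredths of a percentage point.

```python
# Sketch of the arithmetic in the quoted analysis. Function names are
# illustrative only; inputs are the percentages and counts quoted above.


def check_gate_ratio(smoke_all_pct, smoke_gate_pct):
    """Ratio of smoke failure rate over all queues to the gate-only rate."""
    return smoke_all_pct / smoke_gate_pct


def estimated_gate_rate(full_check_pct, smoke_all_pct, smoke_gate_pct):
    """Scale the full job's check-queue failure rate by the smoke ratio.

    Assumption: the full job would show the same check/gate ratio as the
    smoke job if it were voting.
    """
    return full_check_pct / check_gate_ratio(smoke_all_pct, smoke_gate_pct)


def failure_rate(failures, successes):
    """Failure percentage over all runs (failures + successes)."""
    return 100.0 * failures / (failures + successes)


# (smoke all queues, smoke gate only, full check only), in percent
windows = {
    "24 hours": (22.4, 10.41, 31.54),
    "48 hours": (19.3, 10.20, 28.87),
    "7 days": (8.96, 3.53, 25.73),
}

for name, (smoke_all, smoke_gate, full_check) in windows.items():
    est = estimated_gate_rate(full_check, smoke_all, smoke_gate)
    print(f"{name}: estimated full job gate failure rate {est:.2f}%")

# Triage window: 41 failures, 17 of them excluded, 160 successful runs.
real_failures = 41 - 17
print(f"'real' failure rate: {failure_rate(real_failures, 160):.2f}%")

# Projection if C (7 failures) and D (5 failures) were fixed, assuming
# those 12 runs would then have succeeded.
remaining = real_failures - 7 - 5
projected = failure_rate(remaining, 160 + 7 + 5)
print(f"projected rate after fixing C and D: {projected:.2f}%")
```

The last step treats the runs that failed on C and D as successes once fixed, so the total stays at 184 runs; that is one plausible reading of how the "about 6.5%" figure was obtained.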
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev