Hi,

I spent some time last week tracking down CentOS kernel failures. They turned out to have been fixed in a recent update that had not been applied to some nodes because of build failures.
This prompted me to look a bit more closely at builds with [1]. The results are not great. We are seeing a lot of failures even in just the CentOS/Fedora builds I've been looking at [2]; on some days, most images fail to build.

I know there are things in motion here. jhesketh is looking at the git timeout issues, which are the major cause of problems (note especially that the Saturday and Sunday jobs go much better, presumably because things are under less load than at other times). I know there is a spec out for better testing of images before deployment, which is slightly related. I know there's a change out there for a full REST API in nodepool.

Anyway, to avoid more problems like this, I think what I should do now is expand this script to monitor more than just the CentOS/Fedora builds and send its output to the infra list. Having sentinels in the log files [4] would make this more reliable.

That way, we can quickly identify build issues without manually digging through log files, hopefully notice patterns of failure, and spread some of the load of checking on things.

-i

[1] https://github.com/ianw/nodechecker
[2] http://people.redhat.com/~iwienand/nodechecker-output/
[3] https://review.openstack.org/139598
[4] https://review.openstack.org/190889
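
P.S. To make the sentinel idea a bit more concrete, here is a rough sketch of how a checker could classify a build log once sentinels exist. This is not what nodechecker [1] does today; the sentinel strings, function names and image names below are all made up for illustration, and the real markers would be whatever [4] ends up writing to the logs.

```python
# Hypothetical sentinel strings; the real ones would come from
# whatever change [4] actually writes into the build logs.
SENTINEL_OK = "BUILD-SENTINEL: success"
SENTINEL_FAIL = "BUILD-SENTINEL: failure"


def classify_build_log(text):
    """Return 'success', 'failure' or 'unknown' for one build log.

    A failure sentinel anywhere in the log wins. A log with no
    sentinel at all is reported as 'unknown', which covers builds
    that died or timed out before writing one.
    """
    status = "unknown"
    for line in text.splitlines():
        if SENTINEL_FAIL in line:
            return "failure"
        if SENTINEL_OK in line:
            status = "success"
    return status


def summarize(logs):
    """Turn {image name: log text} into a plain-text report.

    One line per image, sorted by name, suitable for pasting
    into a mail to the list.
    """
    report = []
    for image in sorted(logs):
        report.append("%-20s %s" % (image, classify_build_log(logs[image])))
    return "\n".join(report)


if __name__ == "__main__":
    # Made-up log contents, just to show the shape of the report.
    example = {
        "centos": "cloning repos\nBUILD-SENTINEL: success\n",
        "fedora": "cloning repos\ngit timed out\n",
    }
    print(summarize(example))
```

The point of treating a missing sentinel as its own state is that a build that hangs on a git timeout looks different from one that fails cleanly, which should help when looking for patterns.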
