Good summary. I am in line with the observations and support the proposals to improve reliability and trustworthiness.
From an integration perspective, I also missed something when our daily Master clearly showed a regression at the beginning of the year. I should have raised the flag more quickly, but it is hard to deal with versions N-1 and N while working on the requirements of N+1, if not N+2 :)
That is also why, for me, gating is the only way to consolidate the solution.

As part of the improvements on the integration side, I would also suggest:

- Add gating on the tests! Any test change (robot/secu/pythonsdk) would be tested before being merged, as we do for SO or CLAMP today. It would avoid regressions like the one we had for 3-4 days due to the onapsdk upstream upgrade at the beginning of February. It is not a top priority, as we usually detect such regressions relatively quickly :)
- Share the CI/CD test evolution with the PTLs before M3, as I do with the Integration team. I usually indicate the current CI/CD chains and what we plan to add/deprecate in the list of available automated tests used in the different chains (gating/daily/weekly), i.e. roughly the automated test criteria for the release. The PTLs could challenge the choice of tests and/or suggest additional ones to provide better coverage of the features.

Happy Easter!

/Morgan

________________________________
From: [email protected] [[email protected]] on behalf of Sylvain Desbureaux via lists.onap.org [[email protected]]
Sent: Friday, 2 April 2021 14:00
To: [email protected]
Cc: onap-tsc
Subject: [Onap-release] [OOM] Post mortem of passing Honolulu RC0 and consequences

Hello,

The M3/RC0/RC1 path has been long and painful this time, a bit like what we had for the Guilin release.

From my point of view, I see a conjunction of several issues:

* gating that was not used properly by the OOM team
* random issues on important components (and the gate system) that blur result interpretation
* late submission of code for some projects

tl;dr:
- Please release more often. It's easier for everybody to detect regressions.
- The gate must be more reliable.
  The OOM team will accept patches from core components only if the patch significantly improves the gate results.
- OOM rules for accepting patches will remain as strict as for RC0:
  * all the pods of your component must be UP (actually, all pods should be up, but unfortunately some components fail regularly)
  * all healthchecks / e2e tests dealing with your component must pass
- We're working on improving the reliability of the gate system itself. Big improvements should arrive in a couple of weeks.

So, as OOM PTL, and in accordance with the OOM committers, I propose to take these actions:

1. Ask teams to release more often

Some teams deliver 1 version bump at M3 and maybe one more after. Some others have delivered more than 20 (twenty!) version bumps! Issues with these versions are obviously seen a lot faster, and "losing" one patch in one release is not as important.
So I (re)ask PTLs to try to propose a version bump at least once per month so that new issues can be discovered as soon as possible.

2. Ask for more reliable behavior on core components

Today, we know that having a "green" gate is very difficult (the occurrence of an issue on some components tends to be ~1 per gate, so 1 per every 13 tests).
JIRA tickets have been created by the integration team.
From now on, the OOM team will accept patches from the concerned components only if the patch significantly improves the gate results. If not, the patch will be postponed until the gate behavior is better.
My personal goal would be to have a failure rate of 1 error per 5 gates (on all tests, meaning 1 error every ~70 tests).
I'd be glad to help component teams nail down the issues (I'm a good log digger).

3. Be stricter on gating for the OOM team

As said, we (the OOM committers) have decided to be stricter when accepting patches:

* all the pods of your component must be UP (actually, all pods should be up, but unfortunately some components fail regularly)
* all healthchecks / e2e tests dealing with your component must pass

As a consequence, "core" components (AAI, AAF, DMAAP, SDC, SDNC, SO, VID) that are used in all (most of?) healthcheck tests / end-to-end tests must have a "green" gate (100% healthcheck (except full), 100% end-to-end tests).

4. Improve gate system reliability in order to have fewer "false" negatives

Today, the gate system relies on a lot of components:

* ONAP Gerrit
* ONAP Nexus
* Kubernetes cluster on Azure
* mqtt
* Python microservices
* Gitlab CI
* Gitlab registry
* Gitlab runners
* Ansible Galaxy repository
* Orange OpenStack
* Azure Kubernetes Service

The worst components during the last weeks were:

* Ansible Galaxy (several 500 errors per day)
* Gitlab runners (the shared version had a lot of issues during the last month)
* ONAP Gerrit (but it's a different order of magnitude compared to the others)
* Azure Kubernetes Service (it seems that we have too many PVCs compared to the number of Kubernetes nodes we're using)

On the other hand, some components seem to be very reliable:

* Kubernetes cluster, mqtt, Python microservices and Gitlab registry were almost 100% available
* ONAP Nexus has issues with deleting integration pods but apart from that works OK
* Orange OpenStack had some storage issues which led to failures to deploy daily instances. I think I've nailed the issue and hope it won't happen again

In order to be more reliable, here's the plan:

- [X] Move from shared runners to dedicated ones on Azure (same cluster, no added cost). I did it 2 weeks ago and it seems to have improved things.
- [ ] Use a more tolerant Python program when dealing with Gitlab. Code is available and tested on some "Orange" chains. It should be in use in the coming weeks.
- [ ] Get rid of the Ansible Galaxy dependency. Gating / daily / weekly / staging deployments all use Ansible, and we're using a "blank" container in which we need to install some "collections" (collections of Ansible modules). Instead of doing that every time, I'm changing the way it's done so that we "install" once (by building a Docker image) and use it many times without needing to access Ansible Galaxy. Development is ongoing and I hope it will be deployed at the same time as the previous item, in the coming weeks.

I know this is asking a lot of some components, but it's really to improve our way of working, and in the end it will help everybody.

Regards,
Sylvain

_________________________________________________________________________________________________________________________

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#7694): https://lists.onap.org/g/onap-tsc/message/7694
-=-=-=-=-=-=-=-=-=-=-=-
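
To put rough numbers on the failure-rate figures quoted in action 2 of the original message, here is a minimal sketch. It assumes 13 tests per gate (taken from the "1 per every 13 tests" figure) and that test failures are independent of each other, which is a simplification. The function names are illustrative only and do not come from any ONAP tooling.

```python
# Assumption: 13 tests per gate, derived from "1 per gate, so 1 per every 13 tests".
TESTS_PER_GATE = 13


def per_test_failure_rate(failures: int, gates: int,
                          tests_per_gate: int = TESTS_PER_GATE) -> float:
    """Average failure probability of a single test, given `failures`
    observed over `gates` full gate runs."""
    return failures / (gates * tests_per_gate)


def green_gate_probability(p_fail: float,
                           tests_per_gate: int = TESTS_PER_GATE) -> float:
    """Chance that every test in one gate passes, assuming each test
    fails independently with probability `p_fail` (a simplification)."""
    return (1.0 - p_fail) ** tests_per_gate


if __name__ == "__main__":
    current = per_test_failure_rate(failures=1, gates=1)  # ~1 error per gate
    target = per_test_failure_rate(failures=1, gates=5)   # goal: 1 error per 5 gates
    for label, rate in (("current", current), ("target", target)):
        print(f"{label}: per-test failure rate {rate:.3f}, "
              f"green gate probability {green_gate_probability(rate):.2f}")
```

Under these assumptions, one error per gate leaves only about a 35% chance of a fully green gate, while one error per 5 gates (1 in 65 tests here, close to the "~70 tests" quoted) raises it to about 82%.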
