Hello,

The M3/RC0/RC1 path has been long and painful this time, a bit like it was for
the Guilin release.
From my point of view, I see a combination of several issues:
* gating that was not used properly by the OOM team
* random issues on important components (and on the gate system) that blur
  result interpretation
* late submission of code by some projects

tl;dr:

- please release more often: it makes regressions easier for everybody to detect
- the gate must become more reliable. The OOM team will accept patches from core
  components only if the patch significantly improves the gate results.
- OOM rules for accepting patches will remain as strict as for RC0:
    * all the pods of your component must be UP (ideally all pods should be up,
      but unfortunately some components fail regularly)
    * all healthchecks / e2e tests dealing with your component must pass
- We're working on improving the reliability of the gate system itself. Big
  improvements should arrive in a couple of weeks.


So, as OOM PTL, and in agreement with the OOM committers, I propose the
following actions:

1. Ask teams to release more often

Some teams deliver one version bump at M3 and maybe one more afterwards.
Others have submitted more than 20 (twenty!) version bumps!
Issues with those versions are obviously spotted a lot faster, and "losing" one
patch in one release matters much less.
So I (re)ask PTLs to try to propose a version bump at least once a month, so
new issues can be discovered as soon as possible.

2. Ask for more reliable behavior from core components

Today, we know that getting a "green" gate is very difficult (some components
tend to hit an issue about once per gate run, i.e. once every ~13 tests).
JIRA tickets have been created by the integration team.
From now on, the OOM team will accept patches from the components concerned
only if the patch significantly improves the gate results.
Otherwise, the patch will be postponed until the gate behavior gets better.
My personal goal would be a failure rate of 1 error per 5 gate runs (across all
tests, meaning 1 error every ~70 tests).
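In rough numbers, with ~13 tests per gate run as above: today roughly 1 test
in 13 (~7.7%) hits an issue, while 1 error per 5 runs means 1/(5*13) = 1/65,
i.e. ~1.5% of tests failing.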

I'd be glad to help component teams nail down these issues (I'm a good log
digger).

3. Be stricter on gating for OOM team

As said above, we (the OOM committers) have decided to be stricter when
accepting patches:

* all the pods of your component must be UP (ideally all pods should be up,
  but unfortunately some components fail regularly)
* all healthchecks / e2e tests dealing with your component must pass

As a consequence, "core" components (AAI, AAF, DMAAP, SDC, SDNC, SO, VID),
which are used in all (or most) healthcheck / end-to-end tests, must have a
"green" gate (100% of healthchecks (except full), 100% of end-to-end tests).


4. Improve gate system reliability in order to have fewer "false" negatives

Today, the gate system relies on a lot of components:

* ONAP Gerrit
* ONAP Nexus
* Kubernetes cluster on Azure
* MQTT
* Python microservices
* Gitlab CI
* Gitlab registry
* Gitlab runners
* Ansible Galaxy repository
* Orange OpenStack
* Azure Kubernetes Service

The worst components during the last weeks were:

* Ansible Galaxy (several 500 errors per day)
* Gitlab runners (the shared runners had a lot of issues during the last month)
* ONAP Gerrit (though at a different order of magnitude compared to the others)
* Azure Kubernetes Service (it seems we have too many PVCs for the number of
  Kubernetes nodes we're using; Azure caps the number of data disks that can be
  attached to each VM size, which limits disk-backed PVCs per node)

On the other hand, some components seem to be very reliable:

* the Kubernetes cluster, MQTT, the Python microservices and the Gitlab
  registry were almost 100% available
* ONAP Nexus has had issues when deleting integration pods, but apart from that
  it works OK
* Orange OpenStack had a storage issue which led to failures when deploying the
  daily instances. I think I've nailed the issue and hope it won't happen again

To make the gate more reliable, here's the plan:

- [X] Move from shared runners to dedicated ones on Azure (same cluster, no
  added cost). I did this 2 weeks ago and it seems to have improved things.
- [ ] Use a more tolerant Python client when talking to Gitlab (see the retry
  sketch after this list). The code is available and tested on some "Orange"
  chains, and should be in use in the next weeks.
- [ ] Get rid of the Ansible Galaxy dependency. Gating / daily / weekly /
  staging deployments use Ansible, and today we use a "blank" container where
  we have to install some "collections" (bundles of Ansible modules) on every
  run. Instead, I'm changing the setup so we install once (by building a
  Docker image, sketched after this list) and reuse it many times without
  needing to reach Ansible Galaxy. Development is ongoing and I hope it will
  be deployed at the same time as the previous item, in the next weeks.
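For the second item, here is a minimal sketch of the "more tolerant" idea,
assuming plain calls to the Gitlab REST API via requests (the retry counts,
delays and URL are illustrative, not the actual code used on the chains):

  # Retry transient Gitlab errors with exponential backoff instead of
  # failing the whole chain on the first 500.
  import time
  import requests

  TRANSIENT = {429, 500, 502, 503, 504}

  def get_with_retry(url, token, attempts=5, base_delay=2.0):
      for attempt in range(1, attempts + 1):
          try:
              resp = requests.get(url, headers={"PRIVATE-TOKEN": token},
                                  timeout=30)
              if resp.status_code not in TRANSIENT:
                  resp.raise_for_status()  # real 4xx errors still fail fast
                  return resp.json()
          except requests.ConnectionError:
              pass  # network blip: treat as transient
          if attempt < attempts:
              time.sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s...
      raise RuntimeError(f"{url} still failing after {attempts} attempts")

  # e.g. get_with_retry("https://gitlab.com/api/v4/projects/<id>/pipelines",
  #                     token)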
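For the third item, a minimal Dockerfile sketch of the "install once, use many
times" idea (base image, Ansible version and collection names are assumptions,
not the real gating dependencies):

  # Bake the Ansible collections into the image once, so CI jobs never
  # call out to galaxy.ansible.com at run time.
  FROM python:3.9-slim
  RUN pip install --no-cache-dir "ansible-core==2.12.*"
  # Pre-install the collections the playbooks need (names illustrative).
  RUN ansible-galaxy collection install kubernetes.core community.general
  COPY playbooks/ /opt/playbooks/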

I know this is asking a lot of some components, but it's really meant to
improve our way of working, and in the end it will help everybody.

Regards,
Sylvain
