Hi

We re-enabled gating today (on master).
Apparently there is a problem in OOM: we systematically get the same error (see 
Sylvain's mail).
Something bad has probably been merged.

Sylvain has probably found the root cause of our errors.
We got lots of error/warning messages from etcd complaining that it was too slow.
The installation now requires SSD disks for the etcd VMs (previously, no 
constraint was specified).
Since this change was applied we no longer get these error messages, and the 
deployment of ONAP seems faster (only dmaap and so still take a little long) 
and, in the end, more deterministic.
It is probably our migration to 3 etcd VMs that caused the issues on our 
different chains.
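
For illustration, a minimal sketch of how etcd logs could be scanned for this 
kind of slowness warning. The marker strings are assumptions (the exact wording 
depends on the etcd version and the messages were not captured in this mail), 
and `count_slow_warnings` is a hypothetical helper name.

```python
from collections import Counter

# Assumed substrings of etcd slowness warnings; adjust to the etcd version in use.
SLOW_MARKERS = (
    "took too long",
    "failed to send out heartbeat on time",
    "sync duration",
)


def count_slow_warnings(log_lines, markers=SLOW_MARKERS):
    """Count how many log lines contain each marker (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for marker in markers:
            if marker in lowered:
                counts[marker] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the actual deployment.
    sample = [
        "W | etcdserver: read-only range request took too long to execute",
        "W | wal: sync duration of 2.1s, expected less than 1s",
        "I | etcdserver: published member info",
    ]
    for marker, n in count_slow_warnings(sample).items():
        print(f"{marker}: {n}")
```

Comparing these counts before and after moving the etcd VMs to SSD would be one 
simple way to confirm that the warnings have really disappeared.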

We tested on our Daily master and on the Openlab (community lab based on 
Dublin).
I sent a mail to our active Openlab users; they should be able to connect and 
try Dublin.

As discussed, we will be off next week and back in August.
We have disabled our daily chains during our PTO; we are only keeping the Openlab and gating.

/Morgan
 
________________________________________
From: Lefevre, Catherine [[email protected]]
Sent: Thursday, July 11, 2019 12:00
To: RICHOMME Morgan TGI/OLN; FREEMAN, BRIAN D; [email protected]; 
[email protected]; [email protected]; [email protected]
Cc: DEBEAU Eric TGI/OLN; DESBUREAUX Sylvain TGI/OLN
Subject: RE: [ONAP][Gating] Gating chains momentarily disabled

Thank you, Morgan, for the update.
I will keep today's sync-up call so you can share your latest updates.

KR
Catherine

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Thursday, July 11, 2019 11:18 AM
To: FREEMAN, BRIAN D <[email protected]>; [email protected]; Lefevre, 
Catherine <[email protected]>; [email protected]; 
[email protected]; [email protected]
Cc: DEBEAU Eric TGI/OLN <[email protected]>; DESBUREAUX Sylvain TGI/OLN 
<[email protected]>
Subject: [ONAP][Gating] Gating chains momentarily disabled

Hi,

We have temporarily stopped the gating chain on the Orange Openlab.
We have been seeing strange behavior for several days.
Initially we believed it could be due to ONAP, but the latest investigations seem to 
indicate that it is a Kubernetes issue.

Our gating chains use a configuration of 3 controllers + 12 compute nodes.
For reasons we do not yet know, some compute nodes (usually
1 or 2 out of the 12) are sometimes badly configured: the IPVS rules (similar to 
iptables, used to manage routing within the k8s cluster) are incomplete.

Apparently it is not due to the CNI (we tried with weave and flannel).

As a consequence, some ONAP components cannot contact other ONAP components if 
they are on a wrongly configured compute node.
Deleting the pod may fix the problem if, by chance, the pod is 
rescheduled on a healthy node.
This could explain the intermittent problems we reported, and why the issues 
were hard to reproduce.

For gating and our daily chains, the consequence is an abnormal number of 
failed pods: the pods should be OK but, as they cannot contact the pods they 
depend on, their init fails.

We need to understand the root cause of the problem and see what changed over 
the last few days. By default we run a simple healthcheck test suite after the 
Kubernetes installation; it is probably not enough.
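
A minimal sketch of what a stricter post-install check could look like: compare 
the IPVS virtual services seen on each node and flag nodes that lack entries 
present elsewhere. It assumes the output of `ipvsadm -Ln` has already been 
collected per node (collection is not shown); `parse_virtual_services` and 
`find_incomplete_nodes` are hypothetical helper names and the sample data is 
made up.

```python
def parse_virtual_services(ipvsadm_output):
    """Extract (protocol, address:port) pairs from `ipvsadm -Ln` text.

    Assumption: virtual service lines start with the protocol (TCP or UDP)
    followed by address:port, while real-server lines start with "->".
    """
    services = set()
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            services.add((fields[0], fields[1]))
    return services


def find_incomplete_nodes(services_by_node):
    """Return {node: missing services} for nodes lacking entries seen on other nodes."""
    all_services = set().union(*services_by_node.values())
    return {
        node: all_services - services
        for node, services in services_by_node.items()
        if all_services - services
    }


if __name__ == "__main__":
    # Hypothetical per-node outputs, for illustration only.
    outputs = {
        "compute-1": "TCP  10.96.0.1:443 rr\nUDP  10.96.0.10:53 rr\n",
        "compute-2": "TCP  10.96.0.1:443 rr\n",
    }
    parsed = {node: parse_virtual_services(text) for node, text in outputs.items()}
    for node, missing in find_incomplete_nodes(parsed).items():
        print(f"{node} is missing {sorted(missing)}")
```

Virtual services bound to node-local addresses legitimately differ from one 
node to another, so they would probably need to be filtered out before the 
comparison to avoid false positives.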

Meanwhile, to avoid any misleading reporting on gating, we have disabled the 
listener on Gerrit.

Sorry for the inconvenience

Morgan & Sylvain

_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites 
ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez 
le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les 
messages electroniques etant susceptibles d'alteration, Orange decline toute 
responsabilite si ce message a ete altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law; they should not be distributed, used 
or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.




