Hi all,

We have a Dublin environment in which a Cassandra problem caused the graphadmin-create-db-schema job to fail repeatedly.
Question 1: Is there an ONAP built-in monitoring solution for the shared Cassandra database?

The Cassandra problem was a combination of:
- node-0 hit an out-of-memory error in the JVM
- node-1 had an error in commit-log processing
- node-2 ran out of disk space on its filesystem

However, Kubernetes showed node-0 and node-2 as still "Running" and apparently healthy (zero restarts), with only node-1 in CrashLoopBackOff (many restarts). The failed Cassandra nodes caused the graphadmin-create-db-schema job to fail (many restarts), which in turn left all the other AAI pods waiting in Init state. The Cassandra problems have since been fixed manually, but should ONAP ship a built-in monitoring solution that can detect this kind of failure?

Question 2: How can we re-run that Kubernetes job?

Deleting the failed job pods cleaned up the pod list but did not cause any new job pods to be created. Deleting the graphadmin pod caused it to restart and wait for the job to complete, but did not trigger a new job run either. I believe the job has now hit its backoff limit, so it no longer runs, even though the Cassandra problem has since been fixed. Can we reset some parameter so that the job finally runs to completion? Is there an appropriate helm or kubectl command to re-run it? The rest of the AAI pods are, of course, waiting for that job to complete before they can progress out of Init state.

Thanks,
Keong

-=-=-=-=-=-=-=-=-=-=-=-
View/Reply Online (#18553): https://lists.onap.org/g/onap-discuss/message/18553
-=-=-=-=-=-=-=-=-=-=-=-
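For context on Question 1, the manual checks that finally surfaced the three Cassandra failures looked roughly like the following. This is only a sketch: the `onap` namespace, the `onap-cassandra-N` pod names, and the data-directory path are assumptions from my environment, so adjust them to yours.

```shell
#!/bin/sh
# Sketch: manually probe each Cassandra pod, since "kubectl get pods" alone
# showed node-0 and node-2 as Running/healthy despite their failures.
# Names below (namespace "onap", pods "onap-cassandra-N") are assumptions.
for i in 0 1 2; do
  echo "=== onap-cassandra-$i ==="
  # nodetool reports the actual ring state (UN/DN) regardless of pod status
  kubectl -n onap exec "onap-cassandra-$i" -- nodetool status
  # check for disk-space exhaustion on the data filesystem (node-2's problem)
  kubectl -n onap exec "onap-cassandra-$i" -- df -h /var/lib/cassandra
  # scan recent logs for JVM OOM or commit-log errors (node-0 and node-1)
  kubectl -n onap logs --tail=200 "onap-cassandra-$i" | grep -iE 'OutOfMemory|commit ?log' || true
done
```

Something like this could presumably be wired into a liveness/readiness probe so Kubernetes would actually flag the unhealthy nodes, which is really what Question 1 is asking about.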
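For reference on Question 2, the workaround I am considering is to delete the completed/failed Job object and recreate it, since a Job that has hit its backoffLimit will never start new pods on its own. This is a hedged sketch, not a confirmed procedure: the exact job name `onap-aai-graphadmin-create-db-schema` and the `onap` namespace/release are assumptions from my environment.

```shell
#!/bin/sh
# Sketch: re-run a Kubernetes Job that has exhausted its backoffLimit.
# Job name and namespace are assumptions; check yours with "kubectl get jobs".

# 1. Save the job spec before deleting it.
kubectl -n onap get job onap-aai-graphadmin-create-db-schema -o yaml > schema-job.yaml

# 2. Delete the job (this also removes its failed pods).
kubectl -n onap delete job onap-aai-graphadmin-create-db-schema

# 3. Recreate it. Note: the saved yaml contains server-populated fields
#    (status, and the controller-uid in selector/labels) that must be
#    stripped first, or the create may be rejected.
kubectl -n onap create -f schema-job.yaml
```

An alternative, since the job comes from the AAI Helm chart, might be re-deploying the chart (e.g. a `helm upgrade` or `helm delete`/`helm install` of the AAI release) so Helm recreates the job, but I have not confirmed which approach OOM recommends.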
