Hi all,

We have a Dublin environment in which a Cassandra problem caused the graphadmin-create-db-schema job to fail repeatedly.
Question 1: Is there an ONAP built-in monitoring solution for the shared Cassandra database?

The Cassandra problem was a combination of:
- node-0 hit an out-of-memory error in the JVM
- node-1 had an error in commit-log processing
- node-2 ran out of disk space on its filesystem

However, Kubernetes showed node-0 and node-2 as still "Running" and apparently healthy (zero restarts), with only node-1 in CrashLoopBackOff (many restarts). The failed Cassandra nodes caused the graphadmin-create-db-schema job to fail (many restarts), which in turn left all the other AAI pods waiting in Init state. The Cassandra problems have since been fixed manually, but should ONAP ship a built-in monitoring solution that can detect this kind of failure?

Question 2: How can we re-run that Kubernetes job?

Deleting the failed job pods cleaned up the pod list but did not cause any new job pods to be created. Deleting the graphadmin pod caused it to restart and wait for the job to complete, but did not trigger a new job run either. I believe the job has now hit its backoff limit, so it no longer runs, even though the Cassandra problem has since been fixed. Can we reset some parameter so that the job finally runs to completion? Is there an appropriate helm or kubectl command to re-run it? The rest of the AAI pods are, of course, waiting for that job to complete before they can progress out of Init state.

Thanks,
Keong

-=-=-=-=-=-=-=-=-=-=-=-
View/Reply Online (#18553): https://lists.onap.org/g/onap-discuss/message/18553
-=-=-=-=-=-=-=-=-=-=-=-
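For context on Question 1, the manual checks that finally surfaced the three Cassandra failures looked roughly like the following. This is only a sketch: the `onap` namespace, the `onap-cassandra-N` pod names, and the data-directory path are assumptions from my environment, so adjust them to yours.

```shell
#!/bin/sh
# Sketch: manually probe each Cassandra pod, since "kubectl get pods" alone
# showed node-0 and node-2 as Running/healthy despite their failures.
# Names below (namespace "onap", pods "onap-cassandra-N") are assumptions.
for i in 0 1 2; do
  echo "=== onap-cassandra-$i ==="
  # nodetool reports the actual ring state (UN/DN) regardless of pod status
  kubectl -n onap exec "onap-cassandra-$i" -- nodetool status
  # check for disk-space exhaustion on the data filesystem (node-2's problem)
  kubectl -n onap exec "onap-cassandra-$i" -- df -h /var/lib/cassandra
  # scan recent logs for JVM OOM or commit-log errors (node-0 and node-1)
  kubectl -n onap logs --tail=200 "onap-cassandra-$i" | grep -iE 'OutOfMemory|commit ?log' || true
done
```

Something like this could presumably be wired into a liveness/readiness probe so Kubernetes would actually flag the unhealthy nodes, which is really what Question 1 is asking about.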
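For reference on Question 2, the workaround I am considering is to delete the completed/failed Job object and recreate it, since a Job that has hit its backoffLimit will never start new pods on its own. This is a hedged sketch, not a confirmed procedure: the exact job name `onap-aai-graphadmin-create-db-schema` and the `onap` namespace/release are assumptions from my environment.

```shell
#!/bin/sh
# Sketch: re-run a Kubernetes Job that has exhausted its backoffLimit.
# Job name and namespace are assumptions; check yours with "kubectl get jobs".

# 1. Save the job spec before deleting it.
kubectl -n onap get job onap-aai-graphadmin-create-db-schema -o yaml > schema-job.yaml

# 2. Delete the job (this also removes its failed pods).
kubectl -n onap delete job onap-aai-graphadmin-create-db-schema

# 3. Recreate it. Note: the saved yaml contains server-populated fields
#    (status, and the controller-uid in selector/labels) that must be
#    stripped first, or the create may be rejected.
kubectl -n onap create -f schema-job.yaml
```

An alternative, since the job comes from the AAI Helm chart, might be re-deploying the chart (e.g. a `helm upgrade` or `helm delete`/`helm install` of the AAI release) so Helm recreates the job, but I have not confirmed which approach OOM recommends.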
