[GitHub] [ozone] adoroszlai opened a new pull request, #5028: HDDS-7645. Kubernetes check should fail fast if cluster cannot start

via GitHub Thu, 06 Jul 2023 00:47:51 -0700


adoroszlai opened a new pull request, #5028:
URL: https://github.com/apache/ozone/pull/5028


   ## What changes were proposed in this pull request?
   
   The Kubernetes check currently proceeds to execute tests even if the cluster is unable to start up.  It should exit without trying to run the tests.
   
   ```
   **** Waiting until the k8s cluster is running ****
   
   ...
   4 pods are running out from the 5
   100 'all_pods_are_running' is failed...
   
   **** Executing robot tests scm-0 ****
   
   Defaulted container "scm" out of: scm, init (init)
   Unable to use a TTY - input is not a terminal or the right kind of file
   error: unable to upgrade connection: container not found ("scm")
   ```
   
   This change skips the tests if the cluster fails to start up.
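
The intended control flow can be sketched roughly as below. This is only an illustration of the fail-fast idea, not the code in this patch: `start_cluster`, `run_tests`, `collect_logs`, `stop_cluster`, `run_check` and the `START_RC` variable are hypothetical stand-ins for whatever the test scripts actually call.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real scripts apply k8s resources, run robot
# tests, and so on. START_RC lets us simulate a startup failure.
start_cluster() { echo "starting cluster"; return "${START_RC:-0}"; }
run_tests()     { echo "running tests"; }
collect_logs()  { echo "collecting logs"; }
stop_cluster()  { echo "stopping cluster"; }

run_check() {
  local rc=0
  if start_cluster; then
    run_tests || rc=$?   # remember a test failure, but keep going
  else
    rc=1                 # startup failed: do not attempt the tests
  fi
  collect_logs           # logs are gathered in both cases
  stop_cluster           # resources are cleaned up in both cases
  return "$rc"
}
```

Calling `START_RC=1 run_check` in this sketch skips `run_tests`, still collects logs and stops the cluster, and returns a non-zero status.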
   
   Also:
    * Fix the `-1 pods are running` message (caused by a hard-coded subtraction intended to account for the header row of `kubectl get pod` output)
    * Allow custom number of retry attempts (for easier testing)
    * Reduce code duplication in test scripts
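
A minimal sketch of the first two items, under stated assumptions: the helper names `retry` and `count_running`, the `RETRY_SLEEP` variable, and the default of 100 attempts (inferred from the `100 'all_pods_are_running' is failed...` line above) are illustrative and may differ from the actual scripts. Only `RETRY_ATTEMPTS` and the message format come from this pull request.

```shell
#!/usr/bin/env bash
# Attempts are overridable from the environment; 100 is an assumed default.
: "${RETRY_ATTEMPTS:=100}"
: "${RETRY_SLEEP:=3}"      # hypothetical pause between attempts, in seconds

# Run the given command until it succeeds or attempts are exhausted.
retry() {
  local n=1
  until "$@"; do
    echo "$n '$*' is failed..."
    if (( n >= RETRY_ATTEMPTS )); then
      return 1
    fi
    n=$(( n + 1 ))
    sleep "$RETRY_SLEEP"
  done
  return 0
}

# Count Running pods from `kubectl get pod --no-headers` output on stdin.
# With the header row suppressed there is nothing to subtract, so empty
# input yields 0 rather than -1. `|| true` hides grep's exit code 1 on
# zero matches.
count_running() {
  grep -c Running || true
}
```

Intended usage would be along the lines of `kubectl get pod --no-headers | count_running`, with the startup checks wrapped as `retry some_check`.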
   
   https://issues.apache.org/jira/browse/HDDS-7645
   
   ## How was this patch tested?
   
   Triggered a cluster startup "error" by setting a low number of retry attempts.  Verified that tests are not attempted, logs are collected, and the cluster is shut down:
   
   ```
   $ RETRY_ATTEMPTS=5 OZONE_TEST_SELECTOR=getting-started ./hadoop-ozone/dev-support/checks/kubernetes.sh
   ...
   
   **** Applying k8s resources from getting-started ****
   
   ...
   
   **** Waiting until the k8s cluster is running ****
   
   No resources found in default namespace.
   0 pods are running. Waiting for more.
   1 'all_pods_are_running' is failed...
   3 pods are running out from the 5
   2 'all_pods_are_running' is failed...
   5 pods are running out from the 6
   3 'all_pods_are_running' is failed...
   1 'grep_log scm-0 SCM exiting safe mode.' is failed...
   2 'grep_log scm-0 SCM exiting safe mode.' is failed...
   3 'grep_log scm-0 SCM exiting safe mode.' is failed...
   4 'grep_log scm-0 SCM exiting safe mode.' is failed...
   5 'grep_log scm-0 SCM exiting safe mode.' is failed...
   
   **** Collecting container logs ****
   
   
   **** Deleting k8s resources ****
   
   configmap "config" deleted
   service "datanode" deleted
   service "datanode-public" deleted
   service "om" deleted
   service "om-public" deleted
   service "s3g" deleted
   service "s3g-public" deleted
   service "scm" deleted
   service "scm-public" deleted
   statefulset.apps "datanode" deleted
   statefulset.apps "om" deleted
   statefulset.apps "s3g" deleted
   statefulset.apps "scm" deleted
   
   $ ls -1 hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/kubernetes/examples/getting-started/logs
   pod-datanode-0.log
   pod-datanode-1.log
   pod-datanode-2.log
   pod-om-0.log
   pod-s3g-0.log
   pod-scm-0-init.log
   pod-scm-0.log
   ```
   
   With even fewer retries:
   
   ```
   $ RETRY_ATTEMPTS=2 OZONE_TEST_SELECTOR=getting-started ./hadoop-ozone/dev-support/checks/kubernetes.sh
   ...
   
   **** Waiting until the k8s cluster is running ****
   
   No resources found in default namespace.
   0 pods are running. Waiting for more.
   1 'all_pods_are_running' is failed...
   3 pods are running out from the 5
   2 'all_pods_are_running' is failed...
   
   **** Collecting container logs ****
   
   ...
   ```
   
   Regular CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/5472252928


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

