saffronjam opened a new issue, #7829:
URL: https://github.com/apache/cloudstack/issues/7829

   ##### ISSUE TYPE
    * Bug Report
   
   ##### COMPONENT NAME
   ~~~
   CKS
   ~~~
   
   ##### CLOUDSTACK VERSION
   ~~~
   4.18
   ~~~
   
   ##### CONFIGURATION
   CKS 1.24.0 and 1.27.3

   (I saw the same problems with 1.26.0 as well, but did not include it in this bug report.)
   
   I tried 6 different setups, combining 2 service offerings with the 2 CKS versions specified above.
   Every setup had 1 control node and used the default Kubernetes isolated network offering.
   
   SO1 (small):
   - 1 CPU core
   - 2 GB RAM
   - 8 GB root disk
   
   SO2 (big):
   - 4 CPU cores
   - 16 GB RAM
   - 64 GB root disk
   
   ```
   test-124-small
   
   1 worker
   SO 1
   CKS 1.24.0
   ```
   
   ```
   test-124-small-many
   
   10 workers
   SO 1
   CKS 1.24.0
   ```
   
   ```
   test-1273-small
   
   1 worker
   SO 1
   CKS 1.27.3
   ```
   
   ```
   test-1273-small-many
   
   10 workers
   SO 1
   CKS 1.27.3
   ```
   
   ```
   test-1273-big
   
   1 worker
   SO 2
   CKS 1.27.3
   ```
   
   ```
   test-1273-big-many
   
   10 workers
   SO 2
   CKS 1.27.3
   ```
   
   ##### OS / ENVIRONMENT
   Ubuntu 22.04 nodes running KVM
   
   ##### SUMMARY
   Creating a Kubernetes cluster sometimes fails. Failures appear to be more common when creating many nodes rather than just a few.
   
   I can access the nodes of the failed clusters over SSH just fine, and I am able to look at the logs, such as /var/log/daemon.log.
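   For anyone triaging similar failures over SSH, the relevant messages can be pulled straight out of daemon.log. This is a hedged sketch: `triage_daemon_log` is a hypothetical helper, not part of CloudStack, and the patterns simply match the cloud-init messages quoted in this report.

   ```shell
   # triage_daemon_log: print the offline-install related messages from a
   # given daemon.log. Hypothetical helper; the grep patterns match the
   # cloud-init messages quoted in this report.
   triage_daemon_log() {
     grep -E 'Binaries directory|Offline install|kubeadm: command not found' "$1"
   }

   # usage on a failed node:
   #   triage_daemon_log /var/log/daemon.log
   ```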
   
   In _test-1273-big-many-node1_, the entries in daemon.log indicate that the issue is related to the binaries ISO not being attached:
   ```
   Aug  8 09:30:21 systemvm cloud-init[1102]: Waiting for Binaries directory /mnt/k8sdisk/ to be available, sleeping for 15 seconds, attempt: 99
   Aug  8 09:30:36 systemvm cloud-init[1102]: Waiting for Binaries directory /mnt/k8sdisk/ to be available, sleeping for 15 seconds, attempt: 100
   Aug  8 09:30:51 systemvm cloud-init[1102]: Warning: Offline install timed out!
   ```
   which I assume could cause the following entry:
   ```
   Aug  8 09:30:52 systemvm deploy-kube-system[1420]: /opt/bin/deploy-kube-system: line 19: kubeadm: command not found
   ```
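   The log format suggests the node setup script polls for the binaries directory with a bounded retry loop. A minimal sketch of that kind of loop is below; the function name, attempt counts, and non-empty-directory check are assumptions for illustration, not the actual CloudStack script.

   ```shell
   # wait_for_binaries: poll until a binaries directory exists and is
   # non-empty, or give up after max_attempts. Sketch of the retry loop
   # the cloud-init log suggests; not the actual CKS setup script.
   wait_for_binaries() {
     dir="$1"; max_attempts="$2"; interval="$3"
     attempt=1
     while [ "$attempt" -le "$max_attempts" ]; do
       if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
         echo "Installing binaries from $dir"
         return 0
       fi
       echo "Waiting for Binaries directory $dir to be available, sleeping for $interval seconds, attempt: $attempt"
       sleep "$interval"
       attempt=$((attempt + 1))
     done
     echo "Warning: Offline install timed out!"
     return 1
   }
   ```

   If the ISO is never attached, the directory never appears, the loop exhausts its attempts, and every later step that expects kubeadm on the node fails exactly as shown above.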
   
   But that is not the case for every failed cluster; _test-1273-small-many_, for example, actually succeeds on some nodes:
   ```
   Aug  8 09:43:10 systemvm cloud-init[1070]: Waiting for Binaries directory /mnt/k8sdisk/ to be available, sleeping for 30 seconds, attempt: 8
   Aug  8 09:43:41 systemvm cloud-init[1070]: Waiting for Binaries directory /mnt/k8sdisk/ to be available, sleeping for 30 seconds, attempt: 9
   Aug  8 09:44:11 systemvm cloud-init[1070]: Installing binaries from /mnt/k8sdisk/
   ```
   but fails on the 5th node out of 10 with the same error:
   ```
   Aug  8 10:00:16 systemvm cloud-init[1072]: Waiting for Binaries directory /mnt/k8sdisk/ to be available, sleeping for 30 seconds, attempt: 39
   Aug  8 10:00:46 systemvm cloud-init[1072]: Waiting for Binaries directory /mnt/k8sdisk/ to be available, sleeping for 30 seconds, attempt: 40
   Aug  8 10:01:16 systemvm cloud-init[1072]: Warning: Offline install timed out!
   ```
   with the kubectl output:
   ```
   ➜  ~ kubectl get nodes
   NAME                                       STATUS   ROLES           AGE    VERSION
   test-1273-small-many-control-189d4833e43   Ready    control-plane   3h7m   v1.27.3
   test-1273-small-many-node-189d483a95b      Ready    <none>          3h7m   v1.27.3
   test-1273-small-many-node-189d484150e      Ready    <none>          3h7m   v1.27.3
   test-1273-small-many-node-189d4847014      Ready    <none>          3h7m   v1.27.3
   test-1273-small-many-node-189d484b721      Ready    <none>          3h7m   v1.27.3
   ....
   (there should be 6 more worker nodes)
   ```
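   To quantify how many workers actually joined, the Ready count can be checked against the expected node count on the control node. A hedged sketch: `count_ready` is a hypothetical helper, not part of kubectl or CloudStack; it just counts rows whose STATUS column reads `Ready`.

   ```shell
   # count_ready: counts lines whose second column (STATUS) is "Ready"
   # in `kubectl get nodes --no-headers` output. Hypothetical helper
   # for quick triage, not part of any tool.
   count_ready() {
     awk '$2 == "Ready"' | wc -l
   }

   # usage on the control node (expect 11 = 1 control + 10 workers):
   #   kubectl get nodes --no-headers | count_ready
   ```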
   
   
   All logs can be found at the bottom of this issue.
   
   ##### STEPS TO REPRODUCE
   ~~~
   1. Create a cluster with one of the configurations above
   2. Wait for it to be created
   3. Check if it worked or not
   ~~~
   
   ##### EXPECTED RESULTS
   ~~~
   Cluster ends up in Running state 
   ~~~
   
   ##### ACTUAL RESULTS
   ~~~
   Cluster ends up in Error state
   ~~~
   
   ##### LOGS
   I supplied logs from /var/log/daemon.log for the controller and the first worker of each cluster.
   
   For clusters with more than 1 worker, I also added the logs from the first worker that failed.
   
   **SUCCEEDED**
   
[test-124-small-controller.txt](https://github.com/apache/cloudstack/files/12291211/test-124-small-controller.txt)
   
[test-124-small-node1.txt](https://github.com/apache/cloudstack/files/12291215/test-124-small-node1.txt)
   
   **FAILED**
   
[test-124-small-many-controller.txt](https://github.com/apache/cloudstack/files/12291212/test-124-small-many-controller.txt)
   
[test-124-small-many-node1.txt](https://github.com/apache/cloudstack/files/12291214/test-124-small-many-node1.txt)
   
[test-124-small-many-node2.txt](https://github.com/apache/cloudstack/files/12291652/test-124-small-many-node2.txt)
   
   **FAILED**
   
[test-1273-small-controller.txt](https://github.com/apache/cloudstack/files/12291221/test-1273-small-controller.txt)
   
[test-1273-small-node1.txt](https://github.com/apache/cloudstack/files/12291224/test-1273-small-node1.txt)
   
   **FAILED**
   
[test-1273-small-many-controller.txt](https://github.com/apache/cloudstack/files/12291222/test-1273-small-many-controller.txt)
   
[test-1273-small-many-node1.txt](https://github.com/apache/cloudstack/files/12291223/test-1273-small-many-node1.txt)
   
[test-1273-small-many-node5.txt](https://github.com/apache/cloudstack/files/12291554/test-1273-small-many-node5.txt)
   
   **SUCCEEDED**
   
[test-1273-big-controller.txt](https://github.com/apache/cloudstack/files/12291216/test-1273-big-controller.txt)
   
[test-1273-big-node1.txt](https://github.com/apache/cloudstack/files/12291219/test-1273-big-node1.txt)
   
   **FAILED**
   
[test-1273-big-many-controller.txt](https://github.com/apache/cloudstack/files/12291217/test-1273-big-many-controller.txt)
   
[test-1273-big-many-node1.txt](https://github.com/apache/cloudstack/files/12291218/test-1273-big-many-node1.txt)
   (failed on the first node)
   

