zap51 commented on issue #7829:
URL: https://github.com/apache/cloudstack/issues/7829#issuecomment-1678646047

   > > I'll try to reproduce this issue and come back. @weizhouapache has 
provided the patch to only consider images / ISOs if the secondary storage has 
the image in READY state but it still doesn't work.
   > > @saffronjam would you be able to check if the binary ISOs present on the 
secondary storage match on all the NFS shares (RW and RO), perhaps a checksum 
would be good. Also see if you can send the output of `# journalctl -u 
setup-kube-system.service --no-pager` if feasible.
   > > Also, do we happen to see any network issues by any chance between the 
hypervisors and the NFS server? I haven't taken a thorough look at the logs yet; 
I'll come back. As far as I can see, this has something to do with the binary 
images, but I'm not sure.
   > > Thanks, Jayanth
   > 
   > @zap51 good point, it might be possible that the ISO is Ready (CloudStack 
state) on the secondary storages but actually not
   
   @weizhouapache The scenarios are as follows
   
   1. The secondary storage is completely unavailable and the NFS client can't 
even mount it. In this case, it appears the following exception is thrown:
   
   ```
   Failed to attach binaries ISO for VM : cks-many-node-189f864d00a in the Kubernetes cluster name: cks-many
   ``` 
   
   So if the hypervisor can't mount the share, which should look something 
like the line below, the above exception is thrown.
   
   ```
   10.231.15.201:/secondary/template/tmpl/1/213 nfs4      17T  5.5G   17T   1% /mnt/59c5727a-6a4a-32c0-9e27-fc300161f497
   ```
   
   2. In some cases I've seen, the NFS server misbehaves and doesn't accept 
reads or writes even though mount operations go through. That might also end up 
in this issue.
   
   3. If there are multiple secondary storage servers, these are some possible 
scenarios:
     - The secondary storage (RW) might be experiencing issues like in point 
(2).
     - The secondary storage (RO) doesn't have the ISO. The `<id>` directory 
under `template/tmpl/1/` would not have been created, but this should end up in 
point (1).
     - The secondary storage (RO) has the directory structure 
`template/tmpl/1/<id>` created, but there is no ISO, or there is a partial ISO 
named something like `_tmp`. In this case, the mount should have gone through 
but the ISO cannot be attached. Does it throw the same exception as point (1)?
   
   4. @weizhouapache How does ACS decide which secondary storage to mount? Does 
it pick a secondary storage server at random, or is there some order?
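   
   To tell the sub-cases in point 3 apart, something like the sketch below 
could be run against each `<id>` directory. It only looks at file names. The 
`iso_state` helper is hypothetical, and it assumes that a complete ISO ends in 
`.iso` and that a partial download has `_tmp` in its name; I haven't confirmed 
either against the CloudStack code.
   
   ```shell
   # Classify what one template/ISO directory on a secondary storage holds.
   # Usage: iso_state <dir>
   # <dir> is either <share>/template/tmpl/1/<id> or the hypervisor mount point.
   # Assumptions: complete ISOs end in ".iso"; partial downloads contain "_tmp".
   iso_state() {
       dir="$1"
       if [ ! -d "$dir" ]; then
           echo "no-directory"      # directory was never created (or not mounted)
       elif ls "$dir" | grep -q '\.iso$'; then
           echo "iso-present"       # at least one file name ends in .iso
       elif ls "$dir" | grep -q '_tmp'; then
           echo "partial-iso"       # only a leftover temporary file is there
       else
           echo "no-iso"            # directory exists but holds no ISO at all
       fi
   }
   ```
   
   On a hypervisor this would be called with the mount point, for example 
`iso_state /mnt/59c5727a-6a4a-32c0-9e27-fc300161f497`. It doesn't cover point 
(2): if the NFS server has stopped serving reads, the `ls` itself may hang.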
   
   @saffronjam Maybe we can get started by finding which secondary storage has 
the ISOs present. They should be found under `template/tmpl/1/<id>`. The 
`template.properties` file under each `<id>` directory should have the 
description of the template. 
   While reproducing the issue, please check which of your NFS servers are 
being mounted on the hypervisors using something like `$ df -Th | grep nfs`, 
and at the same time check whether your worker / master nodes are failing to 
mount the ISOs:
   
   ```
   ssh -p <ssh_port> -o stricthostkeychecking=off cloud@<Ingress_IP> 'sudo grep "/mnt/k8*" /var/log/daemon.log'
   ```
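   
   For the checksum comparison suggested earlier in the thread, a minimal 
sketch could look like the one below. The `iso_checksums` helper is 
hypothetical, and it assumes the complete ISOs end in `.iso`. Pass it the 
`<id>` directory from each secondary storage (RW and RO); the copies match if 
the hashes printed are identical.
   
   ```shell
   # Print one "<sha256>  <path>" line per ISO found in the given directories,
   # so copies on different secondary storages can be compared.
   # Usage: iso_checksums <dir> [<dir> ...]
   # Assumption: complete ISOs end in ".iso".
   iso_checksums() {
       for dir in "$@"; do
           for f in "$dir"/*.iso; do
               [ -e "$f" ] || continue   # glob matched nothing in this directory
               sha256sum "$f"
           done
       done
   }
   ```
   
   A directory that prints nothing has no complete ISO, which is also worth 
noting in the reply.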
   
   Also please share the management server logs.
   
   Thanks,
   Jayanth Reddy


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
