zap51 commented on issue #7829:
URL: https://github.com/apache/cloudstack/issues/7829#issuecomment-1678646047
> > I'll try to reproduce this issue and come back. @weizhouapache has
provided the patch to only consider images / ISOs if the secondary storage has
the image in READY state but it still doesn't work.
> > @saffronjam would you be able to check if the binary ISOs present on the
secondary storage match on all the NFS shares (RW and RO), perhaps a checksum
would be good. Also see if you can send the output of `# journalctl -u
setup-kube-system.service --no-pager` if feasible.
> > Also, do we happen to see any network issues by any chance between the
hypervisors and the NFS server? I haven't taken a full look at the logs yet; I'll
come back. As far as I can see, this has something to do with the binary images,
but I'm not sure.
> > Thanks, Jayanth
>
> @zap51 good point, it might be possible that the ISO is Ready (CloudStack
state) on the secondary storages but actually not
@weizhouapache The scenarios are as follows:
1. The secondary storage is totally unavailable and the NFS client can't
even mount it. In this case, it appears that the following exception is thrown:
```
Failed to attach binaries ISO for VM : cks-many-node-189f864d00a in the Kubernetes cluster name: cks-many
```
So if the hypervisor can't mount the share, which should look something like
the entry below, the exception above is thrown.
```
10.231.15.201:/secondary/template/tmpl/1/213 nfs4 17T 5.5G 17T 1% /mnt/59c5727a-6a4a-32c0-9e27-fc300161f497
```
2. In some cases that I've seen, the NFS server becomes buggy and doesn't
accept reads or writes even though mount operations go through, which might
end up in this issue.
3. If there are multiple secondary storage servers, these are some
possible scenarios:
- The secondary storage (RW) might be experiencing issues like those in point
(2).
- The secondary storage (RO) doesn't have the ISO. The `<id>` directory under
`template/tmpl/1/<id>` would not have been created, but this should
end up as in point (1).
- The secondary storage (RO) has the directory structure
`template/tmpl/1/<id>` created, but there is no ISO, or there is only a partial
ISO named something like `_tmp`. In this case, the mount should have gone
through but the ISO cannot be attached. Does it throw the same exception as in
point (1)?
4. @weizhouapache How does ACS find which secondary storage to mount? Does
it mount random secondary storage servers, or does it follow some order?
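
To tell the sub-cases of scenario 3 apart on a given share, something like the rough Python sketch below could be run against each `template/tmpl/1/<id>` directory. The function names and the four state labels are made up for illustration, and it assumes a complete ISO ends in `.iso` while a partial download has `_tmp` somewhere in its file name.

```python
from pathlib import Path


def classify_listing(file_names):
    """Classify a list of file names found in one <id> directory.

    Assumption: a finished ISO ends in ".iso"; a partial one contains "_tmp".
    """
    if any(name.lower().endswith(".iso") for name in file_names):
        return "present"
    if any("_tmp" in name for name in file_names):
        return "partial"
    return "no_iso"


def classify_template_dir(template_dir):
    """Return 'missing', 'no_iso', 'partial' or 'present' for one <id> directory."""
    path = Path(template_dir)
    if not path.is_dir():
        # Directory was never created on this share.
        return "missing"
    names = [entry.name for entry in path.iterdir() if entry.is_file()]
    return classify_listing(names)
```

Running `classify_template_dir("<mount_point>/template/tmpl/1/<id>")` against the RW and RO shares and comparing the results should show whether they disagree.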
@saffronjam Maybe we can get started by finding which secondary storage has
the ISOs present. They should be found under `template/tmpl/1/<id>`. The
`template.properties` file under each `<id>` directory should have the
description of the template.
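
A rough Python sketch for collecting that information per share is below, so the output from the RW and RO servers can be compared side by side. It assumes `template.properties` is a plain `key=value` file and that the ISO files end in `.iso`; the function names are only illustrative, and `share_root` stands for wherever the share is mounted.

```python
import hashlib
from pathlib import Path


def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large ISOs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def summarize_share(share_root):
    """Yield (id, properties, {iso_name: sha256}) per directory under template/tmpl/1."""
    base = Path(share_root) / "template" / "tmpl" / "1"
    if not base.is_dir():
        return
    for template_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        props_file = template_dir / "template.properties"
        props = parse_properties(props_file.read_text()) if props_file.is_file() else {}
        checksums = {
            iso.name: sha256_of(iso)
            for iso in sorted(template_dir.iterdir())
            if iso.is_file() and iso.name.lower().endswith(".iso")
        }
        yield template_dir.name, props, checksums
```

If the same `<id>` shows different checksums (or no ISO at all) on one of the shares, that share is the likely culprit.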
While reproducing the issue, please check which of your NFS servers
are being mounted on the hypervisors using something like `$ df -Th | grep nfs`,
and at the same time check whether your worker / master nodes are failing to
mount the ISOs:
```
ssh -p <ssh_port> -o stricthostkeychecking=off cloud@<Ingress_IP> 'sudo grep "/mnt/k8*" /var/log/daemon.log'
```
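
If it helps to collect the mount information from several hypervisors, the `df -Th` output can be turned into structured records with a small Python helper like the sketch below. The function name is made up, and it assumes the usual seven-column `df -Th` layout with each entry on a single line.

```python
def nfs_mounts(df_output):
    """Extract NFS entries from `df -Th` output as a list of dicts."""
    mounts = []
    for line in df_output.splitlines():
        fields = line.split()
        # Expected columns: Filesystem Type Size Used Avail Use% Mounted-on
        if len(fields) < 7 or not fields[1].startswith("nfs"):
            continue
        server, _, export = fields[0].rpartition(":")
        mounts.append(
            {
                "server": server,
                "export": export,
                "fstype": fields[1],
                "mountpoint": " ".join(fields[6:]),
            }
        )
    return mounts
```

Feeding it the text captured from `df -Th` on each hypervisor shows which NFS server and export each one has actually mounted.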
Also please share the management server logs.
Thanks,
Jayanth Reddy