[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele That's correct. In the case of shared/remote storage, the same disk is used to spawn the VM on another host once the VM has been successfully fenced. If the fencer has successfully fenced off a VM, it is assumed that the original VM is correctly stopped. If you are saying that the original VM continues to run, then that specific fencer has bugs and needs fixing. Note that there are different types of fencers available in CloudStack, depending on hypervisor type.

@abhinandanprateek In the scenario you mentioned, vmsync won't be able to mark the VM as stopped, as the ping command is no longer running because the host is in the Alert/Down state.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
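The fencer contract described above (fencing succeeds only if the VM is guaranteed stopped, so its disk can safely be reused elsewhere) can be sketched roughly as follows. This is an illustrative model, not CloudStack's actual fencer API; the interface and class names are assumptions.

```java
import java.util.List;

// Hypothetical fencer contract: true = fenced (VM guaranteed stopped),
// false = fencing failed, null = this fencer cannot decide.
interface VmFencer {
    String hypervisorType();
    Boolean fenceOff(String vmName, String hostName);
}

class FencingCoordinator {
    private final List<VmFencer> fencers;

    FencingCoordinator(List<VmFencer> fencers) {
        this.fencers = fencers;
    }

    // Try every fencer registered for the host's hypervisor type; only
    // a definite true means the shared disk may be reused on another host.
    boolean isSafelyFenced(String vmName, String hostName, String hypervisor) {
        for (VmFencer f : fencers) {
            if (!f.hypervisorType().equals(hypervisor)) {
                continue;
            }
            Boolean result = f.fenceOff(vmName, hostName);
            if (result != null) {
                return result;
            }
        }
        return false; // no fencer could decide: do NOT reuse the disk
    }
}
```

The key property the discussion relies on is the last line: an undetermined result must be treated as "not fenced", otherwise two hosts could write to the same shared volume.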
Github user abhinandanprateek commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele For a host that is found to be down, we go ahead and schedule a restart for HA-enabled VMs, which is good. VMs that are not HA-enabled will continue to show as Running. That works in the scenario where the host finally comes around, but what if the host is gone for a long time, or forever? Then those VMs will continue to show as Running, and the user has to guess that he must stop and then start the VM. Can you check whether the VMs will eventually be marked down by VM sync? If that is the case, I think this fix should be good.

Another suggestion: for the specific case where a host drops and then comes back within a certain interval, can we make the timeout that marks a host down configurable? In your case you could increase it to several hours, so HA would not start during that time and the host could still connect back.
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

To get back to your previous comment @koushik-das on the broken scenario: what happens if the host is not reachable and the VMs are using remote storage? With the fencing operation marking the VM as stopped, does it mean that the same remote disk volume is used when the VM is spawned on another host (while the original is still running on the first host)?

@abhinandanprateek If the reason to fence off the VM is to clean up resources, IMO that should be the job of the VM sync, on the ping command/startup command. If a host is lost, the capacity of the cluster should reflect the loss of that host, and the capacity stats should be calculated based only on the hosts that are Up. When a host comes back (possibly with some VMs still running), the startup command should sync the VM states and the capacity of the cluster/zone should be updated. In short, cleaning up resources that are no longer "reachable" should not be needed, and those resources should not be taken into account when calculating the actual capacity of the cluster/zone.
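The capacity accounting argued for above (derive cluster capacity only from Up hosts, so a lost host shrinks reported capacity instead of requiring its VMs to be fenced) can be sketched like this. The `Host`/`Status` types are simplified stand-ins, not CloudStack's real classes.

```java
import java.util.List;

class CapacitySketch {
    enum Status { Up, Alert, Down }

    // Simplified host record: just a name, a state, and its RAM.
    record Host(String name, Status status, long totalRamMb) {}

    // Usable capacity counts only hosts that are Up; Alert and Down
    // hosts contribute nothing, so "ghost" resources never inflate
    // the cluster's apparent capacity.
    static long usableRamMb(List<Host> hosts) {
        return hosts.stream()
                .filter(h -> h.status() == Status.Up)
                .mapToLong(Host::totalRamMb)
                .sum();
    }
}
```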
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

I had already mentioned in a previous comment that there is no need for this PR in 4.9/master. So that means a -1.
Github user rhtyd commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele @koushik-das @jburwell @abhinandanprateek can we reach a conclusion on this PR? Thanks.
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

@abhinandanprateek In latest master, the sequence of events described above only happens when the host has been determined to be 'Down'; refer to the code below. So the bug described won't happen. Earlier, the same sequence used to be triggered even when the host state was 'Alert', which could kill healthy VMs.

> if (host != null && host.getStatus() == Status.Down) {
>     _haMgr.scheduleRestartForVmsOnHost(host, true);
> }

If there is still a possibility of healthy VMs getting killed, then the scenario needs to be clearly identified. If we need to fix anything, the first step would be to improve the VM investigators rather than change the existing fencing logic.

If we go ahead with the proposed fix, I can think of the following scenario that breaks: in a genuine host-down scenario, non-HA VMs would remain in the 'Running' state and no operations could be performed on them. Currently, non-HA VMs are marked 'Stopped' after fencing succeeds, and they can be manually started on another host.
Github user abhinandanprateek commented on the issue: https://github.com/apache/cloudstack/pull/1640

@jburwell @koushik-das @marcaurele When the MS is unable to determine the state of a VM, or thinks the VM requires an HA operation, it issues a stop command as part of the fence operation. The effect of this is to clean up the resources on the MS and keep the MS's resource bookkeeping intact. This has the potential to kill a healthy VM in some boundary cases, and we need to fix those boundary cases. If this cleanup/fence operation does not happen on the MS, then the resource allocation on the MS will be out of sync with the actual capacity, causing further complications and issues.
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1640

@abhinandanprateek @koushik-das @marcaurele have you been able to come to agreement about the correct functionality for this PR?
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

@jburwell I changed the commit and PR to point to 4.9.
Github user abhinandanprateek commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele @koushik-das When the MS thinks that a VM is down, it issues a stop command. This is done to clear up the resources tied up for that VM in the management server db. Now, it has been seen several times that this actually kills a healthy VM; I have seen this issue in an MS cluster with agent.lb turned on. The issue is that we do need a state cleanup when a running VM is found to be stopped on the host, but that should probably not induce a shutdown on the host. Again, this is a tricky boundary condition.
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

@koushik-das which is IMHO wrong.
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele

> What is the reason to try fencing off VMs when the MS is not able to determine its state? I cannot see a good reason so far but you seem to think there is at least one. Can you explain it?

If the MS is able to determine the VM state as up or down, why would you need fencing? Fencing is tried only when the state cannot be determined.
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

@koushik-das

> If the MS is not able to determine the state of the VM, it tries fencing off the VM (using the various fencers available). If VM cannot be fenced off successfully, the state of the VM is left unchanged.

Apparently I found a way in which VMs are successfully fenced off even though they should not be. What is the reason to try fencing off VMs when the MS is not able to determine their state? I cannot see a good reason so far, but you seem to think there is at least one. Can you explain it?

@jburwell It does not cover my case exactly, as it's a timing issue. I'll keep a note to find a way to create a scenario.
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

Also, since the automated test coverage of this area is limited, changes should be made only after taking into account all possible scenarios. Otherwise there might be regressions in some valid scenarios.
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

@jburwell The issue that has been reported is on a custom branch; @marcaurele probably needs to cherry-pick some additional commits from ACS. Master/4.9 doesn't have this issue, so in that sense the PR is not needed.

@marcaurele Please read my last comment again and go through the restart() method logic in the HA manager code.

> If the management server cannot determine the state of the VM, it could mark them as stopped (even though I don't think it should). But it should not create a StopVM job, because that might trigger a proper stop of the VM if the agent is reconnecting while the job is picked by async job workers.

The above is not correct. If the MS is not able to determine the state of the VM, it tries fencing off the VM (using the various fencers available). If the VM cannot be fenced off successfully, its state is left unchanged. Also, the VM is marked as stopped only if one of the investigators is able to determine the VM state as Down. Hope that clarifies things.
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele per @koushik-das, what is the issue with re-pointing this PR to the 4.9 release branch? When the PR is merged, it will be forward-merged to master. Therefore, the concern you expressed about the change getting to master is not an issue.
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele I agree that an ``UNKNOWN`` state is the proper way to handle a network partition. As an example, out-of-band management uses this approach when the management server loses connectivity with the IPMI interface.
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

I understand your point about the release, but we're not in an ideal world where everyone runs the latest version. I try my best to look at the current CS code to find possible fixes for any bug/problem we encounter, or changes we want to make in our version. I want us to get back to the master version, but that's not the topic here, nor is it going to happen in the next weeks.

Point 2 does not make sense to me. If the management server cannot determine the state of the VMs, it could mark them as stopped (*even though I don't think it should*). But it should not create a StopVM job, because that might trigger a proper stop of the VM if the agent is reconnecting while the job is picked up by the async job workers. If the VM is really down because the host has crashed, then the command is pointless, and from a customer's point of view it makes no difference. If the host is still up and fine but we have a network glitch, then requesting a stop of the VM is really bad from a customer's point of view. By not doing anything, not requesting a stop, we would end up in a better situation.

Going back to which state should be set on a VM when the management server cannot determine it: assuming the VM is stopped because the management server cannot reach the agent is just as incorrect as leaving it as it is (running, migrating, creating...). I'd rather create a new state `UNKNOWN` for such a special case, when the management server really does not know. From a management point of view it will also be easier to know there are *ghost* VMs somewhere whose exact state the management server cannot determine, and that proper (*manual*) investigation should be done if the state stays like this, with regard to the billing part too.
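The `UNKNOWN`-state proposal above amounts to a small change in the state-transition rule: an undetermined power report maps to a distinct state rather than to Stopped. A minimal sketch of that rule, using simplified enums rather than CloudStack's real VM state machine:

```java
class VmStateSketch {
    // What the investigators could (or could not) determine.
    enum PowerReport { ON, OFF, UNDETERMINED }
    enum VmState { Running, Stopped, Unknown }

    // Proposed rule: only a definite OFF report marks the VM Stopped;
    // an undetermined report flags the VM as a "ghost" needing manual
    // investigation (and is distinguishable for billing purposes).
    static VmState nextState(PowerReport report) {
        switch (report) {
            case ON:  return VmState.Running;
            case OFF: return VmState.Stopped;
            default:  return VmState.Unknown;
        }
    }
}
```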
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

Please use a proper ACS release for reporting bugs; in your case you may have to do some additional cherry-picks.

"Schedule restart" does multiple tasks. There is a method by this name in the code; the name may not be the most appropriate, so don't get confused by it. It does the following:
1. Tries to find out whether the VM is alive or not.
2. If it cannot determine conclusively that the VM is alive, it tries to fence off the VM.
3. After successful fencing, HA-enabled VMs are restarted on another host; non-HA VMs are marked as Stopped.

So as you see, non-HA VMs are simply stopped when the host is determined to be down, and not restarted. It makes sense to mark them as Stopped so that subsequent operations can be performed on them; for example, selected VMs can be explicitly started on another host. If a host is down, then power sync won't happen for that host and the VM states on it won't get updated.
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

@koushik-das We are running a fork based on 4.4.2 with lots of cherry-picking. But even if the host is down, why would you want to schedule a restart if the VMs are not HA-enabled?
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele Based on the initial few lines of the logs, the agent went to the Alert state:

```
srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current status is Up
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert
```

As per the latest ACS code (4.9/master), restarts of VMs on a host are scheduled only if the state of the host is determined to be Down; in the case of Alert, nothing is done. On what version of CS are you seeing this issue?
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

> Would you mind adding these notes to the bug ticket?

@jburwell All PR comments go automatically into the JIRA ticket comments, thanks to the ID matching (I think).
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1640

Are there Marvin test cases to verify this behavior?
Github user jburwell commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele Would you mind adding these notes to the bug ticket? It seems like valuable information that people searching JIRA for issues would find very useful.
Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

Here is the long explanation of the bug we had in production. Twice the VPN between two of our zones went down, which resulted in the loss of communication between the management server and all the agents of the other zone. After a few minutes the network came back, and as the agents were reconnecting we started to see VMs being shut down, even though we don't have HA enabled.

We have 2 management servers behind haproxy to balance agents & requests. When the network came back, the agent reconnected to the other management server, 01. Server 02 had already started to issue commands to shut down the VMs. The important lines are here; I kept the line showing one VM being shut down, `111750`:

```
srv02 2016-08-08 11:56:03,854 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) PingInvestigator was able to determine host 44692 is in Disconnected
srv02 2016-08-08 11:56:03,855 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Agent is disconnected but the host is still up: 44692-virt-hv041
srv02 2016-08-08 11:56:03,857 WARN [apache.cloudstack.alerts] (AgentTaskPool-16:ctx-8b5b6956) alertType:: 7 // dataCenterId:: 2 // podId:: 2 // clusterId:: null // message:: Host disconnected, name: virt-hv041 (id:44692), availability zone: ch-dk-2, pod: DK2-AZ1-POD01
srv02 2016-08-08 11:56:03,884 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Host 44692 is disconnecting with event AgentDisconnected
srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current status is Up
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Remove Agent : 44692
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900831481: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900834802: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900886805: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
...
srv02 2016-08-08 11:56:03,979 DEBUG [cloud.network.NetworkUsageManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Disconnected called on 44692 with status Alert
srv02 2016-08-08 11:56:03,987 DEBUG [cloud.host.Status] (AgentTaskPool-16:ctx-8b5b6956) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 44692, name = virt-hv041]
srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Forwarding [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}] to 345049010805
srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Dispatch ->44692, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Intercepting command for agent change: agent 44692 event: AgentDisconnected
srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Received agent disconnect event for host 44692
srv02 2016-08-08 11:56:04,002 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805 completed. time: 3ms. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,004 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Not processing AgentDisconnected event for the host id=44692 as the host is directly connected to the current management server 345049010805
srv02 2016-08-08 11:56:04,004 WARN [cloud.ha.HighAvailabilityManagerImpl]
```
Github user koushik-das commented on the issue: https://github.com/apache/cloudstack/pull/1640

@marcaurele Can you share the MS logs for this issue? We need to understand the exact cause of the restart of the VM. When an agent/host is detected as 'Down', CS tries to check whether the VMs on it are alive; if a VM is found alive, nothing is done to it. Also, if you think the host got disconnected intermittently, there are ways to adjust the timeout in CS after which it starts investigating the host status; try adjusting the ping.timeout configuration parameter to see if the issue is resolved.

The investigation to check whether a VM is alive is done for all VMs, irrespective of whether HA is enabled. If a host is really down, it makes sense to mark the VMs as stopped. Additionally, HA-enabled VMs, after they are successfully fenced off, are restarted on other hosts in the cluster.
Github user blueorangutan commented on the issue: https://github.com/apache/cloudstack/pull/1640

Packaging result: ✔centos6 ✔centos7 ✔debian. repo: http://packages.shapeblue.com/cloudstack/pr/1640 Job ID-89
Github user blueorangutan commented on the issue: https://github.com/apache/cloudstack/pull/1640

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.
Github user rhtyd commented on the issue: https://github.com/apache/cloudstack/pull/1640

@blueorangutan package LGTM. @abhinandanprateek can you review this as well?