[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-11-21 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele That's correct. In the case of shared/remote storage, the same disk
is used to spawn the VM on another host once the VM is successfully fenced. If
the fencer has successfully fenced off a VM, it is assumed that the original VM
is correctly stopped. If you are saying that the original VM continues to run,
then that specific fencer has bugs and needs fixing. Note that there are
different types of fencers available in CloudStack, based on hypervisor type.

@abhinandanprateek In the scenario you mentioned, VM sync won't be able to
mark the VM as stopped, because the ping command is no longer running while the
host is in Alert/Down state.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-11-21 Thread abhinandanprateek
Github user abhinandanprateek commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele For a host that is found to be down, we go ahead and schedule a
restart for HA-enabled VMs, which is good.

VMs that are not HA-enabled will continue to show as running. That works in the
scenario where the host eventually comes back. But if the host is gone for a
long time, or forever, those VMs will keep showing as running, and the user has
to guess that he needs to stop and then start each VM. Can you check whether
such VMs will eventually be marked down by VM sync? If that is the case, I
think this fix should be good.

Another suggestion: for the specific case where a host drops and then comes
back within a certain interval, can we make the timeout that marks a host as
down configurable? In your case you could increase it to several hours, so HA
would not start during that window and the host could still reconnect.






[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-11-21 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
To get back to your previous comment @koushik-das on the broken scenario:
what happens if the host is not reachable and the VMs are using remote
storage? With the fencing operation marking the VM as stopped, does it mean
that the same remote disk volume is used if the VM is spawned on another host
(while the original one is still running on the first host)?

@abhinandanprateek If the reason to fence off the VM is to clean up
resources, IMO that should be the job of VM sync, on the ping
command/startup command. If a host is lost, the capacity of the cluster
should reflect the loss of that host, and the capacity statistics should be
calculated from the hosts that are Up only. When a host comes back (possibly
with some VMs still running), the startup command should sync the VM states and
the capacity of the cluster/zone should be updated.
In short, cleaning up resources that are no longer "reachable" should not
be needed, and such resources should not be taken into account when calculating
the actual capacity of the cluster/zone.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-11-21 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
I had already mentioned in a previous comment that there is no need for
this PR in 4.9/master, so that means a -1.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-11-20 Thread rhtyd
Github user rhtyd commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele @koushik-das @jburwell @abhinandanprateek can we come to a
conclusion on this PR? Thanks.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-09-18 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@abhinandanprateek In the latest master, the sequence of events described above
only happens when the host has been determined to be 'Down'. Refer to the code
below; the bug described won't happen. Earlier, the same sequence used to be
triggered even when the host state was 'Alert', which could kill healthy VMs.

> if (host != null && host.getStatus() == Status.Down) {
>     _haMgr.scheduleRestartForVmsOnHost(host, true);
> }

If there is still a possibility of healthy VMs getting killed, then the
scenario needs to be clearly identified. If anything needs fixing, the first
thing would be to look at improving the VM investigators rather than changing
the existing fencing logic.

If we go ahead with the proposed fix, I can think of the following scenario
that breaks: in a genuine host-down scenario, non-HA VMs would remain in the
'Running' state and no operations could be performed on them. Currently,
non-HA VMs are marked as 'Stopped' after fencing succeeds, and they can be
manually started on another host.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-09-18 Thread abhinandanprateek
Github user abhinandanprateek commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@jburwell @koushik-das @marcaurele When the MS is unable to determine the state
of a VM, or thinks the VM requires an HA operation, it issues a stop
command as part of the fence operation.
The effect of this is to clean up the resources on the MS and keep the
resource bookkeeping on the MS intact. This has the potential to kill a healthy
VM in some boundary cases; we need to fix those boundary cases.
If this cleanup/fence operation does not happen on the MS, the resource
allocation on the MS will be out of sync with the actual capacity, causing
further complications and issues.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-09-18 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@abhinandanprateek @koushik-das @marcaurele have you been able to come to 
agreement about the correct functionality for this PR?




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-09-13 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@jburwell I changed the commit and PR to point to 4.9




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-30 Thread abhinandanprateek
Github user abhinandanprateek commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele @koushik-das When the MS thinks that a VM is down, it issues
a stop command. This is done to clear up the resources tied up for that VM in
the management server database. It has been seen several times that this can
actually kill a healthy VM; I have seen this issue in an MS cluster with
agent.lb turned on.
The issue is that we do need a state cleanup when a running VM is found to
be stopped on the host, but that cleanup should probably not induce a shutdown
on the host. Again, this is a tricky boundary condition.





[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-24 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@koushik-das which is IMHO wrong.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-24 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele 

> What is the reason to try fencing off VMs when the MS is not able to 
determine its state? I cannot see a good reason so far but you seem to think 
there is at least one. Can you explain it?

If the MS is able to determine the VM state as up or down, why do you need
fencing? Fencing is only tried when the state cannot be determined.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-23 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@koushik-das 
> If the MS is not able to determine the state of the VM, it tries fencing 
off the VM (using the various fencers available). If VM cannot be fenced off 
successfully, the state of the VM is left unchanged. 

Apparently I found a way where VMs are successfully fenced off even
though they should not be.

What is the reason to try fencing off VMs when the MS is not able to 
determine its state? I cannot see a good reason so far but you seem to think 
there is at least one. Can you explain it?

@jburwell It does not cover my case exactly as it's a timing issue. I'll 
keep a note to find a way to create a scenario.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-22 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
Also, since the automated test coverage of this area is low, changes should
be made only after taking all possible scenarios into account; otherwise there
might be regressions in some valid scenarios.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-22 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@jburwell The issue that has been reported is on a custom branch; @marcaurele
probably needs to cherry-pick some additional commits from ACS. Master/4.9
doesn't have this issue, so in that sense the PR is not needed.

@marcaurele Please read my last comment again and go through the restart()
method logic in the HA manager code.

> If the management server cannot determine the state of the VM, it could
> mark them as stopped (even though I don't think it should). But it should not
> create a StopVM job, because that might trigger a proper stop of the VM if the
> agent is reconnecting while the job is picked up by the async job workers.

The above is not correct. If the MS is not able to determine the state of
the VM, it tries fencing off the VM (using the various fencers available). If
the VM cannot be fenced off successfully, its state is left unchanged. The VM
is marked as stopped only if one of the investigators is able to determine the
VM state as Down. Hope that clarifies things.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-22 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele per @koushik-das, what is the issue with re-pointing this PR to
the 4.9 release branch? When the PR is merged, it will be forward-merged to
master, so the concern you expressed about the change reaching master is not
an issue.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-22 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele I agree that an ``UNKNOWN`` state is the proper way to handle a
network partition. As an example, out-of-band management uses this approach
when the management server loses connectivity with the IPMI interface.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-18 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
I understand your point about the release, but we're not in an ideal world
where everyone runs the latest version. I do my best to look at the current
CloudStack code to find possible fixes for any bug/problem we encounter, or
changes we want to make in our version. I want us to get back to the master
version, but that's not the topic here, nor is it going to happen in the next
few weeks.

Point 2 does not make sense to me. If the management server cannot
determine the state of the VM, it could mark them as stopped (*even though I
don't think it should*). But it should not create a StopVM job, because that
might trigger a proper stop of the VM if the agent is reconnecting while the
job is picked up by the async job workers.
If the VM is really down because the host has crashed, then the command is
pointless, and from a customer's point of view it would make no difference. If
the host is still up and fine but we had a network glitch, then requesting a
stop of the VM is really bad from a customer's point of view. By not doing
anything, not requesting a stop, we would end up in a better situation.

Going back to which state should be set on the VM when the management
server cannot determine it: assuming the VM is stopped because the management
server cannot reach the agent is just as incorrect as leaving the state as it
is (running, migrating, creating...). I'd rather create a new state `UNKNOWN`
for this special case, when the management server really does not know. From a
management point of view it would also be easier to see that there are *ghost*
VMs somewhere whose exact state the management server cannot determine, so
that a proper (*manual*) investigation can be done if the state stays like
this, with the billing implications in mind too.
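A minimal sketch of how such an `UNKNOWN` state could behave (hypothetical
names and transitions, assumed for illustration only; CloudStack's real VM
state machine is far more involved and would need proper transition-table
wiring):

```java
// Hypothetical sketch: an UNKNOWN state for VMs whose host is unreachable.
// Names are illustrative, not CloudStack's actual enum or API.
public class VmStateSketch {
    enum State { RUNNING, STOPPED, MIGRATING, UNKNOWN }

    // When the agent is unreachable, move to UNKNOWN instead of guessing
    // Stopped and issuing a StopVM job.
    static State onAgentUnreachable(State current) {
        return State.UNKNOWN;
    }

    // When the host reconnects, its startup/ping report gives the true state.
    static State onHostReport(State current, boolean reportedRunning) {
        return reportedRunning ? State.RUNNING : State.STOPPED;
    }

    public static void main(String[] args) {
        State s = onAgentUnreachable(State.RUNNING);
        assert s == State.UNKNOWN;
        // Host comes back with the VM still running: no stop was ever issued.
        assert onHostReport(s, true) == State.RUNNING;
        System.out.println("ok");
    }
}
```

In this model, a network glitch never destroys a healthy VM: the worst case is
a VM lingering in `UNKNOWN` until an operator (or the returning host) resolves
it.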




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-18 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
Please use a proper ACS release for reporting bugs. In your case you may
have to do some additional cherry-picks.

"Schedule restart" does multiple tasks. There is a method by this name in
the code; the name may not be the most appropriate, but don't get confused by
it. It does the following:
1. Tries to find out whether the VM is alive or not
2. If it cannot conclusively determine that the VM is alive, it tries to
fence off the VM
3. After successful fencing, HA-enabled VMs are restarted on another host,
and non-HA VMs are marked as Stopped

So, as you can see, non-HA VMs are simply stopped when the host is determined
to be down, not restarted. It makes sense to mark them as stopped so that
subsequent operations can be performed on them; e.g., selected VMs may be
explicitly started on another host. If a host is down, power sync won't
happen for that host and the VM states on that host won't get updated.
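The three-step flow above can be sketched roughly as follows (a simplified,
hypothetical model; the real logic lives in the HA manager's restart() method,
and all class and method names here are illustrative, not CloudStack's actual
API):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of the flow described above:
// 1) investigate whether the VM is alive, 2) fence if the state is unknown,
// 3) restart HA-enabled VMs elsewhere, mark non-HA VMs as Stopped.
public class HaRestartSketch {
    enum Outcome { RESTARTED, STOPPED, UNCHANGED }

    interface Investigator { Optional<Boolean> isVmAlive(String vmId); }
    interface Fencer { boolean fenceOff(String vmId); }

    static Outcome restart(String vmId, boolean haEnabled,
                           List<Investigator> investigators,
                           List<Fencer> fencers) {
        // Step 1: ask each investigator in turn for a conclusive answer.
        for (Investigator inv : investigators) {
            Optional<Boolean> alive = inv.isVmAlive(vmId);
            if (alive.isPresent()) {
                // Alive -> do nothing; conclusively dead -> mark Stopped.
                return alive.get() ? Outcome.UNCHANGED : Outcome.STOPPED;
            }
        }
        // Step 2: state unknown -> try fencers until one succeeds.
        boolean fenced = fencers.stream().anyMatch(f -> f.fenceOff(vmId));
        if (!fenced) {
            return Outcome.UNCHANGED; // cannot fence: leave the state as-is
        }
        // Step 3: fenced successfully.
        return haEnabled ? Outcome.RESTARTED : Outcome.STOPPED;
    }

    public static void main(String[] args) {
        List<Investigator> inconclusive = List.of(id -> Optional.empty());
        List<Fencer> okFencer = List.of(id -> true);
        List<Fencer> badFencer = List.of(id -> false);

        // Non-HA VM, unknown state, fencing succeeds -> marked Stopped.
        assert restart("vm-1", false, inconclusive, okFencer) == Outcome.STOPPED;
        // HA VM, unknown state, fencing succeeds -> restarted on another host.
        assert restart("vm-2", true, inconclusive, okFencer) == Outcome.RESTARTED;
        // Fencing fails -> state left unchanged.
        assert restart("vm-3", true, inconclusive, badFencer) == Outcome.UNCHANGED;
        System.out.println("ok");
    }
}
```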




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-18 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@koushik-das We are running a fork based on 4.4.2 with lots of
cherry-picks.

But even if the host is down, why would you want to schedule a restart if
the VMs are not HA-enabled?




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-18 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele Based on the initial few lines of the logs, the agent went to
the Alert state.

srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current 
status is Up
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert

As per the latest ACS code (4.9/master), restarts of VMs on a host are
scheduled only if the host state is determined to be Down. In the Alert case,
nothing is done.

On what version of CS are you seeing this issue?




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-18 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
> Would you mind adding these notes to the bug ticket?

@jburwell All PR comments go automatically into the JIRA ticket as
comments, thanks to the ID matching (I think).




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
Are there Marvin test cases to verify this behavior?




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread jburwell
Github user jburwell commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele Would you mind adding these notes to the bug ticket?  It seems 
like valuable information that people searching JIRA for issues would find very 
useful.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread marcaurele
Github user marcaurele commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
I'll give the long explanation of the production bug we had.

Twice, the VPN went down between two of our zones, which resulted in the
loss of communication between the management server and all the agents of the
other zone. After a few minutes the network came back, and as the agents were
reconnecting we started to see VMs being shut down, even though we don't have
HA enabled. We have 2 management servers behind haproxy to balance agents &
requests. When the network came back, the agent reconnected to the other
management server, 01, but server 02 had already started to issue commands to
shut down the VMs.

The important lines are below; I kept the lines showing one VM (`111750`)
being shut down:

```
srv02 2016-08-08 11:56:03,854 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) PingInvestigator was able to determine host 
44692 is in Disconnected
srv02 2016-08-08 11:56:03,855 WARN  [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Agent is disconnected but the host is still up: 
44692-virt-hv041
srv02 2016-08-08 11:56:03,857 WARN  [apache.cloudstack.alerts] 
(AgentTaskPool-16:ctx-8b5b6956)  alertType:: 7 // dataCenterId:: 2 // podId:: 2 
// clusterId:: null // message:: Host disconnected, name: virt-hv041 
(id:44692), availability zone: ch-dk-2, pod: DK2-AZ1-POD01
srv02 2016-08-08 11:56:03,884 INFO  [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Host 44692 is disconnecting with event 
AgentDisconnected
srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current 
status is Up
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Remove Agent : 44692
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] 
(AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900831481: Sending 
disconnect to class com.cloud.network.security.SecurityGroupListener
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] 
(AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900834802: Sending 
disconnect to class com.cloud.network.security.SecurityGroupListener
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] 
(AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900886805: Sending 
disconnect to class com.cloud.network.security.SecurityGroupListener
...
srv02 2016-08-08 11:56:03,979 DEBUG [cloud.network.NetworkUsageManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Disconnected called on 44692 with status Alert
srv02 2016-08-08 11:56:03,987 DEBUG [cloud.host.Status] 
(AgentTaskPool-16:ctx-8b5b6956) Transition:[Resource state = Enabled, Agent 
event = AgentDisconnected, Host id = 44692, name = virt-hv041]
srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Forwarding 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
 to 345049010805
srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] 
(Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805. 
agent: 44692, pdu seq: 15, pdu ack seq: 0, json: 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,002 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) 
Dispatch ->44692, json: 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,002 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) 
Intercepting command for agent change: agent 44692 event: AgentDisconnected
srv01 2016-08-08 11:56:04,002 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) 
Received agent disconnect event for host 44692
srv02 2016-08-08 11:56:04,002 DEBUG [cloud.cluster.ClusterManagerImpl] 
(Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805 
completed. time: 3ms. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,004 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Not 
processing AgentDisconnected event for the host id=44692 as the host is 
directly connected to the current management server 345049010805
srv02 2016-08-08 11:56:04,004 WARN  [cloud.ha.HighAvailabilityManagerImpl] 
```

[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread koushik-das
Github user koushik-das commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@marcaurele Can you share the MS logs for this issue? We need to understand
the exact cause of the VM restart. When an agent/host is detected as 'Down',
CS tries to check whether the VMs on it are alive; if a VM is found alive,
nothing is done to it.

Also, if you think that the host got disconnected intermittently, there
are ways to adjust the timeout in CS after which it starts investigating
the host status. Try adjusting the ping.timeout configuration parameter to see
if the issue is resolved.

The check of whether a VM is alive is done for all VMs, HA-enabled or not.
If a host is really down, it makes sense to mark its VMs as stopped.
Additionally, HA-enabled VMs are restarted on other hosts in the cluster after
they are successfully fenced off.
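For reference, a global setting such as `ping.timeout` can be changed through
the updateConfiguration API, e.g. via CloudMonkey. The value below is purely
illustrative, and note that `ping.timeout` acts as a multiplier applied to
`ping.interval` rather than an absolute number of seconds, so verify the
semantics for your CloudStack version:

```shell
# Illustrative only: raise the ping timeout multiplier so that short
# network glitches do not trigger host-down investigation as quickly.
cmk update configuration name=ping.timeout value=3.0

# Some global settings only take effect after a management server restart.
```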




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread blueorangutan
Github user blueorangutan commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
Packaging result: ✔centos6 ✔centos7 ✔debian repo: 
http://packages.shapeblue.com/cloudstack/pr/1640
Job ID-89




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread blueorangutan
Github user blueorangutan commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you 
posted as I make progress.




[GitHub] cloudstack issue #1640: CLOUDSTACK-9458: Fix HA bug when VMs are stopped on ...

2016-08-17 Thread rhtyd
Github user rhtyd commented on the issue:

https://github.com/apache/cloudstack/pull/1640
  
@blueorangutan package

LGTM, @abhinandanprateek can you review this as well?

