[jira] [Commented] (CLOUDSTACK-9350) Local storage hosts get HA tasks, cause issues

ASF GitHub Bot (JIRA) Sun, 08 May 2016 21:09:53 -0700

    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-9350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275915#comment-15275915
 ]


ASF GitHub Bot commented on CLOUDSTACK-9350:
--------------------------------------------

Github user swill commented on the pull request:

    https://github.com/apache/cloudstack/pull/1496#issuecomment-217772228
  
    
    
    ### CI RESULTS
    
    ```
    Tests Run: 88
      Skipped: 2
       Failed: 1
       Errors: 1
     Duration: 11h 25m 09s
    ```
    
    **Summary of the problem(s):**
    ```
    ERROR: Test to verify access to loadbalancer haproxy admin stats page
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/data/git/cs1/cloudstack/test/integration/smoke/test_internal_lb.py", line 854, in tearDown
        raise Exception("Cleanup failed with %s" % e)
    Exception: Cleanup failed with Job failed: {jobprocstatus : 0, created : u'2016-05-07T12:50:26+0200', jobresult : {errorcode : 530, errortext : u'Failed to delete network'}, cmd : u'org.apache.cloudstack.api.command.user.network.DeleteNetworkCmd', userid : u'b90ec272-1410-11e6-9152-5254001daa61', jobstatus : 2, jobid : u'04de60c8-0aa7-4488-a076-a7475b147b47', jobresultcode : 530, jobresulttype : u'object', jobinstancetype : u'Network', accountid : u'b90e9c7d-1410-11e6-9152-5254001daa61'}
    ----------------------------------------------------------------------
    Additional details in: /tmp/MarvinLogs/test_network_9UCT1L/results.txt
    ```
    
    ```
    FAIL: Test create, assign, remove of an Internal LB with roundrobin http traffic to 3 vm's in a Single VPC
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/data/git/cs1/cloudstack/test/integration/smoke/test_internal_lb.py", line 599, in test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80
        self.execute_internallb_roundrobin_tests(vpc_offering)
      File "/data/git/cs1/cloudstack/test/integration/smoke/test_internal_lb.py", line 668, in execute_internallb_roundrobin_tests
        self.setup_http_daemon(vm)
      File "/data/git/cs1/cloudstack/test/integration/smoke/test_internal_lb.py", line 519, in setup_http_daemon
        self.fail("Failed to ssh into vm: %s due to %s" % (vm, e))
    AssertionError: Failed to ssh into vm: <marvin.lib.base.VirtualMachine instance at 0x3624170> due to not all arguments converted during string formatting
    ----------------------------------------------------------------------
    Additional details in: /tmp/MarvinLogs/test_network_9UCT1L/results.txt
    ```
    
    
    
    **Associated Uploads**
    
    **`/tmp/MarvinLogs/DeployDataCenter__May_07_2016_07_03_57_IONYWP:`**
    * [dc_entries.obj](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/DeployDataCenter__May_07_2016_07_03_57_IONYWP/dc_entries.obj)
    * [failed_plus_exceptions.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/DeployDataCenter__May_07_2016_07_03_57_IONYWP/failed_plus_exceptions.txt)
    * [runinfo.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/DeployDataCenter__May_07_2016_07_03_57_IONYWP/runinfo.txt)
    
    **`/tmp/MarvinLogs/test_host_ha_XQC3Z6:`**
    * [failed_plus_exceptions.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_host_ha_XQC3Z6/failed_plus_exceptions.txt)
    * [results.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_host_ha_XQC3Z6/results.txt)
    * [runinfo.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_host_ha_XQC3Z6/runinfo.txt)
    
    **`/tmp/MarvinLogs/test_network_9UCT1L:`**
    * [failed_plus_exceptions.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_network_9UCT1L/failed_plus_exceptions.txt)
    * [results.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_network_9UCT1L/results.txt)
    * [runinfo.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_network_9UCT1L/runinfo.txt)
    
    **`/tmp/MarvinLogs/test_vpc_routers_UTADLF:`**
    * [failed_plus_exceptions.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_vpc_routers_UTADLF/failed_plus_exceptions.txt)
    * [results.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_vpc_routers_UTADLF/results.txt)
    * [runinfo.txt](https://objects-east.cloud.ca/v1/e465abe2f9ae4478b9fff416eab61bd9/PR1496/tmp/MarvinLogs/test_vpc_routers_UTADLF/runinfo.txt)
    
    
    Uploads will be available until `2016-07-09 02:00:00 +0200 CEST`
    
    *Comment created by [`upr comment`](https://github.com/cloudops/upr).*



> Local storage hosts get HA tasks, cause issues        
> -----------------------------------------------
>
>                 Key: CLOUDSTACK-9350
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9350
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>    Affects Versions: 4.5.1
>            Reporter: Abhinandan Prateek
>            Assignee: Abhinandan Prateek
>
> When a host hits its ping timeout, for whatever reason, the investigators 
> are triggered. The KVMInvestigator sends a CheckOnHostCommand to the target 
> host, and then to all the remaining neighbor hosts in the cluster. The 
> CheckOnHostCommand (and also FenceCommand, the code is nearly identical) is 
> processed by the KVM agent and simply scans through all NFS primary storage 
> looking for the host's heartbeat in the KVMHA directory. If no heartbeat file 
> is found, it fails the check. In the case of clusters that are local-only, 
> these hosts will always fail the check, whether it be the target host or a 
> neighbor checking on the target. This triggers a host 'down' event, which 
> triggers HA tasks. The HA tasks will attempt to stop any VMs on the host, and 
> then if the VM's offering is HA-enabled it will try to restart the VM.
> Our recent issue was that a management server took extraordinarily long to 
> rotate its logs and was slow to process some host pings. The 
> CheckOnHostCommand was sent to a suspect host, which failed because it had no 
> primary NFS. The neighbor checks also failed to check the suspect host's 
> heartbeat for the same reason. Then the host was marked as down and all VMs 
> were stopped. Multiply this by a few dozen hosts.
> The immediate fix, provided in the example, is a patch to KVMInvestigator 
> which will only attempt investigation if the host's cluster has NFS storage, 
> which is a requirement for the host to run the check, as described above. If 
> there is none, the host state is determined to be disconnected rather than 
> down. This means that the host will still end up in alert state and need 
> manual investigation, but there will be no attempt to stop or HA the VMs.
> Additionally, the patch catches scenarios where a cluster might have both NFS 
> and local storage and a host ends up in 'down' state. In this case, when the 
> HA tasks are being created, if a VM is using local storage then the HA task 
> generation is skipped. This VM can't be started anywhere else.
> We could also make the agent side more robust: in KVMHAChecker we may not 
> want to return 'false' when zero pools are passed to check for an HA 
> heartbeat. Then again, maybe we do. We decided initially to patch just the 
> server side, because it is easier to deploy.
> In the long run, I'd hope that the current HA work would supersede the 
> current KVMInvestigator and take the cluster's ability to pass any defined 
> checks into account before checking.
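
The decision logic described in the issue can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the actual patch: the real KVMInvestigator is Java code in the management server, and every name below (Status, heartbeat_found, investigate, should_create_ha_task) is hypothetical. The target and neighbor checks are collapsed into a single scan, since per the description each checker performs the same heartbeat lookup on NFS primary storage.

```python
from enum import Enum
from typing import Iterable, Set


class Status(Enum):
    UP = "Up"
    DOWN = "Down"
    DISCONNECTED = "Disconnected"


def heartbeat_found(nfs_pools: Iterable[str], pools_with_heartbeat: Set[str]) -> bool:
    """Simplified agent-side check: scan each NFS pool for the host's heartbeat.

    With zero NFS pools there is nothing to scan, so the check fails. This is
    why local-only clusters always fail it in the scenario described above.
    """
    return any(pool in pools_with_heartbeat for pool in nfs_pools)


def investigate(nfs_pools: Iterable[str], pools_with_heartbeat: Set[str]) -> Status:
    """Hypothetical server-side decision following the fix in the description."""
    pools = list(nfs_pools)
    if not pools:
        # No NFS storage in the cluster: the heartbeat check cannot run, so
        # report Disconnected instead of Down and let an operator investigate.
        return Status.DISCONNECTED
    return Status.UP if heartbeat_found(pools, pools_with_heartbeat) else Status.DOWN


def should_create_ha_task(vm_uses_local_storage: bool) -> bool:
    """Skip HA task generation for VMs on local storage.

    Such a VM cannot be started on any other host.
    """
    return not vm_uses_local_storage
```

In this model a local-only cluster (an empty nfs_pools list) ends up as Disconnected, so no VM stop or restart is attempted, while a mixed cluster whose host does go Down still skips HA task generation for VMs on local storage.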



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
