[jira] [Commented] (CLOUDSTACK-4371) [Performance Testing] Basic zone with 20K Hosts, management server restart leaves the hosts in disconnected state for very long time

Koushik Das (JIRA) Mon, 19 Aug 2013 04:26:32 -0700

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743736#comment-13743736 ]


Koushik Das commented on CLOUDSTACK-4371:
-----------------------------------------

I verified with XenServer (XS) and found that the local storage pool is not 
created on every host reconnect. The pool is added only when the host connects 
for the first time (provided local storage is enabled at the zone level). On 
host reconnect there is a check for whether the local pool already exists, and 
if it does, creation is skipped.

So, based on the exception, this looks like a simulator setup issue.
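
As a rough illustration of the reconnect check described above, here is a 
minimal sketch. It is not the actual CloudStack code: the class 
LocalPoolRegistry and the method ensureLocalPool are hypothetical names, and 
the duplicate-UUID branch is only an assumption about one way the exception 
in the log could be reached (for example, if two simulated hosts reported the 
same local pool UUID).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are real CloudStack classes.
final class LocalPoolRegistry {
    // Maps a local pool UUID to the id of the host that registered it.
    private final Map<String, Long> poolOwnerByUuid = new ConcurrentHashMap<>();

    /**
     * Intended to be called on every host connect or reconnect.
     * Returns true if the pool was created, false if creation was skipped
     * because this host already owns a pool with that UUID.
     */
    boolean ensureLocalPool(long hostId, String poolUuid) {
        Long owner = poolOwnerByUuid.putIfAbsent(poolUuid, hostId);
        if (owner == null) {
            return true;   // first connect: pool registered
        }
        if (owner == hostId) {
            return false;  // reconnect: pool already exists, skip creation
        }
        // Assumed failure mode: a different host reports the same UUID.
        throw new IllegalStateException(
                "Another active pool with the same uuid already exists");
    }
}
```

With this sketch, a first call for a host creates the pool, a repeated call 
for the same host is a no-op, and only a second host presenting the same UUID 
fails.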


                
> [Performance Testing] Basic zone with 20K Hosts, management server restart 
> leaves the hosts in disconnected state for very long time
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-4371
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4371
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: Basic zone, with over 20K simulator hosts
>            Reporter: Sowmya Krishnan
>              Labels: performance
>             Fix For: 4.2.0
>
>         Attachments: ms1_restartfail.log.gz, ms2_restartfail.log.gz, 
> ms3_restartfail.log.gz
>
>
> Basic zone performance test bed:
> 20K simulator hosts,
> 3 Management servers
> 1 host/cluster
> Local storage
> Java heap size: 12GB
> db.cloud.maxActive=2000
> direct.agent.load.size=1000
> agent.lb.enabled=true
> Deploy around 20K simulator hosts with 3 clustered Management servers
> (no VMs deployed yet)
> After all hosts are deployed, stop all 3 Management servers and then start 
> all 3 one after another
> Result
> =====
> Hosts do not reach the Connected state even after 10 minutes. Around 2K of 
> them go into the Alert state while the rest remain in the Disconnected state.
> mysql> select count(*), status, resource_state, type, mgmt_server_id from host group by mgmt_server_id, status, type, resource_state;
> +----------+--------------+----------------+--------------------+----------------+
> | count(*) | status       | resource_state | type               | mgmt_server_id |
> +----------+--------------+----------------+--------------------+----------------+
> |     1946 | Alert        | Enabled        | Routing            |           NULL |
> |    18054 | Disconnected | Enabled        | Routing            |           NULL |
> |        1 | Disconnected | Enabled        | SecondaryStorageVM |           NULL |
> +----------+--------------+----------------+--------------------+----------------+
> 3 rows in set (0.11 sec)
> MS logs show a lot of storage pool exceptions while hosts try to connect:
> 2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] 
> (AgentTaskPool-12:null) Seq 13-32440322: Sending  { Cmd , MgmtId: 
> 206915885094132, via: 13, Ver: v1, Flags: 100011, 
> [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
> 2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] 
> (AgentTaskPool-12:null) Seq 13-32440322: Executing:  { Cmd , MgmtId: 
> 206915885094132, via: 13, Ver: v1, Flags: 100011, 
> [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
> 2013-08-16 05:49:25,592 DEBUG [xen.discoverer.XcpServerDiscoverer] 
> (AgentTaskPool-14:null) Not XenServer so moving on.
> 2013-08-16 05:49:25,592 DEBUG [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-14:null) Sending Connect to listener: 
> DeploymentPlanningManagerImpl_EnhancerByCloudStack_76f3d8e4
> 2013-08-16 05:49:25,591 DEBUG [cloud.resource.AgentResourceBase] 
> (ClusteredAgentManager Timer:null) Deserializing simulated agent on reconnect
> 2013-08-16 05:49:25,594 INFO  [network.security.SecurityGroupListener] 
> (AgentTaskPool-12:null) Scheduled network rules cleanup, interval=2028
> 2013-08-16 05:49:25,594 INFO  [network.security.SecurityGroupListener] 
> (AgentTaskPool-12:null) Received a host startup notification
> 2013-08-16 05:49:25,595 DEBUG [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-12:null) Sending Connect to listener: StoragePoolMonitor
> ...
> ...
> 2013-08-16 05:49:25,761 DEBUG [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-12:null) Sending Connect to listener: 
> ClusteredVirtualMachineManagerImpl_EnhancerByCloudStack_b5459b7b
> 2013-08-16 05:49:25,764 DEBUG [cloud.vm.VirtualMachineManagerImpl] 
> (AgentTaskPool-12:null) Found 0 VMs for host 13
> 2013-08-16 05:49:25,765 DEBUG [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-12:null) Sending Connect to listener: LocalStoragePoolListener
> 2013-08-16 05:49:25,768 DEBUG 
> [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl] 
> (AgentTaskPool-12:null) createPool Params @ scheme - Filesystem storageHost - 
> 172.1.3.131 hostPath - /mnt/2a2463b4-4fd2-4ac7-ad3f-040a3046e478 port - -1
> 2013-08-16 05:49:25,771 DEBUG 
> [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl] 
> (AgentTaskPool-12:null) Another active pool with the same uuid already exists
> 2013-08-16 05:49:25,772 WARN  [cloud.storage.StorageManagerImpl] 
> (AgentTaskPool-12:null) Unable to setup the local storage pool for 
> Host[-13-Routing]
> com.cloud.utils.exception.CloudRuntimeException: Another active pool with the 
> same uuid already exists
>         at 
> org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.initialize(CloudStackPrimaryDataStoreLifeCycleImpl.java:319)
>         at 
> com.cloud.storage.StorageManagerImpl.createLocalStorage(StorageManagerImpl.java:647)
>         at 
> com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>         at 
> com.cloud.storage.LocalStoragePoolListener.processConnect(LocalStoragePoolListener.java:86)
>         at 
> com.cloud.agent.manager.AgentManagerImpl.notifyMonitorsOfConnection(AgentManagerImpl.java:587)
>         at 
> com.cloud.agent.manager.AgentManagerImpl.handleDirectConnectAgent(AgentManagerImpl.java:1479)
>         at 
> com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1739)
>         at 
> com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1901)
>         at 
> com.cloud.agent.manager.AgentManagerImpl$SimulateStartTask.run(AgentManagerImpl.java:1130)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:679)
> 2013-08-16 05:49:25,773 INFO  [utils.exception.CSExceptionErrorCode] 
> (AgentTaskPool-12:null) Could not find exception: 
> com.cloud.exception.ConnectionException in error code list for exceptions
> 2013-08-16 05:49:25,773 WARN  [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-12:null) Monitor LocalStoragePoolListener says there is an 
> error in the connect process for 13 due to Unable to setup the local storage 
> pool for Host[-13-Routing]
> 2013-08-16 05:49:25,773 INFO  [agent.manager.AgentManagerImpl] 
> (AgentTaskPool-12:null) Host 13 is disconnecting with event AgentDisconnected
