[jira] [Commented] (CLOUDSTACK-2918) In a scaled up environment, hosts fail to come up after Management server restart in clustered set up

Koushik Das (JIRA) Mon, 19 Aug 2013 05:05:32 -0700

    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743758#comment-13743758
 ]


Koushik Das commented on CLOUDSTACK-2918:
-----------------------------------------

Re-test this after the issue in CLOUDSTACK-4371 is fixed. I have updated 4371 
with my comments. If the issue exists even after 4371 is addressed please 
attach the logs as well.
                
> In a scaled up environment, hosts fail to come up after Management server 
> restart in clustered set up
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-2918
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2918
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.2.0
>         Environment: Load Test environment with simulator
>            Reporter: Sowmya Krishnan
>            Assignee: Koushik Das
>             Fix For: 4.2.0
>
>
> Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered 
> setup with 3 management servers.
> After restarting one or all the Management servers, re balancing hosts takes 
> very long time, and the rebalanced hosts don't come to Up state either.
> Following is the error message from logs when the MS was starting up:
> 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Notification-1:null) Joining node, IP: 10.223.48.2, msid: 
> 206915885094132
> 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Notification-1:null) Joining node, IP: 10.223.48.66, msid: 
> 206915885097283
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] 
> (Cluster-Notification-1:null) Receive cluster alert, EventArgs: 
> com.cloud.cluster.ClusterNodeJoinEventArgs
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] 
> (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 
> 10.223.48.2, msidL: 206915885094132
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] 
> (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 
> 10.223.48.66, msidL: 206915885097283
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] 
> (Cluster-Notification-1:null) Management server node 10.223.48.66 is up, send 
> alert
> 2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Notification-1:null) Notifying management server join event took 178 
> ms
> 2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 5583ms, profilerH
> eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration: 
> 560ms, profilerAgentLB: Done. Duration: 4836ms
> 2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Detected management node joined, id:3, 
> nodeIP:10.223.48.130
> 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Notification-1:null) Notify management server node join to listeners.
> 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Notification-1:null) Joining node, IP: 10.223.48.130, msid: 
> 206915885093830
> 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] 
> (Cluster-Notification-1:null) Receive cluster alert, EventArgs: 
> com.cloud.cluster.ClusterNodeJoinEventArgs
> 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] 
> (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 
> 10.223.48.130, msidL: 206915885093830
> 2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Notification-1:null) Notifying management server join event took 0 ms
> 2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1675ms, profilerH
> eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration: 
> 179ms, profilerAgentLB: Done. Duration: 1361ms
> 2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1888ms, profilerH
> eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms, 
> profilerAgentLB: Done. Duration: 1424ms
> 2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1643ms, profilerH
> eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, 
> profilerAgentLB: Done. Duration: 1501ms
> 2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1675ms, profilerH
> eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms, 
> profilerAgentLB: Done. Duration: 1495ms
> 2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Running Capacity Checker ...
> 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) recalculating system capacity
> 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Executing cpu/ram capacity update
> 2013-05-31 00:00:48,883 DEBUG 
> [network.router.VirtualNetworkApplianceManagerImpl] 
> (RouterStatusMonitor-1:null) Found 0 routers to update status.
> 2013-05-31 00:00:48,901 DEBUG 
> [network.router.VirtualNetworkApplianceManagerImpl] 
> (RouterStatusMonitor-1:null) Found 0 networks to update RvR status.
> 2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate: Done. 
> Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: 
> Done. Duration: 1416ms
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate cpu capacity, host:2 usedCpu: 500 
> reservedCpu: 0
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 
> 268435456 reservedMem: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 
> reservedCpu: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 
> 1073741824 reservedMem: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 
> reservedCpu: 0
> ...
> ...
> ...
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 
> 268435456 reservedMem: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 
> reservedCpu: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 
> 1073741824 reservedMem: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 
> reservedCpu: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate memory capacity, host:5 usedMem: 
> 0 reservedMem: 0
> 2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate cpu capacity, host:6 usedCpu: 0 
> reservedCpu: 0
> ...
> ...
> ...
> 2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl] 
> (CapacityChecker:null) No need to calibrate memory capacity, host:20002 
> usedMem: 0 reservedMem: 0
> 2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Done executing cpu/ram capacity update
> 2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Executing storage capacity update
> 2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl] 
> (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity 
> type - 9 , DataCenterId - 1, HostOrPoolId - 20
> 0, PodId 1
> 2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl] 
> (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity 
> type - 9 , DataCenterId - 1, HostOrPoolId - 20
> 1, PodId 1
> 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Done executing storage capacity update
> 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Executing capacity updates for public ip and Vlans
> 2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate: Done. 
> Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: 
> Done. Duration: 1441ms
> 2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate: Done. 
> Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: 
> Done. Duration: 1408ms
> 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Done capacity updates for public ip and Vlans
> 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] 
> (CapacityChecker:null) Executing capacity updates for private ip
> 2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate: Done. 
> Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: 
> Done. Duration: 1925ms
> ...
> ...
> ...
> 2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate: Done. 
> Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: 
> Done. Duration: 1963ms
> 2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate: Done. 
> Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: 
> Done. Duration: 1537ms
> 2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl] 
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to 
> finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate: Done. 
> Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. 
> Duration: 2031ms
> All 3 Management servers have the same above message and the hosts continue 
> to be in disconnected state
> mysql> select count(*), status, type, mgmt_server_id from host group by 
> mgmt_server_id, status, type;
> +----------+--------------+--------------------+-----------------+
> | count(*) | status | type | mgmt_server_id |
> +----------+--------------+--------------------+-----------------+
> | 1 | Alert | SecondaryStorage | NULL |
> | 17854 | Disconnected | Routing | NULL |
> | 1 | Disconnected | SecondaryStorageVM | NULL |
> | 53 | Disconnected | Routing | 206915885093830 |
> | 1485 | Disconnected | Routing | 206915885094132 |
> | 608 | Disconnected | Routing | 206915885097283 |
> +----------+--------------+--------------------+-----------------+
> 6 rows in set (0.07 sec)
> Observed this issue even with lesser hosts ~5K 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CLOUDSTACK-2918) In a scaled up environment, hosts fail to come up after Management server restart in clustered set up

Reply via email to