[
https://issues.apache.org/jira/browse/CLOUDSTACK-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743758#comment-13743758
]
Koushik Das commented on CLOUDSTACK-2918:
-----------------------------------------
Re-test this after the issue in CLOUDSTACK-4371 is fixed. I have updated 4371
with my comments. If the issue exists even after 4371 is addressed please
attach the logs as well.
> In a scaled up environment, hosts fail to come up after Management server
> restart in clustered set up
> -----------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-2918
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2918
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.2.0
> Environment: Load Test environment with simulator
> Reporter: Sowmya Krishnan
> Assignee: Koushik Das
> Fix For: 4.2.0
>
>
> Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered
> setup with 3 management servers.
> After restarting one or all the Management servers, re balancing hosts takes
> very long time, and the rebalanced hosts don't come to Up state either.
> Following is the error message from logs when the MS was starting up:
> 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Joining node, IP: 10.223.48.2, msid:
> 206915885094132
> 2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Joining node, IP: 10.223.48.66, msid:
> 206915885097283
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Receive cluster alert, EventArgs:
> com.cloud.cluster.ClusterNodeJoinEventArgs
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Handle cluster node join alert, joined node:
> 10.223.48.2, msidL: 206915885094132
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Handle cluster node join alert, joined node:
> 10.223.48.66, msidL: 206915885097283
> 2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Management server node 10.223.48.66 is up, send
> alert
> 2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Notifying management server join event took 178
> ms
> 2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 5583ms, profilerH
> eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration:
> 560ms, profilerAgentLB: Done. Duration: 4836ms
> 2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Detected management node joined, id:3,
> nodeIP:10.223.48.130
> 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Notify management server node join to listeners.
> 2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Joining node, IP: 10.223.48.130, msid:
> 206915885093830
> 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Receive cluster alert, EventArgs:
> com.cloud.cluster.ClusterNodeJoinEventArgs
> 2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Handle cluster node join alert, joined node:
> 10.223.48.130, msidL: 206915885093830
> 2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Notifying management server join event took 0 ms
> 2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1675ms, profilerH
> eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration:
> 179ms, profilerAgentLB: Done. Duration: 1361ms
> 2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1888ms, profilerH
> eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms,
> profilerAgentLB: Done. Duration: 1424ms
> 2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1643ms, profilerH
> eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms,
> profilerAgentLB: Done. Duration: 1501ms
> 2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1675ms, profilerH
> eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms,
> profilerAgentLB: Done. Duration: 1495ms
> 2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Running Capacity Checker ...
> 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) recalculating system capacity
> 2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Executing cpu/ram capacity update
> 2013-05-31 00:00:48,883 DEBUG
> [network.router.VirtualNetworkApplianceManagerImpl]
> (RouterStatusMonitor-1:null) Found 0 routers to update status.
> 2013-05-31 00:00:48,901 DEBUG
> [network.router.VirtualNetworkApplianceManagerImpl]
> (RouterStatusMonitor-1:null) Found 0 networks to update RvR status.
> 2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate: Done.
> Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB:
> Done. Duration: 1416ms
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate cpu capacity, host:2 usedCpu: 500
> reservedCpu: 0
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem:
> 268435456 reservedMem: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500
> reservedCpu: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem:
> 1073741824 reservedMem: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0
> reservedCpu: 0
> ...
> ...
> ...
> 2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem:
> 268435456 reservedMem: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500
> reservedCpu: 0
> 2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem:
> 1073741824 reservedMem: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0
> reservedCpu: 0
> 2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate memory capacity, host:5 usedMem:
> 0 reservedMem: 0
> 2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate cpu capacity, host:6 usedCpu: 0
> reservedCpu: 0
> ...
> ...
> ...
> 2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl]
> (CapacityChecker:null) No need to calibrate memory capacity, host:20002
> usedMem: 0 reservedMem: 0
> 2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Done executing cpu/ram capacity update
> 2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Executing storage capacity update
> 2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl]
> (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity
> type - 9 , DataCenterId - 1, HostOrPoolId - 20
> 0, PodId 1
> 2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl]
> (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity
> type - 9 , DataCenterId - 1, HostOrPoolId - 20
> 1, PodId 1
> 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Done executing storage capacity update
> 2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Executing capacity updates for public ip and Vlans
> 2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate: Done.
> Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB:
> Done. Duration: 1441ms
> 2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate: Done.
> Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB:
> Done. Duration: 1408ms
> 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Done capacity updates for public ip and Vlans
> 2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl]
> (CapacityChecker:null) Executing capacity updates for private ip
> 2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate: Done.
> Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB:
> Done. Duration: 1925ms
> ...
> ...
> ...
> 2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate: Done.
> Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB:
> Done. Duration: 1963ms
> 2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate: Done.
> Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB:
> Done. Duration: 1537ms
> 2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl]
> (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to
> finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate: Done.
> Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done.
> Duration: 2031ms
> All 3 Management servers have the same above message and the hosts continue
> to be in disconnected state
> mysql> select count(*), status, type, mgmt_server_id from host group by
> mgmt_server_id, status, type;
> +----------+--------------+--------------------+-----------------+
> | count(*) | status | type | mgmt_server_id |
> +----------+--------------+--------------------+-----------------+
> | 1 | Alert | SecondaryStorage | NULL |
> | 17854 | Disconnected | Routing | NULL |
> | 1 | Disconnected | SecondaryStorageVM | NULL |
> | 53 | Disconnected | Routing | 206915885093830 |
> | 1485 | Disconnected | Routing | 206915885094132 |
> | 608 | Disconnected | Routing | 206915885097283 |
> +----------+--------------+--------------------+-----------------+
> 6 rows in set (0.07 sec)
> Observed this issue even with lesser hosts ~5K
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira