True, but I thought that since the mgt server sends a disconnect to the SSVM, it would reset the interface on which the SSVM is connecting.
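For reference, the disconnect in the logs starts when the AgentMonitor thread (AgentMonitor-1) reports "Found the following agents behind on ping: [183]". As far as we understand CloudStack's agent monitor, an agent gets flagged when its last ping is older than `ping.interval` × `ping.timeout` seconds. These are global settings, and we believe their defaults are 60 and 2.5. After that, the SSVM/CPVM agent is disconnected.

The Python below is a simplified model of that check under those assumptions. It is not CloudStack's own code. The function names and the `last_ping` dictionary are hypothetical.

```python
import math
from typing import Dict, List

# Assumed defaults for the CloudStack global settings ping.interval (seconds)
# and ping.timeout (a multiplier); check the values on your own install.
DEFAULT_PING_INTERVAL_S = 60
DEFAULT_PING_TIMEOUT_MULTIPLIER = 2.5


def ping_cutoff_seconds(interval_s: float = DEFAULT_PING_INTERVAL_S,
                        multiplier: float = DEFAULT_PING_TIMEOUT_MULTIPLIER) -> int:
    """Seconds without a ping before an agent counts as 'behind on ping'."""
    return math.ceil(interval_s * multiplier)


def agents_behind_on_ping(last_ping: Dict[int, float], now: float,
                          interval_s: float = DEFAULT_PING_INTERVAL_S,
                          multiplier: float = DEFAULT_PING_TIMEOUT_MULTIPLIER) -> List[int]:
    """Return the host ids whose last ping timestamp is older than the cutoff."""
    cutoff = now - ping_cutoff_seconds(interval_s, multiplier)
    return sorted(host for host, ts in last_ping.items() if ts < cutoff)


if __name__ == "__main__":
    now = 10_000.0
    # Host 183 (the SSVM in the logs) last pinged 200 s ago; host 7 pinged 30 s ago.
    pings = {183: now - 200, 7: now - 30}
    print(ping_cutoff_seconds())               # 150
    print(agents_behind_on_ping(pings, now))   # [183]
```

If your `ping.interval` or `ping.timeout` values differ, pass them in to see how long an agent can stay silent before it is flagged.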
Sent from my iPhone

> On 25-Jul-2019, at 4:57 PM, Andrija Panic <[email protected]> wrote:
>
> In your previous mail, I understood that you used the OS tool "ping" and
> were NOT referring to internal ACS pings?
> "Out of all these, the ping drops were observed from MGT server to ssvm and
> mgt server to nodes. Basically all nodes lost connection. Then it recovered
> itself after 1 minute."
>
>
>> On Thu, 25 Jul 2019 at 16:55, Rakesh v <[email protected]> wrote:
>>
>> The ping between the mgt server and the SSVM fails because the mgt server
>> sends a disconnect message to all nodes. If you look at the logs I pasted in
>> my first email, the mgt server thinks the SSVM is lagging behind on ping and
>> sends a disconnect message to all nodes without investigating. Also, it
>> happens at the beginning of every hour.
>>
>>
>> So I'm sure the network is not the issue here.
>>
>> Sent from my iPhone
>>
>>> On 25-Jul-2019, at 4:46 PM, Andrija Panic <[email protected]> wrote:
>>>
>>> Since basic network connectivity (ping failures) was down between the mgmt
>>> servers and the nodes (and the SSVM on them), I would point my finger at
>>> your networking equipment, i.e. I expect zero problems with ACS (since
>>> pings fail).
>>>
>>> Let us know how it goes.
>>>
>>> Andrija
>>>
>>>> On Thu, 25 Jul 2019 at 16:04, Rakesh v <[email protected]> wrote:
>>>>
>>>> Yes, I was monitoring it continuously. Below are the steps I was running
>>>> when the issue happened:
>>>>
>>>>
>>>> 1. Ping from MGT server to ssvm
>>>> 2. Ping from ssvm to secondary storage ip
>>>> 3. Ping from ssvm to public IP like 8.8.8.8
>>>> 4. Ping from MGT server to node in which ssvm was running
>>>>
>>>>
>>>> Out of all these, the ping drops were observed from MGT server to ssvm and
>>>> mgt server to nodes. Basically all nodes lost connection. Then it recovered
>>>> itself after 1 minute.
>>>>
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On 25-Jul-2019, at 3:48 PM, Andrija Panic <[email protected]> wrote:
>>>>>
>>>>> Can you observe the status of the SSVM (is it UP/Connecting/Disconnected/Down)
>>>>> while you have issues?
>>>>>
>>>>> I would advise checking your Secondary Storage itself, and also running
>>>>> the SSVM diagnose script /usr/local/cloud/systemvm/ssvm-check.sh and
>>>>> observing whether it reports any errors with NFS or anything else.
>>>>>
>>>>> Lastly (and don't laugh), check that you don't have issues with your
>>>>> networking equipment (some of us had VEEEERY strange connectivity issues
>>>>> some years ago with crappy QCT/Quanta switches in an MLAG setup).
>>>>>
>>>>> Andrija
>>>>>
>>>>>> On Thu, 25 Jul 2019 at 15:42, Rakesh v <[email protected]> wrote:
>>>>>>
>>>>>> Yes, I have set the IPs of the three MGT servers in the "host" field.
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>> On 25-Jul-2019, at 2:14 PM, Pierre-Luc Dion <[email protected]> wrote:
>>>>>>>
>>>>>>> Do you have a load balancer in front of CloudStack? Did you set the
>>>>>>> global setting "host" to the IP of the mgmt server?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 25 Jul 2019 at 03:24, Rakesh Venkatesh <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello People,
>>>>>>>>
>>>>>>>>
>>>>>>>> I have a strange issue where the mgt server times out sending a command
>>>>>>>> to the secondary storage VM every hour, and because of this the UI is
>>>>>>>> inaccessible for a short time. Sometimes I have to restart the mgt server
>>>>>>>> to get it back to a working state, and sometimes I don't. I also see some
>>>>>>>> exceptions while fetching the storage stats.
>>>>>>>>
>>>>>>>>
>>>>>>>> The log says the secondary storage VM is lagging behind the mgt server in
>>>>>>>> ping, and the mgt server then sends a disconnect message to other
>>>>>>>> components. Can you let me know how to troubleshoot this issue?
>>>>>>>> I destroyed the secondary storage VM, but the issue still persists. I
>>>>>>>> checked the date/time on the mgt server and the SSVM, and they are the
>>>>>>>> same. This has been happening for quite a few days now. Below are the logs:
>>>>>>>>
>>>>>>>>
>>>>>>>> 2019-07-25 04:01:22,769 INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-c33dbe74) (logid:5442158c) Found the following agents behind on ping: [183]
>>>>>>>> 2019-07-25 04:01:22,775 WARN [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-c33dbe74) (logid:5442158c) Disconnect agent for CPVM/SSVM due to physical connection close. host: 183
>>>>>>>> 2019-07-25 04:01:22,778 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Host 183 is disconnecting with event ShutdownRequested
>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) The next status of agent 183is Disconnected, current status is Up
>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Deregistering link for 183 with state Disconnected
>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Remove Agent : 183
>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.ConnectedAgentAttache] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Processing Disconnect.
>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Seq 183-7541559051008607242: Sending disconnect to class com.cloud.agent.manager.SynchronousListener
>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.hypervisor.xenserver.discoverer.XcpServerDiscoverer
>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.u.n.NioConnection] (pool-2-thread-1:null) (logid:) Closing socket Socket[addr=/172.30.32.16,port=38250,localport=8250]
>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Waiting some more time because this is the current command
>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.hypervisor.hyperv.discoverer.HypervServerDiscoverer
>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Waiting some more time because this is the current command
>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.deploy.DeploymentPlanningManagerImpl
>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.security.SecurityGroupListener
>>>>>>>> 2019-07-25 04:01:22,783 INFO [c.c.u.e.CSExceptionErrorCode] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: org.apache.cloudstack.engine.orchestration.NetworkOrchestrator
>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.vm.ClusteredVirtualMachineManagerImpl
>>>>>>>> 2019-07-25 04:01:22,783 WARN [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Timed out on null
>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.listener.StoragePoolMonitor
>>>>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Cancelling.
>>>>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.secondary.SecondaryStorageListener
>>>>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.SshKeysDistriMonitor
>>>>>>>> 2019-07-25 04:01:22,785 DEBUG [o.a.c.s.RemoteHostEndPoint] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Failed to send command, due to Agent:183, com.cloud.exception.OperationTimedoutException: Commands 7541559051008607242 to Host 183 timed out after 3600
>>>>>>>> 2019-07-25 04:01:22,785 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.router.VpcVirtualNetworkApplianceManagerImpl
>>>>>>>> 2019-07-25 04:01:22,785 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.download.DownloadListener
>>>>>>>>
>>>>>>>>
>>>>>>>> 2019-07-25 04:01:22,785 ERROR [c.c.s.StatsCollector] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Error trying to retrieve storage stats
>>>>>>>> com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:183, com.cloud.exception.OperationTimedoutException: Commands 7541559051008607242 to Host 183 timed out after 3600
>>>>>>>>     at org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:133)
>>>>>>>>     at com.cloud.server.StatsCollector$StorageCollector.runInContext(StatsCollector.java:1139)
>>>>>>>>     at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>>>>>     at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>> 2019-07-25 04:01:22,786 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.consoleproxy.ConsoleProxyListener
>>>>>>>> 2019-07-25 04:01:22,789 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.LocalStoragePoolListener
>>>>>>>> 2019-07-25 04:01:22,789 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.upload.UploadListener
>>>>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.capacity.StorageCapacityListener
>>>>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.capacity.ComputeCapacityListener
>>>>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.SshKeysDistriMonitor
>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.router.VirtualNetworkApplianceManagerImpl
>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.NetworkUsageManagerImpl$DirectNetworkStatsListener
>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.n.NetworkUsageManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Disconnected called on 183 with status Disconnected
>>>>>>>>
>>>>>>>>
>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.agent.manager.AgentManagerImpl$BehindOnPingListener
>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.agent.manager.AgentManagerImpl$SetHostParamsListener
>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.h.Status] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Transition:[Resource state = Enabled, Agent event = ShutdownRequested, Host id = 183, name = s-2775-VM]
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and regards
>>>>>>>> Rakesh venkatesh
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Andrija Panić
>>>>
>>>
>>>
>>> --
>>>
>>> Andrija Panić
>>
>
>
> --
>
> Andrija Panić
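Rakesh says the disconnect happens at the beginning of every hour. One way to check that across several days of logs is to extract every "Found the following agents behind on ping" line and count the events by minute past the hour. An hourly trigger should show up as a single spike.

Below is a minimal Python sketch. It assumes the log line format quoted above and the usual log location `/var/log/cloudstack/management/management-server.log`, so adjust the path for your install. The function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple

# Matches the AgentManagerImpl line quoted in the thread, e.g.
# "2019-07-25 04:01:22,769 INFO [c.c.a.m.AgentManagerImpl] ... Found the following agents behind on ping: [183]"
BEHIND_ON_PING = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*"
    r"Found the following agents behind on ping: \[(?P<hosts>[^\]]*)\]"
)


def behind_on_ping_events(lines: Iterable[str]) -> List[Tuple[datetime, List[int]]]:
    """Return (timestamp, host ids) for every 'behind on ping' log line."""
    events = []
    for line in lines:
        match = BEHIND_ON_PING.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            hosts = [int(h) for h in match.group("hosts").split(",") if h.strip()]
            events.append((ts, hosts))
    return events


def minute_of_hour_histogram(events: List[Tuple[datetime, List[int]]]) -> Counter:
    """Count events by minute past the hour; an hourly trigger shows one spike."""
    return Counter(ts.minute for ts, _ in events)


def histogram_for_file(path: str) -> Counter:
    """Convenience wrapper that parses a whole management-server.log file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return minute_of_hour_histogram(behind_on_ping_events(fh))
```

Call `histogram_for_file(...)` with the log path. If nearly all events land on the same minute (like the 04:01 entry above), that supports the "every hour" observation and points toward something scheduled rather than random network loss.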
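The continuous pings Rakesh describes (MGT server to SSVM, MGT server to the node running the SSVM) are easier to compare with the management-server log if each outage is recorded with start and end timestamps.

Below is a rough Python sketch. It assumes Linux iputils `ping`, where `-W` is the reply timeout in seconds. The `TARGETS` map is hypothetical and uses placeholder addresses from the 192.0.2.0/24 documentation range. Run it on the MGT server, and run a copy inside the SSVM for the SSVM-side checks.

```python
import subprocess
import time
from typing import Dict, List, Tuple

# Hypothetical placeholder addresses (documentation range); replace them.
TARGETS: Dict[str, str] = {
    "ssvm": "192.0.2.10",
    "ssvm-host-node": "192.0.2.20",
}

Window = Tuple[float, float]


def probe(address: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo via the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Window]:
    """Collapse (timestamp, ok) samples into (first_failure, recovery) windows.

    A window that is still open at the end is closed at the last sample time.
    """
    windows: List[Window] = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows


def monitor(targets: Dict[str, str], duration_s: float,
            period_s: float = 1.0) -> Dict[str, List[Window]]:
    """Probe every target once per period for duration_s seconds."""
    samples: Dict[str, List[Tuple[float, bool]]] = {name: [] for name in targets}
    deadline = time.time() + duration_s
    while time.time() < deadline:
        for name, address in targets.items():
            samples[name].append((time.time(), probe(address)))
        time.sleep(period_s)
    return {name: outage_windows(s) for name, s in samples.items()}
```

For example, `monitor(TARGETS, duration_s=2 * 3600)` covers two hourly events. Convert the returned epoch timestamps with `datetime.fromtimestamp` and compare them with the "behind on ping" entries to see whether the ping drops start before the disconnect or only after it.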
