Abhinandan Prateek created CLOUDSTACK-9857:
----------------------------------------------
Summary: CloudStack KVM Agent Self Fencing - improper systemd config
Key: CLOUDSTACK-9857
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9857
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Components: KVM
Affects Versions: 4.5.2
Reporter: Abhinandan Prateek
Assignee: Abhinandan Prateek
Priority: Critical
Fix For: 4.10.0.0
We had a database outage a few days ago and noticed that most of the CloudStack
KVM agents shut themselves down (self-fenced) and never retried connecting.
Moreover, we had Puppet configured to restart the cloudstack-agent daemon when
it goes into the failed state, but apparently the service never enters the
“failed” state.
2017-03-30 04:07:50,720 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null)
Request:Seq -1--1: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 111,
[{"com.cloud.agent.api.ReadyCommand":{"_details":"com.cloud.utils.exception.CloudRuntimeException:
DB Exception on: null","wait":0}}] }
2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null)
Processing command: com.cloud.agent.api.ReadyCommand
2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null)
Not ready to connect to mgt server:
com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null
2017-03-30 04:07:50,722 INFO [cloud.agent.Agent] (AgentShutdownThread:null)
Stopping the agent: Reason = sig.kill
2017-03-30 04:07:50,723 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null)
Sending shutdown to management server
Whatever the agent's reason for fencing itself, systemd did not register the
exit properly.
Here is what the status of cloudstack-agent looks like:
[root@mqa6-kvm02 ~]# service cloudstack-agent status
● cloudstack-agent.service - SYSV: Cloud Agent
Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
Active: active (exited) since Fri 2017-03-31 23:50:47 GMT; 12s ago
Docs: man:systemd-sysv-generator(8)
Process: 632 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited,
status=0/SUCCESS)
Process: 654 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited,
status=0/SUCCESS)
Main PID: 441
Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Starting SYSV: Cloud Agent...
Mar 31 23:50:47 mqa6-kvm02 cloudstack-agent[654]: Starting Cloud Agent:
Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Started SYSV: Cloud Agent.
Mar 31 23:50:49 mqa6-kvm02 sudo[806]: root : TTY=unknown ; PWD=/ ;
USER=root ; COMMAND=/bin/grep InitiatorName= /etc/iscsi/initiatorname.iscsi
The "Active: active (exited)" line should instead read "Active: failed (Result: exit-code)".
Solution:
The fix is to add a pidfile header to /etc/init.d/cloudstack-agent,
like so:
# chkconfig: 35 99 10
# description: Cloud Agent
+ # pidfile: /var/run/cloudstack-agent.pid
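For hosts that need the same header applied by hand, the change can be sketched as a small shell helper. This is only an illustration, not part of the actual patch: the function name add_pidfile_header is hypothetical, and it assumes the init script has a single-line "# description:" header as shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the CloudStack patch): insert a
# "# pidfile:" header right after the "# description:" line of an init
# script, unless such a header is already present.
add_pidfile_header() {
    script="$1"    # path to the init script
    pidfile="$2"   # pidfile path to advertise to systemd-sysv-generator

    # Idempotent: do nothing when a pidfile header already exists.
    if grep -q '^# pidfile:' "$script"; then
        return 0
    fi

    # Copy every line and add the header after the first description line.
    awk -v pf="$pidfile" '
        { print }
        /^# description:/ && !done { print "# pidfile: " pf; done = 1 }
    ' "$script" > "$script.tmp" || return 1

    # Write back in place so the script keeps its permissions.
    cat "$script.tmp" > "$script" && rm -f "$script.tmp"
}
```

A call would look like `add_pidfile_header /etc/init.d/cloudstack-agent /var/run/cloudstack-agent.pid`. Since the unit is produced by systemd-sysv-generator, a `systemctl daemon-reload` is probably needed afterwards for the new header to take effect.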
After that, if the agent dies, systemd detects it properly and the status looks
as expected:
[root@mqa6-kvm02 ~]# service cloudstack-agent status
● cloudstack-agent.service - SYSV: Cloud Agent
Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
Active: failed (Result: exit-code) since Fri 2017-03-31 23:51:40 GMT; 7s ago
Docs: man:systemd-sysv-generator(8)
Process: 1124 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited,
status=255)
Process: 949 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited,
status=0/SUCCESS)
Main PID: 975
With this change, other tools can properly inspect the state of the daemon and
take action when it has failed, instead of seeing it in the active (exited)
state.
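As an illustration of what such a tool could do once the failed state is reported correctly, here is a minimal watchdog sketch. It is not what Puppet does internally: the function name restart_if_failed and the SYSTEMCTL override variable are assumptions made for this example, and only the standard `systemctl is-failed` and `systemctl restart` commands are used.

```shell
#!/bin/sh
# Minimal watchdog sketch (hypothetical, for illustration only).
# SYSTEMCTL can be overridden, e.g. for testing without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_if_failed() {
    unit="$1"
    # "systemctl is-failed" exits 0 only when the unit is in the failed
    # state, which the pidfile header now makes possible for this service.
    if "$SYSTEMCTL" is-failed --quiet "$unit"; then
        echo "$unit is failed, restarting"
        "$SYSTEMCTL" restart "$unit"
    else
        echo "$unit is not in the failed state"
    fi
}
```

Running `restart_if_failed cloudstack-agent` from cron or a similar scheduler would restart the agent only after systemd has marked it failed. Without the pidfile header the unit stays in active (exited), so this check would never trigger.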