Aled Sage created BROOKLYN-434:
----------------------------------
Summary: Provisioning of CentOS 7.3 (via jclouds) can leave sshd
in bad state
Key: BROOKLYN-434
URL: https://issues.apache.org/jira/browse/BROOKLYN-434
Project: Brooklyn
Issue Type: Bug
Affects Versions: 0.10.0
Reporter: Aled Sage
**Underlying Problem**
There is a bug in CentOS 7.3 (and RHEL 7.3) related to reloading of the SSH
daemon configuration. See:
* RHEL bug report at https://bugzilla.redhat.com/show_bug.cgi?id=1381997 (which
gives the best description of the underlying problem in
https://bugzilla.redhat.com/show_bug.cgi?id=1381997#c4).
* CentOS bug report at https://bugs.centos.org/view.php?id=12757
* The change introduced the problematic behavior at
https://bugzilla.redhat.com/show_bug.cgi?id=1291172
**Brooklyn Code calling sshd reload**
The problems occurs when jclouds (or
{{org.apache.brooklyn.policy.jclouds.os.CreateUserPolicy}}) modifies
{{/etc/ssh/sshd_config}} and then calls {{service sshd reload}}. This can leave
systemd's view of sshd in a bad state. A subsequent call to {{service sshd
reload}} or {{service sshd restart}} will not have the desired effect!
The jclouds code calls {{service sshd reload}} if it is configured to lock-down
sshd, which is the default - it will configure {{PasswordAuthentication no}}
and {{PermitRootLogin no}} (see
https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/location/jclouds/JcloudsLocation.java#L2293).
Subsequent use of {{CreateUserPolicy}} will also call {{service sshd reload}}
(see
https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/policy/jclouds/os/CreateUserPolicy.java#L159-L164).
**Observed Behaviour: initial jclouds provisioning**
The exact behaviour we see on CentOS 7.3 is non-deterministic. Below we attempt
to describe what can happen.
*TL;DR: systemd thinks sshd is inactive/dead, but the process is really
running.*
On a newly provisioned CentOS 7.3 VM, immediately after we've provisioned it
through jclouds...
Looking at the output from jclouds (having tweaked it so that the stderr isn't
lost), it executed:
{noformat}
exec 3<> /etc/ssh/sshd_config && awk -v TEXT="PasswordAuthentication no
PermitRootLogin no
" 'BEGIN {print TEXT}{print}' /etc/ssh/sshd_config >&3
hash service && service ssh reload || service sshd reload || /etc/init.d/ssh*
reload
{noformat}
The relevant stderr is:
{noformat}
Redirecting to /bin/systemctl reload sshd.service
Redirecting to /bin/systemctl status sshd.service
{noformat}
The status of sshd (according the systemd) is:
{noformat}
$ service sshd status
Redirecting to /bin/systemctl status sshd.service
* sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor
preset: enabled)
Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 5s ago
Docs: man:sshd(8)
man:sshd_config(5)
Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited,
status=0/SUCCESS)
Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited,
status=0/SUCCESS)
Main PID: 10348 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/sshd.service
└─28942 /usr/sbin/sshd
{noformat}
However, the process is really running and we can ssh to it:
{noformat}
$ cat /var/run/sshd.pid
28942
$ ps aux | grep "/usr/sbin/sshd"
root 28942 0.0 0.0 82468 1260 ? Ss 15:16 0:00 /usr/sbin/sshd
{noformat}
**Observed Behaviour: sshd reload again**
*TL;DR: running reload again fails - the process keeps running, but does not
re-load the sshd_config file.*
After running {{service sshd reload}}, if the same command is run again we get:
{noformat}
$ sudo service sshd reload
Redirecting to /bin/systemctl reload sshd.service
Job for sshd.service invalid.
$ echo $?
1
$ cat /var/run/sshd.pid
28942
$ ps aux | grep "/usr/sbin/sshd"
root 28942 0.0 0.0 82468 1260 ? Ss 15:16 0:00 /usr/sbin/sshd
$ service sshd status
Redirecting to /bin/systemctl status sshd.service
* sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor
preset: enabled)
Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 1min 9s ago
Docs: man:sshd(8)
man:sshd_config(5)
Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited,
status=0/SUCCESS)
Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited,
status=0/SUCCESS)
Main PID: 10348 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/sshd.service
└─28942 /usr/sbin/sshd
{noformat}
In {{/var/log/messages}}, we have:
{noformat}
Feb 8 15:17:23 host-10-104-0-188 systemd: Unit sshd.service cannot be reloaded
because it is inactive.
{noformat}
**Observed Behaviour: restart, after a failed reload**
*TL;DR: systemd does not restart the process (same pid) - the existing process
is now recognised as running though, but the {{/etc/ssh/sshd_config}} has not
be reloaded.*
{noformat}
$ sudo service sshd restart
Redirecting to /bin/systemctl restart sshd.service
$ echo $?
0
$ cat /var/run/sshd.pid
28942
$ ps aux | grep "/usr/sbin/sshd"
root 28942 0.0 0.0 82468 1260 ? Ss 15:16 0:00 /usr/sbin/sshd
$ service sshd status
Redirecting to /bin/systemctl status sshd.service
* sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor
preset: enabled)
Active: active (running) since Wed 2017-02-08 15:21:39 UTC; 5s ago
Docs: man:sshd(8)
man:sshd_config(5)
Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited,
status=0/SUCCESS)
Process: 29726 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited,
status=0/SUCCESS)
Main PID: 28942 (sshd)
CGroup: /system.slice/sshd.service
└─28942 /usr/sbin/sshd
Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Starting OpenSSH
server daemon...
Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Started OpenSSH server
daemon.
{noformat}
In {{/var/log/messages}}, we have:
{noformat}
Feb 8 15:21:39 host-10-104-0-188 systemd: Starting OpenSSH server daemon...
Feb 8 15:21:39 host-10-104-0-188 systemd: Started OpenSSH server daemon.
{noformat}
And in {{/var/log/secure}}:
{noformat}
Feb 8 15:21:39 host-10-104-0-188 sudo: amp : TTY=pts/0 ;
PWD=/home/users/amp ; USER=root ; COMMAND=/usr/sbin/service sshd restart
Feb 8 15:21:39 host-10-104-0-188 polkitd[497]: Registered Authentication Agent
for unix-process:29709:569722 (system bus name :1.29 [/usr/bin/pkttyagent
--notify-fd 5 --fallback], object path
/org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8)
Feb 8 15:21:39 host-10-104-0-188 polkitd[497]: Unregistered Authentication
Agent for unix-process:29709:569722 (system bus name :1.29, object path
/org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8)
(disconnected from bus)
Feb 8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on
0.0.0.0 failed: Address already in use.
Feb 8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on ::
failed: Address already in use.
Feb 8 15:21:39 host-10-104-0-188 sshd[29727]: fatal: Cannot bind any address.
{noformat}
My wild guess at what has happened here is...
1. systemd thinks sshd is not running, so did not stop it.
2. systemd starts another sshd process, which fails because process 28942 is
already running and listening on port 22.
3. systemd looks at the pid file, and finds 28942 written there - that sshd
process is running, so it reports sshd as "active (running)".
The sshd process (28942) has *not* re-loaded the sshd_config file, so changes
will not have taken effect.
**Observed Behaviour: stop + start, after a failed reload**
Note that calling {{service sshd stop; service sshd start}} after a bad reload
will behave in the same way as the {{service sshd restart}}.
**Observed Behaviour: subsequent restart, after failed restart**
A subsequent call to restart now succeeds as expected. The process is replaced,
and the changes made in sshd_config take effect.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)