Aled Sage created BROOKLYN-434:
----------------------------------

             Summary: Provisioning of CentOS 7.3 (via jclouds) can leave sshd 
in bad state
                 Key: BROOKLYN-434
                 URL: https://issues.apache.org/jira/browse/BROOKLYN-434
             Project: Brooklyn
          Issue Type: Bug
    Affects Versions: 0.10.0
            Reporter: Aled Sage


**Underlying Problem**

There is a bug in CentOS 7.3 (and RHEL 7.3) related to reloading of the SSH 
daemon configuration. See:
* RHEL bug report at https://bugzilla.redhat.com/show_bug.cgi?id=1381997 (which 
gives the best description of the underlying problem in 
https://bugzilla.redhat.com/show_bug.cgi?id=1381997#c4).
* CentOS bug report at https://bugs.centos.org/view.php?id=12757
* The change introduced the problematic behavior at 
https://bugzilla.redhat.com/show_bug.cgi?id=1291172

**Brooklyn Code calling sshd reload**

The problems occurs when jclouds (or 
{{org.apache.brooklyn.policy.jclouds.os.CreateUserPolicy}}) modifies 
{{/etc/ssh/sshd_config}} and then calls {{service sshd reload}}. This can leave 
systemd's view of sshd in a bad state. A subsequent call to {{service sshd 
reload}} or {{service sshd restart}} will not have the desired effect!

The jclouds code calls {{service sshd reload}} if it is configured to lock-down 
sshd, which is the default - it will configure {{PasswordAuthentication no}} 
and {{PermitRootLogin no}} (see 
https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/location/jclouds/JcloudsLocation.java#L2293).

Subsequent use of {{CreateUserPolicy}} will also call {{service sshd reload}} 
(see 
https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/policy/jclouds/os/CreateUserPolicy.java#L159-L164).


**Observed Behaviour: initial jclouds provisioning**

The exact behaviour we see on CentOS 7.3 is non-deterministic. Below we attempt 
to describe what can happen.

*TL;DR: systemd thinks sshd is inactive/dead, but the process is really 
running.*

On a newly provisioned CentOS 7.3 VM, immediately after we've provisioned it 
through jclouds...

Looking at the output from jclouds (having tweaked it so that the stderr isn't 
lost), it executed:

{noformat}
exec 3<> /etc/ssh/sshd_config && awk -v TEXT="PasswordAuthentication no
PermitRootLogin no
" 'BEGIN {print TEXT}{print}' /etc/ssh/sshd_config >&3
hash service && service ssh reload || service sshd reload || /etc/init.d/ssh* 
reload
{noformat}

The relevant stderr is:
{noformat}
Redirecting to /bin/systemctl reload  sshd.service
Redirecting to /bin/systemctl status  sshd.service
{noformat}

The status of sshd (according the systemd) is:

{noformat}
$ service sshd status

Redirecting to /bin/systemctl status  sshd.service
* sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor 
preset: enabled)
   Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 5s ago
     Docs: man:sshd(8)
           man:sshd_config(5)
  Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, 
status=0/SUCCESS)
  Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, 
status=0/SUCCESS)
 Main PID: 10348 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/sshd.service
           └─28942 /usr/sbin/sshd
{noformat}

However, the process is really running and we can ssh to it:
{noformat}
$ cat /var/run/sshd.pid
28942

$ ps aux | grep "/usr/sbin/sshd"
root     28942  0.0  0.0  82468  1260 ?        Ss   15:16   0:00 /usr/sbin/sshd
{noformat}


**Observed Behaviour: sshd reload again**

*TL;DR: running reload again fails - the process keeps running, but does not 
re-load the sshd_config file.*

After running {{service sshd reload}}, if the same command is run again we get:

{noformat}
$ sudo service sshd reload
Redirecting to /bin/systemctl reload  sshd.service
Job for sshd.service invalid.

$ echo $?
1

$ cat /var/run/sshd.pid
28942

$ ps aux | grep "/usr/sbin/sshd"
root     28942  0.0  0.0  82468  1260 ?        Ss   15:16   0:00 /usr/sbin/sshd

$ service sshd status
Redirecting to /bin/systemctl status  sshd.service
* sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor 
preset: enabled)
   Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 1min 9s ago
     Docs: man:sshd(8)
           man:sshd_config(5)
  Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, 
status=0/SUCCESS)
  Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, 
status=0/SUCCESS)
 Main PID: 10348 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/sshd.service
           └─28942 /usr/sbin/sshd
{noformat}

In {{/var/log/messages}}, we have:

{noformat}
Feb  8 15:17:23 host-10-104-0-188 systemd: Unit sshd.service cannot be reloaded 
because it is inactive.
{noformat}


**Observed Behaviour: restart, after a failed reload**

*TL;DR: systemd does not restart the process (same pid) - the existing process 
is now recognised as running though, but the {{/etc/ssh/sshd_config}} has not 
be reloaded.*

{noformat}
$ sudo service sshd restart
Redirecting to /bin/systemctl restart  sshd.service

$ echo $?
0

$ cat /var/run/sshd.pid
28942

$ ps aux | grep "/usr/sbin/sshd"
root     28942  0.0  0.0  82468  1260 ?        Ss   15:16   0:00 /usr/sbin/sshd

$ service sshd status
Redirecting to /bin/systemctl status  sshd.service
* sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor 
preset: enabled)
   Active: active (running) since Wed 2017-02-08 15:21:39 UTC; 5s ago
     Docs: man:sshd(8)
           man:sshd_config(5)
  Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, 
status=0/SUCCESS)
  Process: 29726 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, 
status=0/SUCCESS)
 Main PID: 28942 (sshd)
   CGroup: /system.slice/sshd.service
           └─28942 /usr/sbin/sshd

Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Starting OpenSSH 
server daemon...
Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Started OpenSSH server 
daemon.
{noformat}

In {{/var/log/messages}}, we have:

{noformat}
Feb  8 15:21:39 host-10-104-0-188 systemd: Starting OpenSSH server daemon...
Feb  8 15:21:39 host-10-104-0-188 systemd: Started OpenSSH server daemon.
{noformat}

And in {{/var/log/secure}}:

{noformat}
Feb  8 15:21:39 host-10-104-0-188 sudo:     amp : TTY=pts/0 ; 
PWD=/home/users/amp ; USER=root ; COMMAND=/usr/sbin/service sshd restart
Feb  8 15:21:39 host-10-104-0-188 polkitd[497]: Registered Authentication Agent 
for unix-process:29709:569722 (system bus name :1.29 [/usr/bin/pkttyagent 
--notify-fd 5 --fallback], object path 
/org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8)
Feb  8 15:21:39 host-10-104-0-188 polkitd[497]: Unregistered Authentication 
Agent for unix-process:29709:569722 (system bus name :1.29, object path 
/org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) 
(disconnected from bus)
Feb  8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on 
0.0.0.0 failed: Address already in use.
Feb  8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on :: 
failed: Address already in use.
Feb  8 15:21:39 host-10-104-0-188 sshd[29727]: fatal: Cannot bind any address.
{noformat}

My wild guess at what has happened here is...
1. systemd thinks sshd is not running, so did not stop it.
2. systemd starts another sshd process, which fails because process 28942 is 
already running and listening on port 22.
3. systemd looks at the pid file, and finds 28942 written there - that sshd 
process is running, so it reports sshd as "active (running)".

The sshd process (28942) has *not* re-loaded the sshd_config file, so changes 
will not have taken effect.

**Observed Behaviour: stop + start, after a failed reload**

Note that calling {{service sshd stop; service sshd start}} after a bad reload 
will behave in the same way as the {{service sshd restart}}.


**Observed Behaviour: subsequent restart, after failed restart**

A subsequent call to restart now succeeds as expected. The process is replaced, 
and the changes made in sshd_config take effect.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to