[BlueOnyx:21467] Re: Bizarre 5209R loses network config

Michael Stauber Tue, 03 Oct 2017 14:05:43 -0700

Hi Chris,

> /etc/udev/rules.d/70-persistent-net.rules had two distinct MAC entries
> listed for eth0.   Two entries, just as you might expect when there are
> 2 interfaces (typically eth0 and eth1) but both were eth0.


This is pretty weird indeed. Just to clarify: Were the reported MAC
addresses for eth0 the same, or were they different? If they were
different, that would make it even weirder, as the MAC address of an
interface isn't supposed to change.

> the net.ifnames=0 variable.   It was missing.

All the 5209R ISOs set net.ifnames=0 during post-install, and these
options are usually retained during kernel upgrades. I'm at a loss as to
why this would go walkies on its own. :-/

FWIW: I just fired up a 5209R virtualized under VirtualBox, updated it
with half a year of updates for CentOS 7 and BlueOnyx and it had no
issues at all. That isn't representative, though, as that VPS is not in
production and had been shut down for months.

> The only thing that gives me concern, though, is what if those boxes
> that had Apache lock up on them late last week / early this week are
> going to do the same thing when they are rebooted.  (Obviously this will
> apply to physical bare-metal installs and not Aventurin{e} virtuals.)

The likelihood of this happening again is pretty small.

Fact of the matter is: Apache is a bitch to restart or reload
non-interactively. If you do so while one child process is serving an
HTTP or HTTPS request, the master process and all children (except the
active one) will terminate, and the still-busy child process will remain
behind. At that point any further attempts to restart/reload Apache will
complain that it's still active. That holds even on Systemd boxes, as the
one stray child process that's still hanging around keeps sending
heartbeat signals to Systemd. On InitV systems its presence is detected
by other means, such as port 80 still being in use.

We no longer issue the "reload" command to HTTPd either, because it
produces the most unpredictable behavior and even behaves slightly
differently between Apache 2.2 + InitV and Apache 2.4 + Systemd.

We now run Swatch at the end of any YUM update that issues a CCEd rehash
or restart and almost all BlueOnyx updates do one of these for good
measure.

The Apache-related checks in Swatch consist of two distinct tests. The
first is a telnet to port 80 with a parsing of the response. On its own
this still may not reliably detect a semi-dead Apache, depending on
which IP address the detached child process is still listening on.

The second test is a bit more complex and uses the process list to find
out if Apache children have detached or not:

ps axf|grep /usr/sbin/httpd|grep -v adm|grep -v grep|grep -v '\_'|wc -l

That should report a number. It counts the non-admserv httpd processes
in the process list that aren't child processes. There should be exactly
one if all is OK:

    #   0   Apache dead
    #   1   Apache probably running OK
    #  >1   Children have detached (bad)
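
For illustration, the count and its interpretation could be wrapped up
roughly like this. It's only a sketch and not the actual Swatch code:
the function names and the "dead"/"ok"/"detached" labels are made up
here, and it uses grep -F so that '\_' is matched literally, the way
"ps axf" prints it in front of child processes.

```shell
# Count top-level httpd processes in "ps axf" style output read from stdin.
# Same filter chain as above: drop admserv lines, drop the grep itself and
# drop forest children, which "ps axf" marks with a literal "\_".
count_httpd_parents() {
    grep /usr/sbin/httpd | grep -v adm | grep -v grep | grep -v -F '\_' | wc -l | tr -d ' '
}

# Map the count to the three states from the table above.
# The state names are hypothetical labels for this sketch.
classify_httpd_count() {
    case "$1" in
        0) echo "dead" ;;      # no Apache master process left
        1) echo "ok" ;;        # exactly one master: probably running OK
        *) echo "detached" ;;  # more than one top-level process: bad
    esac
}

# Typical use on a live box:
#   classify_httpd_count "$(ps axf | count_httpd_parents)"
```

Reading the process list from stdin keeps the filter easy to try out
against a saved "ps axf" dump without touching a running server.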

If we don't get "1" as response to that check, we selectively kill every
non-admserv HTTPd process this way:

ps axf|grep /usr/sbin/httpd|grep -v adm|grep -v grep|grep -v '\_'|awk -F ' ' '{print $1}'|xargs kill -9

This leaves AdmServ running, but kills Apache for good. Then we just
issue a normal restart of Apache via /sbin/service or systemctl -
depending on the platform.
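
Put together, the kill-and-restart step might look something like the
sketch below. Again, this is not the real Swatch code. The function
names, the DRY_RUN switch, the "httpd" service name and the way Systemd
is detected (checking whether systemctl is available) are all
assumptions made for the example.

```shell
# Print the PIDs (first column) of top-level, non-admserv httpd processes
# found in "ps axf" style output on stdin.
stray_httpd_pids() {
    grep /usr/sbin/httpd | grep -v adm | grep -v grep | grep -v -F '\_' | awk '{print $1}'
}

# Kill those processes, then restart Apache through the init system.
# Set DRY_RUN to any non-empty value (hypothetical switch for this sketch)
# to only print the commands instead of running them.
recover_httpd() {
    run=${DRY_RUN:+echo}
    pids=$(ps axf | stray_httpd_pids)
    if [ -n "$pids" ]; then
        # $pids is left unquoted on purpose so each PID becomes one argument.
        $run kill -9 $pids
    fi
    # Assumption: a box with systemctl is a Systemd box, anything else InitV.
    if command -v systemctl >/dev/null 2>&1; then
        $run systemctl restart httpd
    else
        $run /sbin/service httpd restart
    fi
}
```

Skipping the kill when no PIDs are found avoids calling kill without
arguments in the case where Apache is already completely dead.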

There isn't much room for further improvement to that, unless I start
over and use a radically different approach. I am considering turning
Swatch into a daemon that performs health checks more often than the
current 15-minute cronjob. Part of the improved health checks would be
periodic protocol-specific connection attempts to the actual network
ports to see if they react in time and in the expected manner, but for
some problematic services such as Apache we would still need to do a
little extra, as demonstrated above. Additionally, the GUI (and its
components) could collect requests to restart/reload services and route
them via the existing Sauce::Service through this new daemon.

In that case multiple subsequent restart/reload events would be combined
into a single one and issued to Systemd/InitV, and right after that the
health of the service would be checked again to confirm that it took the
ordered dive and is now back up and healthy again.
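
Just to illustrate the combining part: no such daemon exists yet, so
the following is purely a thought experiment. It assumes queued requests
arrive as "service action" lines, and that a restart should win over a
reload when both are pending for the same service. Both of these are
assumptions for the sketch, as is the function name.

```shell
# Collapse a stream of "service action" request lines (stdin) into one
# line per service, keeping the order in which services first appeared.
# Assumption: "restart" supersedes "reload" for the same service.
coalesce_requests() {
    awk '
        NF == 2 {
            if (!($1 in action)) order[++n] = $1          # remember first appearance
            if ($2 == "restart" || !($1 in action)) action[$1] = $2
        }
        END { for (i = 1; i <= n; i++) print order[i], action[order[i]] }
    '
}

# Example:
#   printf "httpd reload\nhttpd restart\nnamed reload\n" | coalesce_requests
```

Each resulting line would then be issued once to Systemd/InitV, followed
by the health check for that service.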

There are a couple of reasons in favor of such a daemon (and I haven't
mentioned all of them yet), but also 2-3 that speak against it. I'd
rather wait and see if the recent changes to Swatch and the YUM plugin
take good enough care of the issue as is, before I set off to code the
Swatch-daemon.

-- 
With best regards

Michael Stauber
_______________________________________________
Blueonyx mailing list
Blueonyx@mail.blueonyx.it
http://mail.blueonyx.it/mailman/listinfo/blueonyx
