Hi Mark,

On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman <[email protected]> wrote:
> Thanks, for now, we will treat the “nasty error” as a separate
> question to resolve and hope that clean-up patch deals with the
> immediate issue.

OK, that patch can be discussed on bz as long as it doesn't turn too
technical. Technical details, (long) discussions and debugging are not
very friendly for future visitors of bz who may encounter the same
issue and want to get to the solution...

>
> I had originally treated that “nasty error” as a reference to the
> “file exists” error. However, based on your feedback and reviewing
> the logs, I would conclude that “nasty error” is the trigger, as you
> suggest, and the lack of SHM clean-up and consequent collisions are
> collateral damage.

That's what I feel, but I wouldn't stake my life on it either :)

>
> Just to confirm, you expect that patch to handle SHM clean-up even in
> the “nasty error” case?

Not really, no patch can prevent a crash in crashing code :/
The "stop_signals-PR61558.patch" patch avoids a known httpd crash in
some circumstances, but...

> I suspect that nasty error is triggered by
> the Weblogic plugin based on the adjacency in the logs, but the
> tracing doesn’t reveal any details, so an strace will probably be
> required to get more detail.

... if the crash is not related, that won't help.

I'm missing something in your scenario though.

In the original/non-patched code and still with the "generation
number" patch (aka "Jim's"), there is always an attempt to attach the
SHM first, and only if that fails is a new one created.
It means that even if the parent process crashes without cleaning up
the SHM on the system, whether or not some children are still alive
when a new httpd instance is started, it should be able to attach the
SHM (create would fail, but not attach).
Btw, things would probably turn bad sooner or later because
synchronization assumptions are off (old and new children wouldn't
share the same mutex which is not reused/attached on startup, global
mutexes leak in the system for that scenario more than SHMs).
So why do both attach and create fail in your case?
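
To illustrate the attach-first logic described above, here is a minimal
sketch. It is not httpd code: it uses Python's
multiprocessing.shared_memory as a stand-in for the real SHM calls, and
the function name attach_or_create is made up for this example. It only
shows why a leftover segment should still be attachable even though
creating it again would fail.

```python
from multiprocessing import shared_memory


def attach_or_create(name, size):
    """Attach to an existing segment; create it only if attach fails.

    Returns (segment, created) where created is True when a new
    segment had to be created. Hypothetical helper, not httpd's API.
    """
    try:
        # Attach first: succeeds if a previous instance left the
        # segment behind (e.g. the parent crashed without cleanup).
        return shared_memory.SharedMemory(name=name), False
    except FileNotFoundError:
        # Nothing to attach to, so create a fresh segment.
        seg = shared_memory.SharedMemory(name=name, create=True, size=size)
        return seg, True
```

With this ordering, a "file exists" style failure would only show up if
the create path were reached while the segment was still present, which
is the part of the scenario that remains unexplained.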

With my proposed patch (r1822509), since I removed attach (bullet 4/
in the commit message), your scenario is "expected" to fail when the
second httpd instance starts (while old children are still alive).
I'm not sure I should fix this (re-introduce the attach code) because
as I said this is a screwy scenario with regard to the global mutex,
it's not supposed to work like this.
The only sane thing to do here (IMHO, and more a note to other httpd
devs) would be to kill children whenever the parent process dies
underneath them, be it with a startup script (there shouldn't be any
orphaned child process, at least when httpd starts), or natively in
the MPM which could detect this situation (that's another story
though, and it probably should be opt-in because it depends on how
httpd is started/monitored externally, and how much the user wants the
service to continue as much as possible...).

So the faster/simpler solution *for you* might be to create/modify
your (re)startup script such that it kills orphaned children, if any,
as a precaution...
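
A rough sketch of what such a pre-start check could look like, under
several assumptions: the binary is named "httpd", orphaned children get
reparented to PID 1 (not true under every init/supervisor setup), and
the check runs only when no httpd instance is supposed to be running,
since a daemonized parent usually has PID 1 as its parent too. All the
function names are made up for this example.

```python
import os
import signal
import subprocess


def list_processes():
    """Return (pid, ppid, command) tuples as reported by ps."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def find_orphans(rows, name="httpd", init_pid=1):
    """Pick processes with the given name whose parent is init_pid."""
    return [
        pid for pid, ppid, comm in rows
        if ppid == init_pid and os.path.basename(comm) == name
    ]


def kill_orphans(pids, sig=signal.SIGTERM):
    """Signal each pid, ignoring processes that exited meanwhile."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

A start script would call kill_orphans(find_orphans(list_processes()))
before launching httpd, possibly waiting and escalating to SIGKILL if
some children do not exit.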

>
> Bugzilla was slightly easier to get log data into as I cannot use
> work email for these conversations.

There is no strong statement/rule on bz vs dev@, if it's more
convenient for you to continue there this is a good reason ;)
I wouldn't go as far in the discussion as I did here, though (sorry if
it was too long btw).


Regards,
Yann.
