On 2019-01-14, Marco Prause <[email protected]> wrote:
> after an initial boot, everything is working fine for round about 4 hours.
>
> After 4 hours, it is not possible to login into the backup/secondary
> openbsd-server via ssh or even via serial console, but it seems to still
> forward traffic correctly. Also the ospf adjacencies are up&running as
> well as ipsec security associations and so on.
>
> Monitoring metrics don't show any measured increase in any data.
>
> I've already exchanged the hardware, because it was my first guess, as
> the first server/gateway is running without any problems with the same
> 6.4-stable and config version - but this unfortunately didn't help.

Is it the same or different hardware type and BIOS version for the
working and hanging machines? (maybe diff the two dmesgs)
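
A rough sketch of that comparison, assuming you have saved each
machine's dmesg to a file first (the file names and the helper name
compare_dmesg are only placeholders):

```shell
# compare_dmesg FILE_A FILE_B: print a unified diff of two saved dmesg outputs.
# Save them beforehand, e.g. "dmesg > dmesg.primary" on one machine and
# "dmesg > dmesg.backup" on the other (hypothetical file names).
compare_dmesg() {
    # diff exits 1 when the files differ; treat that as a normal result
    # and only fail on real errors (exit status 2, e.g. missing file).
    diff -u "$1" "$2" || [ $? -eq 1 ]
}
```

Something like "compare_dmesg dmesg.primary dmesg.backup" then shows
differing bios0/cpu0/device lines at a glance.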

Same or different filesystem mount options?  (Are you using softdep?)
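
One quick way to check, sketched as a small filter over mount(8)
output (the helper name softdep_mounts is made up for illustration):

```shell
# softdep_mounts: read mount(8)-style output on stdin and print only the
# lines whose option list contains "softdep". Prints nothing if none do.
softdep_mounts() {
    grep -w 'softdep' || true
}
```

Run it as "mount | softdep_mounts" on both machines and compare.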

> When I left a serial console login open, I was able to execute some
> commands and also a top, I've invoked before, was still running at the
> failure-state. But when entering e.g. ifconfig, or trying a
> tab-completion also the serial console freezes.

The "WAIT" column of a running top(1) may include useful information.

If possible, run with "sysctl ddb.console=1" (this needs setting
pre-securelevel; add it to sysctl.conf if it's not already there),
which should allow you to enter ddb by sending a BREAK signal over
the serial line (~# in cu(1)). You can try that under normal
operation to check that it works (it will interrupt service; be
ready to type "c" and press enter to resume).
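
As a minimal sketch of the sysctl.conf part, assuming the usual
/etc/sysctl.conf location and root privileges (the helper name
enable_ddb_console is hypothetical):

```shell
# enable_ddb_console [FILE]: add "ddb.console=1" to a sysctl.conf-style
# file unless that exact line is already present. FILE defaults to
# /etc/sysctl.conf; the setting takes effect at the next boot.
enable_ddb_console() {
    conf="${1:-/etc/sysctl.conf}"
    grep -qx 'ddb.console=1' "$conf" 2>/dev/null ||
        echo 'ddb.console=1' >> "$conf"
}
```

After rebooting, "sysctl ddb.console" should report the value 1.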

Then, during a hang, attempt to enter ddb. If you are successful,
capture at least the following:

ps
trace

Ideally also switch to each of the other CPUs (the number in the ddb
prompt shows the current one; you can do "mach ddbcpu 3" etc.
to switch to another) and re-run trace (which is completely
per-CPU) and ps (the line marked "*" indicates the currently
active process on the currently selected CPU; for a report
there's no need to repeat the entire list N times, but it could
be useful to indicate the running processes on all CPUs).
When you are done with these, also fetch:

sh malloc
sh all pools

For the benefit of other readers who don't have a serial console:
ctrl+alt+esc on the keyboard will do the same if the
keyboard/monitor are the selected console device. Obviously
it will be harder to capture the output in an easily readable
format!

