Re: How assign some logic to handle system-gone-totally-unresponsive events (if not else then to enable admin with differentiated failure tracking between userland and hardware failures)

Stuart Henderson Tue, 18 Oct 2016 05:23:10 -0700

On 2016-10-17, Karel Gardas <[email protected]> wrote:
> 1) use machine with proper ECC support
> 2) man sendbug -- and following it report your OpenBSD kernel misbehavior

This can be a hard thing to report.

When the machine totally locks up, it is very difficult to get the information
needed to make a bug report. Often it is not known exactly how to trigger it,
or whether it's a software bug, a bit flip, or a hardware fault.

Sometimes you can get useful information from monitoring the machine in the
run-up to a failure - symon (in ports) can be useful for logging things to a
remote machine at an interval which is often fast enough to give clues into
what might be happening. But unless you have a reproducible case, or something
which happens randomly but fairly often, you can be watching for a long time
and not really know exactly what to be monitoring.
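
To illustrate the general idea of interval logging to a remote machine, here
is a minimal sketch in Python. It is not how symon works internally, and it is
no replacement for it. The function names, the JSON layout and the collector
address are all made up for the example. It sends the load averages over UDP
every few seconds, so that the last datagram the collector received gives a
rough idea of when the machine stopped and what the load looked like just
before that.

```python
import json
import os
import socket
import time


def build_sample(now=None):
    """Return one monitoring sample as JSON-encoded bytes.

    Hypothetical format: hostname, timestamp, and the 1/5/15 minute
    load averages. Real tools such as symon collect far more than this.
    """
    load1, load5, load15 = os.getloadavg()
    sample = {
        "host": socket.gethostname(),
        "time": time.time() if now is None else now,
        "load": [load1, load5, load15],
    }
    return json.dumps(sample).encode("utf-8")


def send_samples(collector, interval=5.0, count=None):
    """Send one sample to collector (host, port) every `interval` seconds.

    UDP is used so that a slow or dead collector never blocks the sender.
    With count=None the loop runs until interrupted; otherwise it stops
    after `count` samples. Returns the number of samples sent.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    try:
        while count is None or sent < count:
            sock.sendto(build_sample(), collector)
            sent += 1
            # Sleep only if another sample will follow.
            if count is None or sent < count:
                time.sleep(interval)
    finally:
        sock.close()
    return sent
```

Something like send_samples(("192.0.2.10", 9999), interval=5.0) would run it
against a collector at a placeholder address; the receiving side only needs to
append each datagram to a file along with its arrival time. A short interval
costs little and narrows the window in which the lockup happened.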

On the other hand if you do have a *reproducible* way to trigger such a bug,
that's of great interest.

> On Mon, Oct 17, 2016 at 3:48 PM, Tinker <[email protected]> wrote:
>> Sometimes a machine goes unresponsive. In this case, a non-ECC RAM machine.
>>
>> The reason could be that something in the hardware or kernel failed, e.g. a
>> bit flip error [1].
>>
>> In this case (for a non-kernel developer), tough luck, and the proper thing
>> would be to reboot, and keep statistics over failures on that machine and
>> replace the hardware should the crashes go above some frequency threshold.

If you're not running an up-to-date release, please update: stefan@'s work on
amap in the 5.9-6.0 timeframe certainly helps some cases - one of the post-6.0
errata might also apply with very large allocations, so 6.0-stable or -current
would be advisable.
