On 26.11.2014 14:27, Lars Ellenberg wrote:
> DRBD "logging" is simply a printk.
> Whether or not that makes it to stable storage via some syslog channel
> or not is no longer in control of DRBD.
> Especially if the storage in fact *did* have problems, I think it is
> very unlikely that any logging would have made it to disk on that box...
I don't think the storage ACTUALLY had a problem, apart from possibly
being under high load. At least I cannot tell from the RAID controller
or kernel logs that anything was wrong. Besides that, as I said, syslog
is on a separate disk subsystem, presented by a different controller and
using a different driver, so I assume that even if some RAID controller
or disk subsystem has a problem, it should still be possible to log to
syslog as long as the system has not crashed.
> Also: the disk-timeout option is *dangerous* and *may lead to kernel
> panic*. So don't use it (unless you are *very* certain that you know
> what you are doing, and have a very good reason to do it).
I read that before, and my intent is the following: if the disk
subsystem on the master is neither responding nor throwing I/O errors,
the master role should be transferred to the peer no matter what. So I
would rather accept a kernel panic in such a situation than wait forever
on an unresponsive disk subsystem, which would be less acceptable in my
opinion.
The problem in this situation was that I had prepared the DRBD config
for a cluster manager that would be installed and properly configured to
handle all of that, but in fact I did not have enough time in the last
maintenance window to apply my cluster configuration, due to other
problems that occurred. So in this situation the disk-timeout does not
make sense, as I risk the system crashing with no one taking over. I
have therefore removed the disk-timeout setting for now, but I still
intend to use it later, once the cluster manager is in place.
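For reference, once the cluster manager is in place, the setting I have
in mind would be a disk-timeout in the disk section of the resource.
A minimal sketch (assuming DRBD 8.4 drbd.conf syntax; as far as I
understand, the value is given in tenths of a second, so 300 would mean
30 s, but please verify against the drbd.conf man page for your
version):

```
resource r0 {
  disk {
    # Treat the backing device as failed if an I/O request does
    # not complete within 30 seconds (value in 0.1 s units).
    # WARNING: as Lars points out above, forcibly completing
    # in-flight I/O this way may lead to a kernel panic; only
    # use it with a cluster manager ready to take over.
    disk-timeout 300;
  }
}
```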
But I still have to monitor this behaviour again in my test setup to
make sure I never hit the disk-timeout under normal working conditions.
As far as I can tell from my Munin graphs and from watching iostat under
high load, a volume should never be unresponsive for more than 30 s, at
least as long as it does not ACTUALLY have a serious problem.
> Of course there may be bugs in our code, so if you should be able to
> reproduce "misbehaviour", let us know.
I will do testing with this again in my lab to see under which
conditions the disk-timeout might be reached. Thank you for commenting.
regards, Felix
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user