On 26.11.2014 14:27, Lars Ellenberg wrote:

DRBD "logging" is simply a printk.
Whether or not that makes it to stable storage via some syslog channel
or not is no longer in control of DRBD.
Especially if the storage in fact *did* have problems, I think it is
very unlikely that any logging would have made it to disk on that box...

I don't think the storage ACTUALLY had a problem, besides possibly being under high load. At least I cannot see anything wrong in the RAID controller or kernel logs. Besides, as I said, the syslog is on a separate disk subsystem, presented by a different controller using a different driver, so I assume that even if one RAID controller or disk subsystem has a problem, it should still be possible to log to syslog as long as the system has not crashed.

Also: the disk-timeout option is *dangerous* and *may lead to kernel
panic*.  So don't use it (unless you are *very* certain that you know
what you are doing, and have a very good reason to do it).

I read that before and my intent is the following:

If a disk subsystem on the master is neither reacting nor throwing I/O errors, the master role should be transferred to the peer no matter what. So I would rather accept a kernel panic in such a situation than wait forever for a non-responding disk subsystem, which would be less acceptable in my opinion.

The problem in this situation was that I had prepared the DRBD config for a cluster manager that is installed and properly configured to do all that, but in fact I did not have enough time in the last maintenance window to apply my cluster configuration because of other problems that occurred.

So in this situation the disk-timeout does not make sense, as I risk the system crashing with no one taking over. I have removed the disk-timeout setting for now but still intend to use it later, once the cluster manager is in place.

I still have to monitor this behaviour again in my test setup to make sure I never reach a disk-timeout situation under normal working conditions. As far as I can tell from my munin logs and from watching iostat under high load, a volume should never be unresponsive for more than 30s, at least as long as it does not ACTUALLY have a serious problem.
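For the lab test, something like the following could serve as a rough latency probe. It is only a minimal sketch, not anything DRBD provides: the function names, the 4 KiB write size and the 30s threshold are my own assumptions, and it times a single small write plus fsync on a directory of the volume under test. Note that if the device really hangs, the probe itself blocks in fsync, so it can only report writes that eventually complete.

```python
import os
import tempfile
import time

# Assumed threshold: the 30s figure mentioned above, not a DRBD default.
DISK_TIMEOUT_S = 30.0


def probe_write_latency(directory, size=4096):
    """Time one small write plus fsync in `directory`; return seconds."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.monotonic()
        os.write(fd, b"\0" * size)
        os.fsync(fd)  # force the data down to the underlying device
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)


def would_hit_timeout(latency_s, timeout_s=DISK_TIMEOUT_S):
    """Return True if a measured latency reaches the assumed timeout."""
    return latency_s >= timeout_s
```

Run in a loop under load (for example once per second against a directory on the DRBD-backed file system), the largest value returned by probe_write_latency gives an idea of how much headroom there is below the timeout.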

Of course there may be bugs in our code, so if you should be able to
reproduce "misbehaviour", let us know.

I will do testing with this again in my lab to see under which conditions the disk-timeout might be reached. Thank you for commenting.

regards, Felix