On 2017-12-28 04:25, Veit Wahlich wrote:
> Hi Christoph,
>
> I believe that, at least for synchronous replication with protocol C,

We use Protocol C, too.

> the oos count should always be 0 in a healthy, fully synchronized
> configuration, and that any occurrence of a value >0 (except for currently
> running manual administrative tasks)

You mean something like switching roles? That is not an issue for us: we only
switch roles when we really have to (about every 2 years, to upgrade the OS
and sometimes the hardware).

> indicates a problem that requires to be investigated. Therefore I regard an
> automated disconnect-connect, for the sole purpose of clearing the oos
> counter without determining the cause, both a very bad idea and bad practice.

But
https://docs.linbit.com/doc/users-guide-84/s-use-online-verify/#s-online-verify-invoke
clearly recommends verify + disconnect + connect as the solution.

However, we see non-zero oos values far too often for something that is
supposed to have been stable for years now. We have set both
data-integrity-alg and verify-alg to crc32c, and the dedicated GBit
connections are usually heavily underused, so how can there be
synchronization problems on a daily basis? My fear is that DRBD still has
problems with I/O peak situations. But I can't imagine a sub-optimal or
"wrong" configuration "causing" that.

> We have run hundreds of synchronously replicated DRBD8 volumes for years now
> that we verify weekly, but we never ever sighted oos that were not either
> caused by a runtime, configuration or hardware issue.
>
> Our verification runs utilise a script similar to yours, but it actively
> parallelises the task to optimise for minimum duration while maintaining a
> constant load that won't harm performance. It does so by sorting all volumes
> by size and then run a given number of verify tasks at once, beginning with
> the largest volumes, and starting the next verify once one finishes.
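
If I read the description correctly, the scheduling could be sketched roughly
as below. This is only an illustration in Python, not your script: the names
largest_first, simulate_makespan, run_verifies and verify_fn are made up here,
and verify_fn is a placeholder that would have to start the verify for one
resource, wait until it has finished, and return the resulting oos count. The
simulation additionally assumes that verify time is proportional to volume
size.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def largest_first(volumes):
    """Return (name, size) pairs sorted by size, largest first."""
    return sorted(volumes, key=lambda v: v[1], reverse=True)


def simulate_makespan(volumes, max_parallel):
    """Estimate the total run time of a largest-first schedule.

    Assumption: verifying a volume takes time proportional to its size.
    Each of the max_parallel slots picks up the next volume as soon as
    it becomes free.
    """
    slots = [0] * max_parallel           # time at which each slot is free
    heapq.heapify(slots)
    for _name, size in largest_first(volumes):
        free_at = heapq.heappop(slots)   # earliest free slot takes the job
        heapq.heappush(slots, free_at + size)
    return max(slots)


def run_verifies(volumes, max_parallel, verify_fn):
    """Call verify_fn(name) for every volume, at most max_parallel at once.

    verify_fn is hypothetical: it must block until the verify of that
    resource is done and return the oos count it found.
    Returns a dict mapping resource name to oos count.
    """
    ordered = largest_first(volumes)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # The pool starts queued jobs in submission order, so the largest
        # volumes start first and the next one starts when a worker frees up.
        futures = [(name, pool.submit(verify_fn, name)) for name, _ in ordered]
        return {name: fut.result() for name, fut in futures}
```

Any non-zero value in the dict returned by run_verifies would then be what
goes to stderr in the report. Comparing simulate_makespan for one slot and for
several slots gives a rough idea of how much the parallel run saves on a given
mix of volume sizes.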
> Especially on machines that have few very big volumes and lots of small ones,
> this allows to complete the verification of all volumes at the time the big
> volumes take alone, thus minimal duration at constant I/O load without peaks.
> The script prints a report to stdout with any occurrence of oos to stderr,
> making it easy to filter for any problems -- even before monitoring notices.

Nice approach ;-)

Regards,
Christoph
