Re: [DRBD-user] Semantics of oos value, verification abortion

Christoph Lechleitner Thu, 28 Dec 2017 02:31:59 -0800

On 2017-12-28 04:25, Veit Wahlich wrote:
> Hi Christoph, 
> 
> I believe that, at least for synchronous replication with protocol C,


We use Protocol C, too.


> the oos count should always be 0 in a healthy, fully synchronized 
> configuration, and that any occurrence of a value >0 (except for currently 
> running manual administrative tasks)

You mean something like switching roles?
Not an issue, we only switch roles when we really really have to (about
every 2 years, to upgrade OS and sometimes the hardware).


> indicates a problem that needs to be investigated. Therefore I regard an 
> automated disconnect-connect, for the sole purpose of clearing the oos 
> counter without determining the cause, as both a very bad idea and bad practice.

But,

https://docs.linbit.com/doc/users-guide-84/s-use-online-verify/#s-online-verify-invoke
clearly recommends verify + disconnect + connect as the solution.


However, we see non-zero oos values far too often for something that is
supposed to have been stable for years now.
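
For reference, a minimal sketch of how the oos counter could be collected
per device for monitoring. It assumes the DRBD 8.x /proc/drbd layout, where
each device starts with a "<minor>: cs:..." line followed by a counter line
containing "oos:<n>" (in KiB, as far as I know). The function name parse_oos
is made up for this example.

```python
import re


def parse_oos(proc_drbd_text):
    """Return {minor: oos} parsed from /proc/drbd-style text (DRBD 8.x layout assumed)."""
    result = {}
    minor = None
    for line in proc_drbd_text.splitlines():
        # A device block starts with e.g. " 0: cs:Connected ro:Primary/Secondary ..."
        m = re.match(r"\s*(\d+):\s+cs:", line)
        if m:
            minor = int(m.group(1))
            continue
        # The counter line of that device carries "... oos:<n>"
        m = re.search(r"\boos:(\d+)", line)
        if m and minor is not None:
            result[minor] = int(m.group(1))
    return result
```

Feeding it the contents of /proc/drbd gives one entry per minor; any
non-zero value is what we keep running into.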

We have set both data-integrity-alg and verify-alg to crc32c, and the
dedicated GBit connections are usually heavily underused, so how can
there be synchronization problems on a daily basis?

My fear is that DRBD still has problems with I/O peak situations.

But I can't imagine a sub-optimal or "wrong" configuration "causing" that.


> We have run hundreds of synchronously replicated DRBD8 volumes for years now 
> that we verify weekly, but we have never seen any oos that was not 
> caused by a runtime, configuration or hardware issue.
> 
> Our verification runs utilise a script similar to yours, but it actively 
> parallelises the task to optimise for minimum duration while maintaining a 
> constant load that won't harm performance. It does so by sorting all volumes 
> by size and then running a given number of verify tasks at once, beginning with 
> the largest volumes, and starting the next verify once one finishes. 
> Especially on machines that have few very big volumes and lots of small ones, 
> this allows the verification of all volumes to complete in the time the big 
> volumes take alone, thus minimal duration at constant I/O load without peaks. 
> The script prints a report to stdout and any occurrence of oos to stderr, 
> making it easy to filter for any problems -- even before monitoring notices.

Nice approach ;-)
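
As I understand the description, the scheduling could be sketched roughly
like this. This is not your actual script, just an illustration: the names
verify_all, run_verify and max_parallel are invented, and run_verify is a
placeholder for whatever starts the verify on one resource, waits for it to
finish and returns the resulting oos count.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def verify_all(volumes, run_verify, max_parallel=2):
    """Verify volumes largest-first, at most max_parallel at a time.

    volumes:    dict mapping resource name -> size (any consistent unit)
    run_verify: caller-supplied callable(name) -> oos count (hypothetical)
    """
    # Largest volumes first, so the long-running ones start immediately.
    ordered = sorted(volumes, key=volumes.get, reverse=True)
    # The pool queues tasks in submission order; a worker that finishes
    # picks up the next-largest pending volume, keeping the load constant.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = dict(zip(ordered, pool.map(run_verify, ordered)))
    for name in ordered:
        print(f"{name}: oos={results[name]}")  # full report on stdout
        if results[name] > 0:
            # Problems additionally go to stderr for easy filtering.
            print(f"{name}: out-of-sync, oos={results[name]}", file=sys.stderr)
    return results
```

A thread pool seems sufficient here because each worker would mostly wait
on the external verify run rather than compute anything itself.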


Regards,

Christoph
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
