Perhaps, but if that were the case, would you expect the maximum number of concurrent deep-scrubs to approach the number of OSDs in the cluster?

I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak at a max of 12. Do pools (two in use) and replication settings (3 copies in both pools) factor in?

72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs

That seems plausible (without looking at the code).

But if I run 'ceph osd set nodeep-scrub' and then 'ceph osd unset nodeep-scrub', the count of concurrent deep-scrubs doesn't return to that high level; instead it stays low, seemingly for days at a time, until the next onslaught. If this were driven by the max scrub interval, shouldn't it jump back up quickly?
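
For reference, here is roughly how I have been toggling the flag and counting concurrent deep-scrubs (just a rough sketch; the grep pattern assumes PG states are reported like 'active+clean+scrubbing+deep'):

  # pause and later resume deep-scrubbing cluster-wide
  ceph osd set nodeep-scrub
  ceph osd unset nodeep-scrub

  # count PGs currently in a deep-scrub state
  ceph pg dump | grep -c 'scrubbing+deep'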

Is there a way to find the last scrub time for a given PG via the CLI, to know for sure?
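
(The closest thing I have found so far, assuming I am reading the output correctly, is the scrub stamp fields in the PG stats; '2.3f' below is just a placeholder PG id:)

  # cluster-wide: look for the scrub_stamp / deep_scrub_stamp columns
  ceph pg dump | less

  # single PG ('2.3f' is a placeholder PG id, substitute a real one)
  ceph pg 2.3f query | grep -i scrub_stamp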

Thanks,
Mike Dawson

On 5/7/2014 10:59 PM, Gregory Farnum wrote:
Is it possible you're running into the max scrub intervals and jumping
up to one-per-OSD from a much lower normal rate?
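
If it helps, the scrub-related settings can be read from an OSD's admin socket (a rough sketch; it assumes the default socket path and uses osd.0 as an example):

  # on the host running osd.0, dump the scrub-related options
  # (e.g. osd_max_scrubs, osd_scrub_min_interval, osd_scrub_max_interval,
  #  osd_deep_scrub_interval)
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep scrub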

On Wednesday, May 7, 2014, Mike Dawson <[email protected]> wrote:

    My write-heavy cluster struggles under the additional load created
    by deep-scrub from time to time. As I have instrumented the cluster
    more, it has become clear that there is something I cannot explain
    happening in the scheduling of PGs to undergo deep-scrub.

    Please refer to these images [0][1] for two graphical
    representations of how deep-scrub goes awry in my cluster. They show
    two separate incidents. Both show a period of "happy" scrubs and
    deep-scrubs with stable writes/second across the cluster, followed
    by an approximately 5x jump in concurrent deep-scrubs during which
    client IO is cut by nearly 50%.

    The first image (deep-scrub-issue1.jpg) shows a happy cluster with
    low numbers of scrubs and deep-scrubs running until about 10pm, when
    something triggers deep-scrubs to increase roughly 5x and remain
    high until I manually run 'ceph osd set nodeep-scrub' at
    approximately 10am. During the period of higher concurrent
    deep-scrubs, IOPS drop significantly due to OSD spindle contention,
    which prevents qemu/rbd clients from writing normally.

    The second image (deep-scrub-issue2.jpg) shows a similar approx 5x
    jump in concurrent deep-scrubs and associated drop in writes/second.
    This image also adds a summary of 'dump historic ops', which shows
    the expected jump in the slowest ops in the cluster.

    Does anyone have an idea of what is happening when the spike in
    concurrent deep-scrubs occurs, and how to prevent the adverse
    effects, outside of disabling deep-scrub permanently?

    0: http://www.mikedawson.com/deep-scrub-issue1.jpg
    1: http://www.mikedawson.com/deep-scrub-issue2.jpg

    Thanks,
    Mike Dawson



--
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
