Perhaps, but if that were the case, would you expect the maximum number
of concurrent deep-scrubs to approach the number of OSDs in the cluster?
I have 72 OSDs in this cluster and concurrent deep-scrubs seem to peak
at a max of 12. Do pools (two in use) and replication settings (3 copies
in both pools) factor in?
72 OSDs / (2 pools * 3 copies) = 12 max concurrent deep-scrubs
That seems plausible (without looking at the code).
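(For what it's worth, this is roughly how I'm counting the concurrent
deep-scrubs, grouped by pool. It's just a sketch; the 'pgs_brief' column
layout and state strings could differ on other versions:)

    # count PGs currently deep-scrubbing, per pool: match lines whose state
    # contains scrubbing+deep and take the pool id from the pgid (column 1,
    # the part before the dot)
    ceph pg dump pgs_brief 2>/dev/null | awk '/scrubbing\+deep/ {
        split($1, a, "."); count[a[1]]++
    } END { for (p in count) print "pool " p ": " count[p] " deep-scrubbing" }'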
But if I run 'ceph osd set nodeep-scrub' and then 'ceph osd unset
nodeep-scrub', the count of concurrent deep-scrubs doesn't return to the
previous high level; instead it stays low, seemingly for days at a time,
until the next onslaught. If this were driven by the max scrub interval,
shouldn't it jump quickly back up?
Is there a way to find the last scrub time for a given PG via the CLI, so
I can know for sure?
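(The closest I've come up with is something like the following; I haven't
confirmed the field names are identical across versions, so treat it as a
guess:)

    # per-PG scrub timestamps: 'ceph pg <pgid> query' reports fields like
    # last_scrub_stamp / last_deep_scrub_stamp in its JSON output
    # (<pgid> is a placeholder, e.g. 2.3f)
    ceph pg <pgid> query | grep scrub_stamp

    # or dump them for every PG at once and eyeball the stamp columns
    # (column layout varies by version)
    ceph pg dump 2>/dev/null | less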
Thanks,
Mike Dawson
On 5/7/2014 10:59 PM, Gregory Farnum wrote:
Is it possible you're running into the max scrub intervals and jumping
up to one-per-OSD from a much lower normal rate?
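(Roughly speaking, the intervals I mean are the osd_scrub_min_interval /
osd_scrub_max_interval / osd_deep_scrub_interval settings. You can check
what an OSD is actually running with via its admin socket, something like
this, assuming the default socket path:)

    # show the scrub-related settings one OSD is actually using
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep scrub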
On Wednesday, May 7, 2014, Mike Dawson <[email protected]> wrote:
My write-heavy cluster struggles under the additional load created
by deep-scrub from time to time. As I have instrumented the cluster
more, it has become clear that there is something I cannot explain
happening in the scheduling of PGs to undergo deep-scrub.
Please refer to these images [0][1] to see two graphical
representations of how deep-scrub goes awry in my cluster. These
were two separate incidents. Both show a period of "happy" scrubs and
deep-scrubs with stable writes/second across the cluster, then an
approximately 5x jump in concurrent deep-scrubs during which client IO is
cut by nearly 50%.
The first image (deep-scrub-issue1.jpg) shows a happy cluster with
low numbers of scrub and deep-scrub running until about 10pm, then
something triggers deep-scrubs to increase about 5x and remain high
until I manually run 'ceph osd set nodeep-scrub' at approx 10am. During
the period of higher concurrent deep-scrubs, IOPS drop significantly due
to OSD spindle contention, which prevents qemu/rbd clients from writing
as they normally would.
The second image (deep-scrub-issue2.jpg) shows a similar approx 5x
jump in concurrent deep-scrubs and associated drop in writes/second.
This image also adds a summary of 'dump historic ops', which shows the
expected jump in the slowest ops in the cluster.
Does anyone have an idea of what is happening when the spike in
concurrent deep-scrub occurs and how to prevent the adverse effects,
outside of disabling deep-scrub permanently?
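(The only knobs I'm aware of so far are along these lines in ceph.conf;
the values below are just placeholders I'm considering, not
recommendations:)

    [osd]
    # cap concurrent scrub operations per OSD (default is 1, I believe)
    osd max scrubs = 1
    # stretch the deep-scrub interval out (seconds)
    osd deep scrub interval = 1209600
    # skip starting scrubs when the host load is above this
    osd scrub load threshold = 0.5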
0: http://www.mikedawson.com/deep-scrub-issue1.jpg
1: http://www.mikedawson.com/deep-scrub-issue2.jpg
Thanks,
Mike Dawson
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com