I am not an expert on that, but these settings should make backfill go slower and thus cause less degradation of client I/O. You may want to try them.
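A minimal sketch of applying the four settings quoted below, assuming a Hammer-era cluster: the values can live in ceph.conf (persistent) or be injected into running OSDs with `injectargs` (takes effect immediately, but does not survive a restart):

```shell
# In ceph.conf on each OSD node, [osd] section (persistent):
#   osd recovery max active = 1
#   osd max backfills = 1
#   osd recovery threads = 1
#   osd recovery op priority = 1

# Or inject into all running OSDs at once, no restart needed:
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-threads 1 --osd-recovery-op-priority 1'
```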
Thanks & Regards
Somnath

-----Original Message-----
From: Robert LeBlanc [mailto:rob...@leblancnet.us]
Sent: Thursday, September 10, 2015 3:16 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer reduce recovery impact

Do the recovery options kick in when there is only backfill going on?
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> We are trying to add some additional OSDs to our cluster, but the impact of the backfilling has been very disruptive to client I/O, and we have been trying to figure out how to reduce it. We have seen some client I/O blocked for more than 60 seconds. There has been CPU and RAM headroom on the OSD nodes, the network has been fine, and the disks have been busy but not terribly so.
>
> 11 OSD servers: 10 x 4TB disks with two Intel S3500 SSDs for journals (10 GB each), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640, Quanta S51G-1UL.
>
> Clients are QEMU VMs.
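To confirm which values a daemon is actually running with, the admin socket can be queried on the OSD node; a sketch, where the OSD id and the default admin-socket path are assumptions:

```shell
# Dump the effective runtime config of one OSD and filter for the
# backfill/recovery knobs discussed in this thread:
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'
```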
> [ulhglive-root@ceph5 current]# ceph --version
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3.
>
> [ulhglive-root@ceph5 current]# ceph status
>     cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>      health HEALTH_WARN
>             3 pgs backfill
>             1 pgs backfilling
>             4 pgs stuck unclean
>             recovery 2382/33044847 objects degraded (0.007%)
>             recovery 50872/33044847 objects misplaced (0.154%)
>             noscrub,nodeep-scrub flag(s) set
>      monmap e2: 3 mons at {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
>             election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>      osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
>             flags noscrub,nodeep-scrub
>       pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
>             128 TB used, 322 TB / 450 TB avail
>             2382/33044847 objects degraded (0.007%)
>             50872/33044847 objects misplaced (0.154%)
>                 2300 active+clean
>                    3 active+remapped+wait_backfill
>                    1 active+remapped+backfilling
>   recovery io 70401 kB/s, 16 objects/s
>   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after we set up our pools, so our PG counts are really out of whack. Our most active pool has only 256 PGs, and each PG is about 120 GB in size. We are trying to clear out a pool that has way too many PGs so that we can split the PGs in the busy pool. I think these large PGs are part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max latency, which was sometimes up to 3000 ms, down to a max of 500-700 ms. It has also reduced the huge swings in latency, but has reduced throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the OSD process gives the recovery threads a different disk priority, or if changing the scheduler without restarting the OSD allows the OSD to use disk priorities.
> * Reduced osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer before starting the backfill. This caused more problems than it solved, as we had blocked I/O (over 200 seconds) until we marked the new OSDs in.
>
> Even adding one OSD disk to the cluster causes these slow I/O messages. We still have 5 more disks to add from this server, and four more servers to add after that.
>
> In addition to trying to minimize these impacts, would it be better to split the PGs and then add the rest of the servers, or to add the servers and then do the PG split? I'm thinking splitting first would be better, but I'd like to get other opinions.
>
> No spindle stays at high utilization for long, and await drops below 20 ms usually within 10 seconds, so I/O should be serviced "pretty quickly". My next guess is that the journals are getting full and blocking while waiting for flushes, but I'm not exactly sure how to identify that. We are using the defaults for the journal except for size (10 GB). We'd like the journals to be large enough to handle bursts, but if they are filling up with backfill traffic, that may be counterproductive. Can/does backfill/recovery bypass the journal?
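On the PG-split question, the usual rule of thumb is to target on the order of 100 PGs per OSD, divide by the pool's replica count, and round up to a power of two. A quick sketch with the numbers from the status output above (the 100-PGs-per-OSD target is the common guideline, not something stated in this thread):

```shell
# ~100 PGs per OSD, divided by replica count, rounded up to a power of two.
osds=125 size=4 target=100
raw=$(( osds * target / size ))     # 3125
pgs=1
while [ "$pgs" -lt "$raw" ]; do pgs=$(( pgs * 2 )); done
echo "$pgs"                         # 4096

# Current PG size in the busy pool: 32903 GB of data over 256 PGs.
echo $(( 32903 / 256 ))             # 128 (GB per PG)
```

That suggests the busy pool is roughly a factor of 16 short on PGs, which is consistent with the ~120 GB PGs described above.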
> Thanks,
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com