I just applied the following settings to my cluster and it resulted in much better behavior in the hosted VMs:
osd_backfill_scan_min = 2
osd_backfill_scan_max = 16
osd_recovery_max_active = 1
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1

On my "canary" VM, iowait dropped from a steady 50% or more to a
recurring wave that climbs from nothing to about 25% and falls back
again, which is apparently low enough that my users aren't noticing
it. Recovery is of course taking much longer, but since I can now do
OSD maintenance operations during the day, it's a big win.
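In case it saves someone a lookup, this is roughly how I applied them:
injected at runtime first, then persisted in ceph.conf. Treat this as
a sketch rather than gospel; run injectargs from a node with an admin
keyring, and note that osd_recovery_threads resizes a thread pool, so
it probably only takes full effect after an OSD restart:

    # runtime, cluster-wide
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
    ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
    ceph tell osd.* injectargs '--osd_backfill_scan_min 2 --osd_backfill_scan_max 16'

    # persisted in /etc/ceph/ceph.conf, picked up on OSD restart
    [osd]
    osd_max_backfills = 1
    osd_recovery_max_active = 1
    osd_recovery_op_priority = 1
    osd_recovery_threads = 1
    osd_backfill_scan_min = 2
    osd_backfill_scan_max = 16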
QH

On Wed, Sep 16, 2015 at 9:42 AM, Robert LeBlanc <rob...@leblancnet.us> wrote:
> I was out of the office for a few days. We have some more hosts to
> add. I'll send some logs for examination.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Fri, Sep 11, 2015 at 12:45 AM, GuangYang wrote:
> > If we are talking about requests being blocked for 60+ seconds, those
> > tunings might not help (they do help a lot with average latency during
> > recovery/backfilling).
> >
> > It would be interesting to see the logs for those blocked requests on
> > the OSD side (they are logged at level 0); a pattern to search for
> > might be "slow requests \d+ seconds old".
> >
> > I had a problem where all updates to a recovery-candidate object were
> > stuck until the object was recovered, which could take an extremely
> > long time when there were a large number of PGs and objects to
> > recover. But I think Sam resolved that in Hammer by allowing writes to
> > degraded objects.
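(A concrete version of that log search, for anyone else chasing this;
the path assumes a default install, and the exact message wording
varies a little between releases:

    grep -E 'slow requests? [0-9.]+ seconds old' /var/log/ceph/ceph-osd.*.log

The matching lines include the op description and what the request is
currently waiting for, which points at the PG and object involved.)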
> >
> > ----------------------------------------
> >> Date: Thu, 10 Sep 2015 14:56:12 -0600
> >> From: rob...@leblancnet.us
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] Hammer reduce recovery impact
> >>
> >> We are trying to add some additional OSDs to our cluster, but the
> >> impact of the backfilling has been very disruptive to client I/O,
> >> and we have been trying to figure out how to reduce it. We have
> >> seen some client I/O blocked for more than 60 seconds. There has
> >> been CPU and RAM headroom on the OSD nodes, the network has been
> >> fine, and the disks have been busy but not terrible.
> >>
> >> 11 OSD servers: 10 4TB disks each, with two Intel S3500 SSDs for
> >> journals (10 GB), dual 40Gb Ethernet, 64 GB RAM, single E5-2640
> >> CPU, Quanta S51G-1UL.
> >>
> >> Clients are QEMU VMs.
> >>
> >> [ulhglive-root@ceph5 current]# ceph --version
> >> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
> >>
> >> Some nodes are 0.94.3.
> >>
> >> [ulhglive-root@ceph5 current]# ceph status
> >>     cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> >>      health HEALTH_WARN
> >>             3 pgs backfill
> >>             1 pgs backfilling
> >>             4 pgs stuck unclean
> >>             recovery 2382/33044847 objects degraded (0.007%)
> >>             recovery 50872/33044847 objects misplaced (0.154%)
> >>             noscrub,nodeep-scrub flag(s) set
> >>      monmap e2: 3 mons at
> >> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> >>             election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> >>      osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> >>             flags noscrub,nodeep-scrub
> >>       pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> >>             128 TB used, 322 TB / 450 TB avail
> >>             2382/33044847 objects degraded (0.007%)
> >>             50872/33044847 objects misplaced (0.154%)
> >>                 2300 active+clean
> >>                    3 active+remapped+wait_backfill
> >>                    1 active+remapped+backfilling
> >>   recovery io 70401 kB/s, 16 objects/s
> >>   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
> >>
> >> Each pool is size 4 with min_size 2.
> >>
> >> One problem we have is that the requirements of the cluster changed
> >> after we set up our pools, so our PG counts are really out of whack.
> >> Our most active pool has only 256 PGs, and each PG is about 120 GB
> >> in size. We are trying to clear out a pool that has way too many PGs
> >> so that we can then split the PGs in the active pool. I think these
> >> large PGs are part of our problem.
> >>
> >> Things I've tried:
> >>
> >> * Lowered nr_requests on the spindles from 1000 to 100. This brought
> >> the max latency, which sometimes reached 3000 ms, down to 500-700 ms.
> >> It has also reduced the huge swings in latency, but it has reduced
> >> throughput somewhat as well.
> >> * Changed the scheduler from deadline to CFQ. I'm not sure whether
> >> the OSD process gives the recovery threads a different disk priority,
> >> or whether changing the scheduler without restarting the OSD even
> >> lets the OSD use disk priorities.
> >> * Reduced osd_max_backfills from 2 to 1.
> >> * Tried setting noin to give the new OSDs time to get the PG map and
> >> peer before starting the backfill. This caused more problems than it
> >> solved, as we had blocked I/O (over 200 seconds) until we marked the
> >> new OSDs in.
> >>
> >> Even adding a single OSD disk to the cluster causes these slow I/O
> >> messages. We still have 5 more disks to add from this server, and
> >> four more servers to add after that.
> >>
> >> Besides minimizing these impacts, would it be better to split the
> >> PGs and then add the rest of the servers, or to add the servers and
> >> then do the PG split? I'm thinking splitting first would be better,
> >> but I'd like other opinions.
> >>
> >> No spindle stays at high utilization for long, and await usually
> >> drops below 20 ms within 10 seconds, so I/O should be serviced
> >> "pretty quick". My next guess is that the journals are getting full
> >> and blocking while waiting for flushes, but I'm not sure how to
> >> identify that. We are using the defaults for the journal except for
> >> size (10 GB). We'd like the journals to be large enough to absorb
> >> bursts, but if they are getting filled with backfill traffic, that
> >> may be counterproductive. Can/does backfill/recovery bypass the
> >> journal?
> >>
> >> Thanks,
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
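Two notes on Robert's open questions, neither verified on this exact
cluster. On the CFQ point: there are OSD options that set an ioprio
for the disk thread, and they only do anything when the scheduler is
CFQ. As far as I understand, the disk thread covers scrub and similar
background work rather than the recovery ops themselves, so this may
or may not help here:

    ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'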
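On the journal question: as far as I know, no; with FileStore every
write, including backfill and recovery traffic, goes through the
journal. To see whether the journals are the choke point, one rough
approach is to watch the journal-related perf counters on an OSD
admin socket alongside the journal SSDs themselves (substitute your
own OSD id and device):

    ceph daemon osd.0 perf dump | python -mjson.tool | grep -i journal
    iostat -x sdb 1    # journal SSD; watch %util and await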
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com