I just applied the following settings to my cluster, and they resulted in
much better behavior in the hosted VMs:

osd_backfill_scan_min = 2
osd_backfill_scan_max = 16
osd_recovery_max_active = 1
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1
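
For what it's worth, a rough sketch of injecting the same values at
runtime so they take effect without restarting the OSDs (the flags mirror
the list above; osd_recovery_threads may still need an OSD restart to
actually change):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1 --osd-backfill-scan-min 2 --osd-backfill-scan-max 16'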

On my "canary" VM iowait dropped from a hard 50% or more to recurring wave
of nothing up to 25%, then down again, which is apparently low enough that
my users aren't noticing it. Recovery is of course taking much longer, but
since I can now do OSD maintenance operations during the day, it's a big
win.


QH

On Wed, Sep 16, 2015 at 9:42 AM, Robert LeBlanc <rob...@leblancnet.us>
wrote:

> I was out of the office for a few days. We have some more hosts to
> add. I'll send some logs for examination.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Sep 11, 2015 at 12:45 AM, GuangYang  wrote:
> > If we are talking about requests being blocked for 60+ seconds, those
> > tunings might not help (they do help a lot with average latency during
> > recovery/backfill).
> >
> > It would be interesting to see the logs for those blocked requests on
> > the OSD side (they are logged at level 0); a pattern to search for
> > might be "slow requests \d+ seconds old".
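> >
> > For illustration, something along these lines should pull them out,
> > assuming the default log location (adjust the path and the pattern to
> > taste):
> >
> > grep -E 'slow request.*seconds old' /var/log/ceph/ceph-osd.*.log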
> >
> > I had a problem where, for an object that was a recovery candidate,
> > all updates to that object would be stuck until it was recovered,
> > which could take an extremely long time if there are a large number of
> > PGs and objects to recover. But I think that was resolved by Sam's
> > change in Hammer that allows writes to degraded objects.
> >
> > ----------------------------------------
> >> Date: Thu, 10 Sep 2015 14:56:12 -0600
> >> From: rob...@leblancnet.us
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] Hammer reduce recovery impact
> >>
> >> We are trying to add some additional OSDs to our cluster, but the
> >> impact of the backfilling has been very disruptive to client I/O, and
> >> we have been trying to figure out how to reduce it. We have seen some
> >> client I/O blocked for more than 60 seconds. There has been CPU and
> >> RAM headroom on the OSD nodes, the network has been fine, and the
> >> disks have been busy but not terribly so.
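> >>
> >> For reference, a quick way to see which OSDs the blocked requests are
> >> sitting on (the exact wording of the output may vary by release):
> >>
> >> ceph health detail | grep -i blocked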
> >>
> >> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> >> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> >> S51G-1UL.
> >>
> >> Clients are QEMU VMs.
> >>
> >> [ulhglive-root@ceph5 current]# ceph --version
> >> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
> >>
> >> Some nodes are 0.94.3
> >>
> >> [ulhglive-root@ceph5 current]# ceph status
> >> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> >> health HEALTH_WARN
> >> 3 pgs backfill
> >> 1 pgs backfilling
> >> 4 pgs stuck unclean
> >> recovery 2382/33044847 objects degraded (0.007%)
> >> recovery 50872/33044847 objects misplaced (0.154%)
> >> noscrub,nodeep-scrub flag(s) set
> >> monmap e2: 3 mons at
> >> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> >> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> >> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> >> flags noscrub,nodeep-scrub
> >> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> >> 128 TB used, 322 TB / 450 TB avail
> >> 2382/33044847 objects degraded (0.007%)
> >> 50872/33044847 objects misplaced (0.154%)
> >> 2300 active+clean
> >> 3 active+remapped+wait_backfill
> >> 1 active+remapped+backfilling
> >> recovery io 70401 kB/s, 16 objects/s
> >> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
> >>
> >> Each pool is size 4 with min_size 2.
> >>
> >> One problem we have is that the requirements of the cluster changed
> >> after setting up our pools, so our PG counts are really out of whack.
> >> Our most active pool has only 256 PGs and each PG is about 120 GB in
> >> size. We are trying to clear out a pool that has way too many PGs so
> >> that we can split the PGs in that pool. I think these large PGs are
> >> part of our issues.
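> >>
> >> For illustration, the split itself would be something along these
> >> lines (pool name and target count are placeholders, and pgp_num has
> >> to be raised after pg_num):
> >>
> >> ceph osd pool set <pool> pg_num 1024
> >> ceph osd pool set <pool> pgp_num 1024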
> >>
> >> Things I've tried:
> >>
> >> * Lowered nr_requests on the spindles from 1000 to 100. This brought
> >> the max latency, which sometimes reached 3000 ms, down to 500-700 ms.
> >> It has also reduced the huge swings in latency, but has reduced
> >> throughput somewhat (these host-side tweaks are sketched after the
> >> list).
> >> * Changed the scheduler from deadline to CFQ. I'm not sure whether
> >> the OSD process gives the recovery threads a different disk priority,
> >> or whether changing the scheduler without restarting the OSD even
> >> allows the OSD to use disk priorities.
> >> * Reduced the number of osd_max_backfills from 2 to 1.
> >> * Tried setting noin to give the new OSDs time to get the PG map and
> >> peer before starting the backfill. This caused more problems than it
> >> solved, as we had blocked I/O (over 200 seconds) until we marked the
> >> new OSDs in.
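> >>
> >> Roughly what those host-side tweaks look like (sdX stands in for each
> >> OSD data disk; the sysfs paths are the usual ones):
> >>
> >> echo 100 > /sys/block/sdX/queue/nr_requests
> >> echo cfq > /sys/block/sdX/queue/scheduler
> >> ceph osd set noin    # and later: ceph osd unset noin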
> >>
> >> Even adding one OSD disk into the cluster is causing these slow I/O
> >> messages. We still have 5 more disks to add from this server and four
> >> more servers to add.
> >>
> >> In addition to trying to minimize these impacts, would it be better
> >> to split the PGs and then add the rest of the servers, or add the
> >> servers and then do the PG split? I'm thinking splitting first would
> >> be better, but I'd like to get other opinions.
> >>
> >> No spindle stays at high utilization for long, and await usually
> >> drops below 20 ms within 10 seconds, so I/O should be serviced
> >> "pretty quickly". My next guess is that the journals are getting full
> >> and blocking while waiting for flushes, but I'm not exactly sure how
> >> to identify that. We are using the defaults for the journal except
> >> for size (10G). We'd like to keep the journals large to handle
> >> bursts, but if they are getting filled with backfill traffic, it may
> >> be counterproductive. Can/does backfill/recovery bypass the journal?
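> >>
> >> To check what the journal and flush settings actually are on a given
> >> OSD, its admin socket can be queried roughly like this (osd.0 is just
> >> an example id; run it on that OSD's host):
> >>
> >> ceph daemon osd.0 config show | grep -E 'journal|filestore.*sync'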
> >>
> >> Thanks,
> >>
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
