I am not an expert on that, but these settings should make backfill go slower and thus cause less degradation of client I/O. You may want to try them.
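A minimal sketch of applying the four settings quoted below, assuming a Hammer-era cluster: the values can live in ceph.conf (persistent) or be injected into running OSDs with `injectargs` (takes effect immediately, but does not survive a restart):

```shell
# In ceph.conf on each OSD node, [osd] section (persistent):
#   osd recovery max active = 1
#   osd max backfills = 1
#   osd recovery threads = 1
#   osd recovery op priority = 1

# Or inject into all running OSDs at once, no restart needed:
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-threads 1 --osd-recovery-op-priority 1'
```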
Thanks & Regards
Somnath

-----Original Message-----
From: Robert LeBlanc [mailto:rob...@leblancnet.us]
Sent: Thursday, September 10, 2015 3:16 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer reduce recovery impact

Do the recovery options kick in when there is only backfill going on?
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> We are trying to add some additional OSDs to our cluster, but the impact of the backfilling has been very disruptive to client I/O, and we have been trying to figure out how to reduce it. We have seen some client I/O blocked for more than 60 seconds. There has been CPU and RAM headroom on the OSD nodes, the network has been fine, and the disks have been busy but not terribly so.
>
> 11 OSD servers: 10 x 4TB disks with two Intel S3500 SSDs for journals (10 GB each), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640, Quanta S51G-1UL.
>
> Clients are QEMU VMs.
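To confirm which values a daemon is actually running with, the admin socket can be queried on the OSD node; a sketch, where the OSD id and the default admin-socket path are assumptions:

```shell
# Dump the effective runtime config of one OSD and filter for the
# backfill/recovery knobs discussed in this thread:
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'
```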
> [ulhglive-root@ceph5 current]# ceph --version
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3.
>
> [ulhglive-root@ceph5 current]# ceph status
>     cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>      health HEALTH_WARN
>             3 pgs backfill
>             1 pgs backfilling
>             4 pgs stuck unclean
>             recovery 2382/33044847 objects degraded (0.007%)
>             recovery 50872/33044847 objects misplaced (0.154%)
>             noscrub,nodeep-scrub flag(s) set
>      monmap e2: 3 mons at {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
>             election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>      osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
>             flags noscrub,nodeep-scrub
>       pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
>             128 TB used, 322 TB / 450 TB avail
>             2382/33044847 objects degraded (0.007%)
>             50872/33044847 objects misplaced (0.154%)
>                 2300 active+clean
>                    3 active+remapped+wait_backfill
>                    1 active+remapped+backfilling
>   recovery io 70401 kB/s, 16 objects/s
>   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after we set up our pools, so our PG counts are really out of whack. Our most active pool has only 256 PGs, and each PG is about 120 GB in size. We are trying to clear out a pool that has way too many PGs so that we can split the PGs in the busy pool. I think these large PGs are part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max latency, which was sometimes up to 3000 ms, down to a max of 500-700 ms. It has also reduced the huge swings in latency, but has reduced throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the OSD process gives the recovery threads a different disk priority, or if changing the scheduler without restarting the OSD allows the OSD to use disk priorities.
> * Reduced osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer before starting the backfill. This caused more problems than it solved, as we had blocked I/O (over 200 seconds) until we marked the new OSDs in.
>
> Even adding one OSD disk to the cluster causes these slow I/O messages. We still have 5 more disks to add from this server, and four more servers to add after that.
>
> In addition to trying to minimize these impacts, would it be better to split the PGs and then add the rest of the servers, or to add the servers and then do the PG split? I'm thinking splitting first would be better, but I'd like to get other opinions.
>
> No spindle stays at high utilization for long, and await drops below 20 ms usually within 10 seconds, so I/O should be serviced "pretty quickly". My next guess is that the journals are getting full and blocking while waiting for flushes, but I'm not exactly sure how to identify that. We are using the defaults for the journal except for size (10 GB). We'd like the journals to be large enough to handle bursts, but if they are filling up with backfill traffic, that may be counterproductive. Can/does backfill/recovery bypass the journal?
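On the PG-split question, the usual rule of thumb is to target on the order of 100 PGs per OSD, divide by the pool's replica count, and round up to a power of two. A quick sketch with the numbers from the status output above (the 100-PGs-per-OSD target is the common guideline, not something stated in this thread):

```shell
# ~100 PGs per OSD, divided by replica count, rounded up to a power of two.
osds=125 size=4 target=100
raw=$(( osds * target / size ))     # 3125
pgs=1
while [ "$pgs" -lt "$raw" ]; do pgs=$(( pgs * 2 )); done
echo "$pgs"                         # 4096

# Current PG size in the busy pool: 32903 GB of data over 256 PGs.
echo $(( 32903 / 256 ))             # 128 (GB per PG)
```

That suggests the busy pool is roughly a factor of 16 short on PGs, which is consistent with the ~120 GB PGs described above.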
> Thanks,
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com