Re: [ceph-users] Hammer reduce recovery impact

2015-09-16 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I was out of the office for a few days. We have some more hosts to
add. I'll send some logs for examination.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 11, 2015 at 12:45 AM, GuangYang  wrote:
> If we are talking about requests being blocked for 60+ seconds, those tunings 
> might not help (they help a lot with average latency during 
> recovery/backfilling).
>
> It would be interesting to see the logs for those blocked requests on the OSD 
> side (they are logged at level 0); the pattern to search for might be 
> "slow requests \d+ seconds old".
>
> I had a problem where, for a recovery candidate object, all updates to that 
> object would be stuck until it was recovered, which could take an extremely 
> long time if there were a large number of PGs and objects to recover. But I 
> think Sam resolved that in Hammer by allowing writes to degraded objects.
>
> 
>> Date: Thu, 10 Sep 2015 14:56:12 -0600
>> From: rob...@leblancnet.us
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Hammer reduce recovery impact
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> We are trying to add some additional OSDs to our cluster, but the
>> impact of the backfilling has been very disruptive to client I/O and
>> we have been trying to figure out how to reduce the impact. We have
>> seen some client I/O blocked for more than 60 seconds. There has been
>> CPU and RAM head room on the OSD nodes, network has been fine, disks
>> have been busy, but not terrible.
>>
>> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
>> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
>> S51G-1UL.
>>
>> Clients are QEMU VMs.
>>
>> [ulhglive-root@ceph5 current]# ceph --version
>> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>>
>> Some nodes are 0.94.3
>>
>> [ulhglive-root@ceph5 current]# ceph status
>> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>> health HEALTH_WARN
>> 3 pgs backfill
>> 1 pgs backfilling
>> 4 pgs stuck unclean
>> recovery 2382/33044847 objects degraded (0.007%)
>> recovery 50872/33044847 objects misplaced (0.154%)
>> noscrub,nodeep-scrub flag(s) set
>> monmap e2: 3 mons at
>> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
>> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
>> flags noscrub,nodeep-scrub
>> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
>> 128 TB used, 322 TB / 450 TB avail
>> 2382/33044847 objects degraded (0.007%)
>> 50872/33044847 objects misplaced (0.154%)
>> 2300 active+clean
>> 3 active+remapped+wait_backfill
>> 1 active+remapped+backfilling
>> recovery io 70401 kB/s, 16 objects/s
>> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>>
>> Each pool is size 4 with min_size 2.
>>
>> One problem we have is that the requirements of the cluster changed
>> after setting up our pools, so our PGs are really out of whack. Our
>> most active pool has only 256 PGs and each PG is about 120 GB in size.
>> We are trying to clear out a pool that has way too many PGs so that we
>> can split the PGs in that pool. I think these large PGs are part of our
>> issues.
>>
>> Things I've tried:
>>
>> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
>> the max latency, which was sometimes up to 3000 ms, down to a max of
>> 500-700 ms. It has also reduced the huge swings in latency, but has
>> also reduced throughput somewhat.
>> * Changed the scheduler from deadline to CFQ. I'm not sure if the
>> OSD process gives the recovery threads a different disk priority or if
>> changing the scheduler without restarting the OSD allows the OSD to
>> use disk priorities.
>> * Reduced the number of osd_max_backfills from 2 to 1.
>> * Tried setting noin to give the new OSDs time to get the PG map and
>> peer before starting the backfill. This caused more problems than it
>> solved as we had blocked I/O (over 200 seconds) until we set the new
>> OSDs to in.
>>
>> Even adding one OSD disk into the cluster is causing these slow I/O
>> messages. We still have 5 more disks to add from this server and four
>> more servers to add.
>>
>> In addition to trying to minimize these impacts, would it be better to
>> split the PGs and then add the rest of the servers, or add the servers
>> and then do the PG split? I'm thinking splitting first would be better,
>> but I'd like to get other opinions.
>>
>> No spindle stays at high utilization for long and the await drops
>> below 20 ms usually within 10 seconds so I/O should be serviced
>> "pretty quick". My next guess is that the journals are getting full
>> and blocking while waiting for flushes, but I'm not exactly sure how
>> to identify that. We are using the defaults for the journal except for
>> size (10G). We'd like to have large journals to handle bursts, but if
>> they are getting filled with backfill traffic, it may be
>> counterproductive. Can/does backfill/recovery bypass the journal?

Re: [ceph-users] Hammer reduce recovery impact

2015-09-11 Thread GuangYang
If we are talking about requests being blocked for 60+ seconds, those tunings might 
not help (they help a lot with average latency during recovery/backfilling).

It would be interesting to see the logs for those blocked requests on the OSD side 
(they are logged at level 0); the pattern to search for might be "slow requests \d+ 
seconds old".
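
For example, something along these lines should pull them out (assuming the
default log location; the exact wording of the message varies a bit between
releases):

  grep -E 'slow request.* seconds old' /var/log/ceph/ceph-osd.*.log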

I had a problem where, for a recovery candidate object, all updates to that 
object would be stuck until it was recovered, which could take an extremely long 
time if there were a large number of PGs and objects to recover. But I think Sam 
resolved that in Hammer by allowing writes to degraded objects.


> Date: Thu, 10 Sep 2015 14:56:12 -0600
> From: rob...@leblancnet.us
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
> monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
> pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
> 3 active+remapped+wait_backfill
> 1 active+remapped+backfilling
> recovery io 70401 kB/s, 16 objects/s
> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed
> after setting up our pools, so our PGs are really out of whack. Our
> most active pool has only 256 PGs and each PG is about 120 GB in size.
> We are trying to clear out a pool that has way too many PGs so that we
> can split the PGs in that pool. I think these large PGs are part of our
> issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> the max latency, which was sometimes up to 3000 ms, down to a max of
> 500-700 ms. It has also reduced the huge swings in latency, but has
> also reduced throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the
> OSD process gives the recovery threads a different disk priority or if
> changing the scheduler without restarting the OSD allows the OSD to
> use disk priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and
> peer before starting the backfill. This caused more problems than it
> solved as we had blocked I/O (over 200 seconds) until we set the new
> OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O
> messages. We still have 5 more disks to add from this server and four
> more servers to add.
>
> In addition to trying to minimize these impacts, would it be better to
> split the PGs and then add the rest of the servers, or add the servers
> and then do the PG split? I'm thinking splitting first would be better,
> but I'd like to get other opinions.
>
> No spindle stays at high utilization for long and the await drops
> below 20 ms usually within 10 seconds so I/O should be serviced
> "pretty quick". My next guess is that the journals are getting full
> and blocking while waiting for flushes, but I'm not exactly sure how
> to identify that. We are using the defaults for the journal except for
> size (10G). We'd like to have large journals to handle bursts, but if
> they are getting filled with backfill traffic, it may be
> counterproductive. Can/does backfill/recovery bypass the journal?
>
> Thanks,
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJV8e5qCRDmVDuy+mK58QAAaIwQAMN5DJlhrZkqwqsVXaKB
> 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-11 Thread Paweł Sadowski
On 09/10/2015 10:56 PM, Robert LeBlanc wrote:
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> the max latency, which was sometimes up to 3000 ms, down to a max of
> 500-700 ms. It has also reduced the huge swings in latency, but has
> also reduced throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the
> OSD process gives the recovery threads a different disk priority or if
> changing the scheduler without restarting the OSD allows the OSD to
> use disk priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and
> peer before starting the backfill. This caused more problems than it
> solved as we had blocked I/O (over 200 seconds) until we set the new
> OSDs to in.

You can also try lowering these settings (from their defaults):

  "osd_backfill_scan_min": "64",
  "osd_backfill_scan_max": "512",

In our case we've set them to 1 and 8. It helps a lot, but recovery
will take more time.
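
If it helps, applying that looks roughly like this (at runtime, or persistently
in the [osd] section of ceph.conf):

  ceph tell osd.* injectargs '--osd-backfill-scan-min 1 --osd-backfill-scan-max 8'

  [osd]
  osd backfill scan min = 1
  osd backfill scan max = 8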

-- 
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Christian Balzer

Hello,

On Thu, 10 Sep 2015 16:16:10 -0600 Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Do the recovery options kick in when there is only backfill going on?
>
Aside from having these set just in case, as your cluster (and one of mine)
is clearly at the limits of its abilities, that's a good question.

Recovery and backfill are a bit blurry, and judging by my logs from yesterday,
when I was testing ways to ease new OSDs into my test cluster, they clearly
can happen at the same time.

It would be nice if somebody in the know (i.e. the devs) would pipe up here.

What happens in the following scenarios?

1. OSD fails, is set out, etc. PGs get moved around. -> Recovery
2. Same OSD is brought back in. PGs move to their original OSDs. Recovery
or backfill?
3. New bucket (host or OSD) is added to the crush map, causing minor PG
reshuffles. Recovery or backfill?
4. The same OSD added in 3 is set "in", started. Backfill, one would
assume.

But this is a log entry from a situation like 4:
---
2015-09-10 15:53:30.084063 mon.0 203.216.0.33:6789/0 6254 : [INF] pgmap v791755:
896 pgs: 45 active+remapped+wait_backfill, 2 active+remapped+backfilling,
10 active+recovery_wait, 839 active+clean; 69546 MB data, 303 GB used,
5323 GB / 5665 GB avail; 2925/54958 objects degraded (5.322%);
15638 kB/s, 3 objects/s recovering
---

I read that as both backfilling and recovery going on at the same time.
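
One way to see exactly which PGs are in which state at a given moment (rather
than relying on the pgmap summary line) is simply to filter the PG dump, e.g.:

  ceph pg dump 2>/dev/null | grep -E 'backfill|recover'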

Christian
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> > Try all these..
> >
> > osd recovery max active = 1
> > osd max backfills = 1
> > osd recovery threads = 1
> > osd recovery op priority = 1
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Robert LeBlanc Sent: Thursday, September 10, 2015 1:56 PM
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Hammer reduce recovery impact
> >
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > We are trying to add some additional OSDs to our cluster, but the
> > impact of the backfilling has been very disruptive to client I/O and
> > we have been trying to figure out how to reduce the impact. We have
> > seen some client I/O blocked for more than 60 seconds. There has been
> > CPU and RAM head room on the OSD nodes, network has been fine, disks
> > have been busy, but not terrible.
> >
> > 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> > (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> > S51G-1UL.
> >
> > Clients are QEMU VMs.
> >
> > [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2
> > (5fb85614ca8f354284c713a2f9c610860720bbf3)
> >
> > Some nodes are 0.94.3
> >
> > [ulhglive-root@ceph5 current]# ceph status
> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> >  health HEALTH_WARN
> > 3 pgs backfill
> > 1 pgs backfilling
> > 4 pgs stuck unclean
> > recovery 2382/33044847 objects degraded (0.007%)
> > recovery 50872/33044847 objects misplaced (0.154%)
> > noscrub,nodeep-scrub flag(s) set
> >  monmap e2: 3 mons at
> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> > election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> >  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> > flags noscrub,nodeep-scrub
> >   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> > 128 TB used, 322 TB / 450 TB avail
> > 2382/33044847 objects degraded (0.007%)
> > 50872/33044847 objects misplaced (0.154%)
> > 2300 active+clean
> >    3 active+remapped+wait_backfill
> >    1 active+remapped+backfilling
> > recovery io 70401 kB/s, 16 objects/s
> > client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
> >
> > Each pool is size 4 with min_size 2.
> >
> > One problem we have is that the requirements of the cluster changed
> > after setting up our pools, so our PGs are really out of whack. Our
> > most active pool has only 256 PGs and each PG is about 120 GB in size.
> > We are trying to clear out a pool that has way too many PGs so that we
> > can split the PGs in that pool. I think these large PGs are part of our
> > issues.
> >
> > Things I've tried:
> >
> > * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> > the max latency, which was sometimes up to 3000 ms, down to a max of
> > 500-700 ms. It has also reduced the huge swings in latency, but has
> > also reduced throughput somewhat.
> > * Changed the scheduler from deadline to CFQ. I'm not sure if the
> > OSD process gives the recovery threads a different disk priority or if
> > changing the scheduler without restarting the OSD allows the OSD to
> > use disk priorities.
> > * 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Somnath Roy
I am not an expert on that, but these settings will probably help backfill go 
slower and thus cause less degradation of client I/O. You may want to try them.

Thanks & Regards
Somnath

-Original Message-
From: Robert LeBlanc [mailto:rob...@leblancnet.us] 
Sent: Thursday, September 10, 2015 3:16 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer reduce recovery impact

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Do the recovery options kick in when there is only backfill going on?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the impact of 
> the backfilling has been very disruptive to client I/O and we have been 
> trying to figure out how to reduce the impact. We have seen some client I/O 
> blocked for more than 60 seconds. There has been CPU and RAM head room on the 
> OSD nodes, network has been fine, disks have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
> dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
> (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
>    3 active+remapped+wait_backfill
>    1 active+remapped+backfilling
> recovery io 70401 kB/s, 16 objects/s
> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after 
> setting up our pools, so our PGs are really out of whack. Our most active pool 
> has only 256 PGs and each PG is about 120 GB in size.
> We are trying to clear out a pool that has way too many PGs so that we can 
> split the PGs in that pool. I think these large PGs are part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
> latency, which was sometimes up to 3000 ms, down to a max of 500-700 ms.
> It has also reduced the huge swings in latency, but has also reduced 
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
> process gives the recovery threads a different disk priority or if changing 
> the scheduler without restarting the OSD allows the OSD to use disk 
> priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer 
> before starting the backfill. This caused more problems than it solved as we 
> had blocked I/O (over 200 seconds) until we set the new OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O messages. 
> We still have 5 more disks to add from this server and four more servers to 
> add.
>
> In addition to trying to minimize these impacts, would it be better to split 
> the PGs and then add the rest of the servers, or add the servers and then do 
> the PG split? I'm thinking splitting first would be better, but I'd like to 
> get other opinions.
>
> No spindle stays at high utilization for long and the await drops below 20 ms 
> usually 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lincoln Bryant

On 9/10/2015 5:39 PM, Lionel Bouton wrote:
For example deep-scrubs were a problem on our installation when at 
times there were several going on. We implemented a scheduler that 
enforces limits on simultaneous deep-scrubs and these problems are gone.


Hi Lionel,

Out of curiosity, how many was "several" in your case?

Cheers,
Lincoln
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 10/09/2015 22:56, Robert LeBlanc a écrit :
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.

It seems you've already exhausted most of the ways I know. When
confronted with this situation, I used a simple script to throttle
backfills (freezing them, then re-enabling them). This helped our VMs at
the time, but you must be prepared for very long migrations and some
experimentation with different schedules. You simply pass it the
number of seconds backfills are allowed to proceed, then the number of
seconds during which they pause.

Here's the script, which should be self-explanatory:
http://pastebin.com/sy7h1VEy

something like :

./throttler 10 120

limited the impact on our VMs (the idea being that during the 10s the
backfill won't be able to trigger filestore syncs and the 120s pause
will allow the filestore syncs to remove "dirty" data from the journals
without interfering too much with concurrent writes).
I believe you must have a high filestore sync value to hope to benefit
from this (we use 30s).
At the very least the long pause will eventually allow VMs to move data
to disk regularly instead of being nearly frozen.
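
For anyone who doesn't want to fetch the pastebin, the core of the idea is just
a loop that toggles backfills on and off. This is only the general shape, not
the actual script (the real one is at the link above):

  #!/bin/bash
  # usage: ./throttler <run_seconds> <pause_seconds>
  RUN=$1
  PAUSE=$2
  while true; do
      ceph tell osd.* injectargs '--osd-max-backfills 1'   # let backfills proceed
      sleep "$RUN"
      ceph tell osd.* injectargs '--osd-max-backfills 0'   # freeze backfills
      sleep "$PAUSE"
  done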

Note that your PGs are more than 10G each; if the OSDs can't stop a
backfill before finishing transferring the current PG this won't help (I
assume backfills go through the journals, and they probably won't be able to
act as write-back caches anymore since one PG will be enough to fill them up).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Do the recovery options kick in when there is only backfill going on?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the impact of 
> the backfilling has been very disruptive to client I/O and we have been 
> trying to figure out how to reduce the impact. We have seen some client I/O 
> blocked for more than 60 seconds. There has been CPU and RAM head room on the 
> OSD nodes, network has been fine, disks have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
> dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
> (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
>    3 active+remapped+wait_backfill
>    1 active+remapped+backfilling
> recovery io 70401 kB/s, 16 objects/s
> client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after 
> setting up our pools, so our PGs are really out of whack. Our most active pool 
> has only 256 PGs and each PG is about 120 GB in size.
> We are trying to clear out a pool that has way too many PGs so that we can 
> split the PGs in that pool. I think these large PGs are part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
> latency, which was sometimes up to 3000 ms, down to a max of 500-700 ms.
> It has also reduced the huge swings in latency, but has also reduced 
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
> process gives the recovery threads a different disk priority or if changing 
> the scheduler without restarting the OSD allows the OSD to use disk 
> priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer 
> before starting the backfill. This caused more problems than it solved as we 
> had blocked I/O (over 200 seconds) until we set the new OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O messages. 
> We still have 5 more disks to add from this server and four more servers to 
> add.
>
> In addition to trying to minimize these impacts, would it be better to split 
> the PGs and then add the rest of the servers, or add the servers and then do 
> the PG split? I'm thinking splitting first would be better, but I'd like to 
> get other opinions.
>
> No spindle stays at high utilization for long and the await drops below 20 ms 
> usually within 10 seconds so I/O should be serviced "pretty quick". My next 
> guess is that the journals are getting full and blocking while waiting for 
> flushes, but I'm not exactly sure how to identify that. We are using the 
> defaults for the journal except for size (10G). We'd like to have large 
> journals to handle bursts, but if they are getting filled with backfill 
> traffic, it may be counterproductive. Can/does backfill/recovery bypass the journal?
>
> Thanks,
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1 -BEGIN 
> PGP SIGNATURE-
> Version: Mailvelope v1.0.2
> Comment: https://www.mailvelope.com
>
> 

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I don't think the script will help our situation as it is just setting
osd_max_backfills from 1 to 0. It looks like that change doesn't go
into effect until after it finishes the current PG. It would be nice if
backfill/recovery could skip the journal, but there would have to be
some logic to handle an object being changed while it is being replicated.
Maybe just a record in the journal that an object has started and finished
restore, so the journal flush knows whether it needs to commit
the write?
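
As for telling whether the journals themselves are the bottleneck, one rough
way might be to watch the filestore/journal perf counters on a busy OSD
(counter names vary between releases, so treat this as a starting point rather
than a recipe):

  ceph daemon osd.0 perf dump | python -m json.tool | grep -iE 'journal_queue|journal_full|committing'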
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:33 PM, Lionel Bouton  wrote:
> Le 10/09/2015 22:56, Robert LeBlanc a écrit :
>> We are trying to add some additional OSDs to our cluster, but the
>> impact of the backfilling has been very disruptive to client I/O and
>> we have been trying to figure out how to reduce the impact. We have
>> seen some client I/O blocked for more than 60 seconds. There has been
>> CPU and RAM head room on the OSD nodes, network has been fine, disks
>> have been busy, but not terrible.
>
> It seems you've already exhausted most of the ways I know. When
> confronted with this situation, I used a simple script to throttle
> backfills (freezing them, then re-enabling them). This helped our VMs at
> the time, but you must be prepared for very long migrations and some
> experimentation with different schedules. You simply pass it the
> number of seconds backfills are allowed to proceed, then the number of
> seconds during which they pause.
>
> Here's the script, which should be self-explanatory:
> http://pastebin.com/sy7h1VEy
>
> something like :
>
> ./throttler 10 120
>
> limited the impact on our VMs (the idea being that during the 10s the
> backfill won't be able to trigger filestore syncs and the 120s pause
> will allow the filestore syncs to remove "dirty" data from the journals
> without interfering too much with concurrent writes).
> I believe you must have a high filestore sync value to hope to benefit
> from this (we use 30s).
> At the very least the long pause will eventually allow VMs to move data
> to disk regularly instead of being nearly frozen.
>
> Note that your PGs are more than 10G each; if the OSDs can't stop a
> backfill before finishing transferring the current PG this won't help (I
> assume backfills go through the journals, and they probably won't be able to
> act as write-back caches anymore since one PG will be enough to fill them up).
>
> Best regards,
>
> Lionel

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8gIbCRDmVDuy+mK58QAAGOgQAMLGgbrgsHF2n9ZVGxol
4X1jezsXAjrPc19U38u8JLv1kVsSal6MBh+uSt1O6RnHWT+fMYOh1knPSYgl
aWvjYP9yJ+yVnWtuz5YxRI45WJ8XvJ8V7FPUYLRxSId7IX4EToupUf30AjdD
KZfjfLgpNKz98UMmFBRporTsvIX1cHGVtN7tiqhAtRPQYMhgXCA2pyqUFkhJ
H86287DZnnXrlDOsT7e+0Gel+eYKjUF7QsUYKCUMVx1Mj5oAm9gC0ZIm+icS
YIeUOzIO8LGV3YXHWmUQClzV9w0uQ7CBvvLoCBbFjvQOgQizsOUpgXv818Fr
Fp6ihpoNKDGaQ7lylLmT8Yu4Rf+JFQn3xfLBE0lPg41CkI8/MQIQsyYLlr5D
Pdd1msxy14Y1lvRbwsNnn+ICzvz/YhbuwtTSVFT+EnRSwc+fkRhKi1ipB1Zx
5zyvVI0ge8SRIelXYfueBmC/LCxjYp9ntfSSQujxlVejgUCxmG3HTd3TvBcn
SdyA7F5sQOpOSK+Hc/eRGwxYgWq4r/jd3TJQt6F2qRHi/nx2K4oFFv6r6SgT
zkDdZewlE+kVx8GkKnB4h1xI3DhGsIyPaS7rCSqy1DrMmxUSFFGgYto7umok
s5cpOeq35owbiv9Da8t3MCzoZvYfhuXCitWn+Jl69v5vfGHm6ha4A59mcigz
S9DN
=6xla
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 11/09/2015 00:20, Robert LeBlanc a écrit :
> I don't think the script will help our situation as it is just setting
> osd_max_backfills from 1 to 0. It looks like that change doesn't go
> into effect until after it finishes the current PG.

That was what I was afraid of. Note that it should help a little anyway
(if not, that's worrying; setting backfills to 0 completely should solve
your clients' I/O problems in a matter of minutes).
You may have better results by allowing backfills on only a few of your
OSDs at a time. For example, deep-scrubs were a problem on our
installation when at times there were several going on. We implemented a
scheduler that enforces limits on simultaneous deep-scrubs, and those
problems are gone.
That's a last resort and rough around the edges, but if every other means
of reducing the impact on your clients has failed, it's the best you
can hope for.
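
A very rough sketch of that kind of limiter, for illustration only (not our
actual scheduler), would be to count the PGs currently deep scrubbing and
toggle the nodeep-scrub flag around a threshold:

  LIMIT=2
  while sleep 60; do
      n=$(ceph pg dump 2>/dev/null | grep -c 'scrubbing+deep')
      if [ "$n" -ge "$LIMIT" ]; then
          ceph osd set nodeep-scrub      # at the limit: stop launching new deep scrubs
      else
          ceph osd unset nodeep-scrub    # under the limit: let them start again
      fi
  done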

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 11/09/2015 01:24, Lincoln Bryant a écrit :
> On 9/10/2015 5:39 PM, Lionel Bouton wrote:
>> For example deep-scrubs were a problem on our installation when at
>> times there were several going on. We implemented a scheduler that
>> enforces limits on simultaneous deep-scrubs and these problems are gone.
>
> Hi Lionel,
>
> Out of curiosity, how many was "several" in your case?

I had to issue ceph osd set nodeep-scrub several times with 3 or 4
concurrent deep-scrubs to avoid processes blocked in D state on VMs, and
I could see the VM loads start rising with only 2. At the time I had
only 3 or 4 servers with 18 or 24 OSDs, on Firefly. Obviously, the more
servers and OSDs you have, the more simultaneous deep scrubs you can handle.

One PG is ~5 GB on our installation and it was probably ~4 GB at the time.
As deep scrubs must read data on all replicas, with size=3, having 3 or 4
concurrent on only 3 or 4 servers means reading anywhere between 10 and
20 GB from disks on each server (and I don't think the OSDs try to
bypass the kernel cache).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Somnath Roy
Try all these..

osd recovery max active = 1
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
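
These can go in the [osd] section of ceph.conf; most of them can also be
injected at runtime, e.g. (osd recovery threads is a thread-pool size and, as
far as I know, needs an OSD restart to change):

  ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1 --osd-recovery-op-priority 1'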

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert 
LeBlanc
Sent: Thursday, September 10, 2015 1:56 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer reduce recovery impact

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are trying to add some additional OSDs to our cluster, but the impact of the 
backfilling has been very disruptive to client I/O and we have been trying to 
figure out how to reduce the impact. We have seen some client I/O blocked for 
more than 60 seconds. There has been CPU and RAM head room on the OSD nodes, 
network has been fine, disks have been busy, but not terrible.

11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.

Clients are QEMU VMs.

[ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3)

Some nodes are 0.94.3

[ulhglive-root@ceph5 current]# ceph status
cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN
3 pgs backfill
1 pgs backfilling
4 pgs stuck unclean
recovery 2382/33044847 objects degraded (0.007%)
recovery 50872/33044847 objects misplaced (0.154%)
noscrub,nodeep-scrub flag(s) set
 monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
election epoch 180, quorum 0,1,2 mon1,mon2,mon3
 osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
128 TB used, 322 TB / 450 TB avail
2382/33044847 objects degraded (0.007%)
50872/33044847 objects misplaced (0.154%)
2300 active+clean
   3 active+remapped+wait_backfill
   1 active+remapped+backfilling
recovery io 70401 kB/s, 16 objects/s
client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s

Each pool is size 4 with min_size 2.

One problem we have is that the requirements of the cluster changed after 
setting up our pools, so our PGs are really out of whack. Our most active pool 
has only 256 PGs and each PG is about 120 GB in size.
We are trying to clear out a pool that has way too many PGs so that we can 
split the PGs in that pool. I think these large PGs are part of our issues.

Things I've tried:

* Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
latency, which was sometimes up to 3000 ms, down to a max of 500-700 ms.
It has also reduced the huge swings in latency, but has also reduced 
throughput somewhat.
* Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
process gives the recovery threads a different disk priority or if changing the 
scheduler without restarting the OSD allows the OSD to use disk priorities.
* Reduced the number of osd_max_backfills from 2 to 1.
* Tried setting noin to give the new OSDs time to get the PG map and peer 
before starting the backfill. This caused more problems than it solved as we 
had blocked I/O (over 200 seconds) until we set the new OSDs to in.
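
(Roughly, the knobs above map to commands like these; device names and OSD
targets are examples only:)

  echo 100 > /sys/block/sdb/queue/nr_requests        # per data disk
  echo cfq > /sys/block/sdb/queue/scheduler          # deadline -> CFQ
  ceph tell osd.* injectargs '--osd-max-backfills 1' # was 2
  ceph osd set noin                                  # later: ceph osd unset noin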

Even adding one OSD disk into the cluster is causing these slow I/O messages. 
We still have 5 more disks to add from this server and four more servers to add.

In addition to trying to minimize these impacts, would it be better to split 
the PGs and then add the rest of the servers, or add the servers and then do 
the PG split? I'm thinking splitting first would be better, but I'd like to get 
other opinions.

No spindle stays at high utilization for long and the await drops below 20 ms 
usually within 10 seconds so I/O should be serviced "pretty quick". My next 
guess is that the journals are getting full and blocking while waiting for 
flushes, but I'm not exactly sure how to identify that. We are using the 
defaults for the journal except for size (10G). We'd like to have large 
journals to handle bursts, but if they are getting filled with backfill 
traffic, it may be counterproductive. Can/does backfill/recovery bypass the journal?

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1 -BEGIN 
PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8e5qCRDmVDuy+mK58QAAaIwQAMN5DJlhrZkqwqsVXaKB
nnegQjG6Y02ObLRrg96ghHr+AGgY/HRm3iShng6E1N9CL+XjcHSLeb1JqH9n
2SgGQGoRAU1dY6DIlOs5K8Fwd2bBECh863VymYbO+OLgtXbpp2mWfZZVAkTf
V9ryaEh7tZOY1Mhx7mSIyr9Ur7IxTUOjzExAFPGfTLP1cbjE/FXoQMHh10fe
zSzk/qK0AvajFD0PR04uRyEsGYeCLl68kGQi1R7IQlxZWc7hMhWXKNIFlbKB
lk5+8OGx/LawW7qxpFm8a1SNoiAwMtrPKepvHYGi8u3rfXJa6ZE38jGuoqRs
8jD+b+gS0yxKbahT6S/gAEbgzAH0JF4YSz+nHNrvS6eSebykE9/7HGe9W7WA