Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Gregory Farnum
On Thu, Apr 20, 2017 at 2:31 AM, mj  wrote:
> Hi Gregory,
>
> Reading your reply with great interest, thanks.
>
> Can you confirm my understanding now:
>
> - live snapshots are more expensive for the cluster as a whole, than taking
> the snapshot when the VM is switched off?

No, it doesn't matter when the snapshot is taken. Once you take a
snapshot, the first subsequent write to each object in the snapshot (i.e.,
potentially every object in the RBD volume) will incur a copy.

>
> - using fstrim in VMs is (much?) more expensive when the VM has existing
> snapshots?

Hmm, I hadn't considered this one. Maybe?

>
> - it might be worthwhile to postpone upgrading from hammer to jewel, until
> after your big announcement?

"Big announcement" is a bit much — several tunables will be restored
in the next (10.2.8?) release.

> - we are on xfs (both for the ceph OSDs and the VMs) and that is the best
> combination to avoid these slow requests and CoW overhead with snapshots (or
> at least to minimise their impact)
>
> Any other tips, do's or don'ts, or things to keep in mind related to
> snapshots, VM/OSD filesystems, or using fstrim..?
>
> (our cluster is also small, hammer, three servers with 8 OSDs each, and
> journals on ssd, plenty of cpu/ram)
>
> Again, thanks for your interesting post.
>
> MJ
>


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jogi 
> Hofmüller
> Sent: 20 April 2017 13:51
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] slow requests and short OSD failures in small 
> cluster
> 
> Hi,
> 
> On Tuesday, 18.04.2017 at 18:34, Peter Maloney wrote:
> 
> > The 'slower with every snapshot even after CoW totally flattens it'
> > issue I just find easy to test, and I didn't test it on hammer or
> > earlier, and others confirmed it, but didn't keep track of the
> > versions. Just make an rbd image, map it (probably... but my tests
> > were with qemu librbd), do fio randwrite tests with sync and direct on
> > the device (no need for a fs, or anything), and then make a few snaps
> > and watch it go way slower.
> >
> > How about we make this thread a collection of versions then. And I'll
> > redo my test on Thursday maybe.
> 
> I did some tests now and provide the results and observations here:
> 
> This is the fio config file I used:
> 
> 
> [global]
> ioengine=rbd
> clientname=admin
> pool=images
> rbdname=benchmark
> invalidate=0    # mandatory
> rw=randwrite
> bs=4k
> 
> [rbd_iodepth32]
> iodepth=32
> 
> 
> Results from fio on image 'benchmark' without any snapshots:
> 
> rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
> iodepth=32
> fio-2.16
> Starting 1 process
> rbd engine: RBD version: 0.1.10
> Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3620KB/0KB /s] [0/905/0 iops] [eta 
> 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=14192: Thu Apr 20
> 13:11:27 2017
>   write: io=8192.0MB, bw=1596.2KB/s, iops=399, runt=5252799msec
> slat (usec): min=1, max=6708, avg=173.27, stdev=97.65
> clat (msec): min=9, max=14505, avg=79.97, stdev=456.86
>  lat (msec): min=9, max=14505, avg=80.15, stdev=456.86
> clat percentiles (msec):
>  |  1.00th=[   26],  5.00th=[   28], 10.00th=[   28], 20.00th=[   30],
>  | 30.00th=[   31], 40.00th=[   32], 50.00th=[   33], 60.00th=[   35],
>  | 70.00th=[   37], 80.00th=[   39], 90.00th=[   43], 95.00th=[   47],
>  | 99.00th=[ 1516], 99.50th=[ 3621], 99.90th=[ 7046], 99.95th=[ 8094],
>  | 99.99th=[10159]
> lat (msec) : 10=0.01%, 20=0.29%, 50=96.17%, 100=1.49%, 250=0.31%
> lat (msec) : 500=0.21%, 750=0.15%, 1000=0.14%, 2000=0.38%,
> >=2000=0.85%
>   cpu  : usr=31.95%, sys=58.32%, ctx=5392823, majf=0, minf=0
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
>  issued: total=r=0/w=2097152/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs):
>   WRITE: io=8192.0MB, aggrb=1596KB/s, minb=1596KB/s, maxb=1596KB/s, 
> mint=5252799msec, maxt=5252799msec
> 
> Disk stats (read/write):
>   vdb: ios=6/20, merge=0/29, ticks=76/12168, in_queue=12244, util=0.23% sudo 
> fio rbd.fio  2023.87s user 3216.33s system 99% cpu
> 1:27:31.92 total
> 
> Now I created three snapshots of image 'benchmark'. The cluster became
> unresponsive (slow requests started to appear); a new run of fio
> never got past 0.0%.
> 
> Removed all three snapshots. The cluster became responsive again, and fio
> started to work like before (left it running during snapshot
> removal).
> 
> Created one snapshot of 'benchmark' while fio was running. The cluster
> became unresponsive after a few minutes; fio got nothing done as
> soon as the snapshot was made.
> 
> Stopped here ;)

You are generating a write amplification of roughly 2000x: every 4kB write IO
will generate a 4MB read and a 4MB write. If your cluster can't handle that IO
then you will see extremely poor performance. Is your real-life workload
actually doing random 4kB writes at qd=32? If it is, you will either want to
use RBDs made up of smaller objects to try and lessen the overhead, or
probably forget about using snapshots, unless there is some sort of sparse
bitmap based COW feature on the horizon???
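
To make the smaller-object suggestion concrete, here is a minimal sketch
(pool and image names are made up, and the arithmetic is only illustrative;
bear in mind that smaller objects also mean more objects and more metadata
per image):

# Create an image from 1MB objects instead of the default 4MB
# (--order 20 means 2^20 byte = 1MB objects; the default order 22 = 4MB)
rbd create images/benchmark-small --size 8192 --order 20
rbd info images/benchmark-small

# Rough cost of the first 4kB write to an object after a snapshot:
#   4MB objects: (4MB read + 4MB write) / 4kB ~= 2048x amplification
#   1MB objects: (1MB read + 1MB write) / 4kB ~=  512x amplification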

> 
> Regards,
> --
> J.Hofmüller
> 
>mur.sat -- a space art project
>http://sat.mur.at/



Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Jogi Hofmüller
Hi,

On Tuesday, 18.04.2017 at 18:34, Peter Maloney wrote:

> The 'slower with every snapshot even after CoW totally flattens it'
> issue I just find easy to test, and I didn't test it on hammer or
> earlier, and others confirmed it, but didn't keep track of the
> versions. Just make an rbd image, map it (probably... but my tests
> were with qemu librbd), do fio randwrite tests with sync and direct
> on the device (no need for a fs, or anything), and then make a few
> snaps and watch it go way slower. 
> 
> How about we make this thread a collection of versions then. And I'll
> redo my test on Thursday maybe.

I did some tests now and provide the results and observations here:

This is the fio config file I used:


[global]
ioengine=rbd
clientname=admin
pool=images
rbdname=benchmark
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32


Results from fio on image 'benchmark' without any snapshots:

rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.16
Starting 1 process
rbd engine: RBD version: 0.1.10
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/3620KB/0KB /s] [0/905/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=14192: Thu Apr 20 13:11:27 2017
  write: io=8192.0MB, bw=1596.2KB/s, iops=399, runt=5252799msec
    slat (usec): min=1, max=6708, avg=173.27, stdev=97.65
    clat (msec): min=9, max=14505, avg=79.97, stdev=456.86
     lat (msec): min=9, max=14505, avg=80.15, stdev=456.86
    clat percentiles (msec):
     |  1.00th=[   26],  5.00th=[   28], 10.00th=[   28], 20.00th=[   30],
     | 30.00th=[   31], 40.00th=[   32], 50.00th=[   33], 60.00th=[   35],
     | 70.00th=[   37], 80.00th=[   39], 90.00th=[   43], 95.00th=[   47],
     | 99.00th=[ 1516], 99.50th=[ 3621], 99.90th=[ 7046], 99.95th=[ 8094],
     | 99.99th=[10159]
    lat (msec) : 10=0.01%, 20=0.29%, 50=96.17%, 100=1.49%, 250=0.31%
    lat (msec) : 500=0.21%, 750=0.15%, 1000=0.14%, 2000=0.38%, >=2000=0.85%
  cpu          : usr=31.95%, sys=58.32%, ctx=5392823, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=2097152/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=8192.0MB, aggrb=1596KB/s, minb=1596KB/s, maxb=1596KB/s, mint=5252799msec, maxt=5252799msec

Disk stats (read/write):
  vdb: ios=6/20, merge=0/29, ticks=76/12168, in_queue=12244, util=0.23%
sudo fio rbd.fio  2023.87s user 3216.33s system 99% cpu 1:27:31.92 total

Now I created three snapshots of image 'benchmark'. The cluster became
unresponsive (slow requests started to appear); a new run of fio never
got past 0.0%.

Removed all three snapshots. The cluster became responsive again, and fio
started to work like before (left it running during snapshot removal).

Created one snapshot of 'benchmark' while fio was running. The cluster
became unresponsive after a few minutes; fio got nothing done as soon as
the snapshot was made.

Stopped here ;)
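
For anyone who wants to reproduce this: the snapshot commands are not shown
above, so here is a sketch of what was presumably run (the pool and image
names come from the fio job file, the snapshot names are made up):

# Baseline run against the job file above
sudo fio rbd.fio

# Create three snapshots of the benchmark image, then re-run fio and compare
rbd snap create images/benchmark@snap1
rbd snap create images/benchmark@snap2
rbd snap create images/benchmark@snap3
sudo fio rbd.fio

# Remove all snapshots again afterwards
rbd snap purge images/benchmark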

Regards,
-- 
J.Hofmüller

   mur.sat -- a space art project
   http://sat.mur.at/




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread mj

Hi Gregory,

Reading your reply with great interest, thanks.

Can you confirm my understanding now:

- live snapshots are more expensive for the cluster as a whole, than 
taking the snapshot when the VM is switched off?


- using fstrim in VMs is (much?) more expensive when the VM has existing 
snapshots?


- it might be worthwhile to postpone upgrading from hammer to jewel, 
until after your big announcement?


- we are on xfs (both for the ceph OSDs and the VMs) and that is the 
best combination to avoid these slow requests and CoW overhead with 
snapshots (or at least to minimise their impact)


Any other tips, do's or don'ts, or things to keep in mind related to 
snapshots, VM/OSD filesystems, or using fstrim..?


(our cluster is also small, hammer, three servers with 8 OSDs each, and 
journals on ssd, plenty of cpu/ram)


Again, thanks for your interesting post.

MJ


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-19 Thread Gregory Farnum
On Tue, Apr 18, 2017 at 11:34 AM, Peter Maloney
 wrote:
> On 04/18/17 11:44, Jogi Hofmüller wrote:
>
> Hi,
>
> On Tuesday, 18.04.2017 at 13:02 +0200, mj wrote:
>
> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
>
> This might have been true for hammer and older versions of ceph. From
> what I can tell now, every snapshot taken reduces performance of the
> entire cluster :(
>
> Really? Can others confirm this? Is this a 'well-known fact'?
> (unknown only to us, perhaps...)
>
> I have to add some more/new details now. We started removing snapshots
> for VMs today. We did this VM by VM and waited some time in between
> while monitoring the cluster.
>
> After having removed all snapshots for the third VM the cluster went
> back to a 'normal' state again: no more slow requests. i/o waits for
> VMs are down to acceptable numbers again (<10% peaks, <5% average).
>
> So, either there is one VM/image that irritates the entire cluster or
> we reached some kind of threshold or it's something completely
> different.
>
> As for the well known fact: Peter Maloney pointed that out in this
> thread (mail from last Thursday).
>
> The well known fact part was CoW which I guess is for all versions.
>
> The 'slower with every snapshot even after CoW totally flattens it' issue I
> just find easy to test, and I didn't test it on hammer or earlier, and
> others confirmed it, but didn't keep track of the versions. Just make an rbd
> image, map it (probably... but my tests were with qemu librbd), do fio
> randwrite tests with sync and direct on the device (no need for a fs, or
> anything), and then make a few snaps and watch it go way slower.

I'm not sure this is a correct diagnosis or assessment.

In general, snapshots incur costs in two places:
1) the first write to an object after it is logically snapshotted,
2) when removing snapshots.

There should be no long-term performance degradation, especially on
XFS — the filestore creates a whole new copy of an object the first time
it changes after a snapshot. (btrfs and bluestore use block-based CoW, so
they can suffer from fragmentation if things go too badly.)
However, the costs of snapshot trimming (especially in Jewel) have
been much discussed recently. (I'll have some announcements about
improvements there soon!) So if you've got live trims happening, yes,
there's an incremental load on the cluster.

Similarly, the first write after creating a snapshot requires copying
each snapshotted object into a new location and then applying the write.
Generally, that should amortize into nothingness, but it sounds like in
this case you were basically doing a single IO per object for every
snapshot you created — which, yes, would be impressively slow overall.

The reports I've seen of slow snapshots have been one of the two above
issues. Sometimes it's compounded by people not having enough
incremental IOPS available to support their client workload while
doing snapshots, but that doesn't mean snapshots are inherently
expensive or inefficient[1], just that they do have a non-zero cost
which your cluster needs to be able to provide.
-Greg

[1]: Although, yes, snap trimming is more expensive than in many
similar systems. There are reasons for that which I discussed at Vault
and will present on again at the upcoming OpenStack Boston Ceph day.
:)


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Peter Maloney
On 04/18/17 11:44, Jogi Hofmüller wrote:
> Hi,
>
> On Tuesday, 18.04.2017 at 13:02 +0200, mj wrote:
>> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
>>> This might have been true for hammer and older versions of ceph. From
>>> what I can tell now, every snapshot taken reduces performance of the
>>> entire cluster :(
>> Really? Can others confirm this? Is this a 'well-known fact'?
>> (unknown only to us, perhaps...)
> I have to add some more/new details now. We started removing snapshots
> for VMs today. We did this VM by VM and waited some time in between
> while monitoring the cluster.
>
> After having removed all snapshots for the third VM the cluster went
> back to a 'normal' state again: no more slow requests. i/o waits for
> VMs are down to acceptable numbers again (<10% peaks, <5% average).
>
> So, either there is one VM/image that irritates the entire cluster or
> we reached some kind of threshold or it's something completely
> different.
>
> As for the well known fact: Peter Maloney pointed that out in this
> thread (mail from last Thursday).
The well known fact part was CoW which I guess is for all versions.

The 'slower with every snapshot even after CoW totally flattens it'
issue I just find easy to test. I didn't test it on hammer or earlier,
and others confirmed it but didn't keep track of the versions.
Just make an rbd image, map it (probably... but my tests were with qemu
librbd), do fio randwrite tests with sync and direct on the device (no
need for a fs or anything), and then make a few snaps and watch it go
way slower.
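
A minimal sketch of such a test, assuming the kernel rbd client and a pool
named 'rbd' (image name, snapshot names and sizes are made up; on Jewel-era
defaults with an old kernel the image may need its features restricted
before krbd can map it):

# Create and map a small test image
rbd create rbd/snaptest --size 10240
DEV=$(sudo rbd map rbd/snaptest)

# Baseline: 4k random writes, sync+direct, straight to the block device
sudo fio --name=baseline --filename="$DEV" --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --direct=1 --sync=1 \
    --runtime=60 --time_based

# Take a few snapshots, then repeat the identical run and compare
for i in 1 2 3; do rbd snap create rbd/snaptest@s$i; done
sudo fio --name=after-snaps --filename="$DEV" --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --direct=1 --sync=1 \
    --runtime=60 --time_based

# Clean up
sudo rbd unmap "$DEV"
rbd snap purge rbd/snaptest && rbd rm rbd/snaptest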

How about we make this thread a collection of versions then. And I'll
redo my test on Thursday maybe.
> Regards,
>
>




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Jogi Hofmüller
Hi,

On Tuesday, 18.04.2017 at 13:02 +0200, mj wrote:
> 
> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:
> > This might have been true for hammer and older versions of ceph.
> > From
> > what I can tell now, every snapshot taken reduces performance of
> > the
> > entire cluster :(
> 
> Really? Can others confirm this? Is this a 'well-known fact'?
> (unknown only to us, perhaps...)

I have to add some more/new details now. We started removing snapshots
for VMs today. We did this VM by VM and waited some time in between
while monitoring the cluster.

After having removed all snapshots for the third VM the cluster went
back to a 'normal' state again: no more slow requests. i/o waits for
VMs are down to acceptable numbers again (<10% peaks, <5% average).

So, either there is one VM/image that irritates the entire cluster or
we reached some kind of threshold or it's something completely
different.

As for the well known fact: Peter Maloney pointed that out in this
thread (mail from last Thursday).

Regards,
-- 
J.Hofmüller

   http://thesix.mur.at/




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Lionel Bouton
On 18/04/2017 at 11:24, Jogi Hofmüller wrote:
> Hi,
>
> thanks for all your comments so far.
>
> On Thursday, 13.04.2017 at 16:53 +0200, Lionel Bouton wrote:
>> Hi,
>>
>> On 13/04/2017 at 10:51, Peter Maloney wrote:
>>> Ceph snapshots really slow things down.
> I can confirm that now :(
>
>> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
>> measurable impact on performance... until we tried to remove them. We
>> usually have at least one snapshot per VM image, often 3 or 4.
> This might have been true for hammer and older versions of ceph. From
> what I can tell now, every snapshot taken reduces performance of the
> entire cluster :(

The version isn't the only difference here. We use BTRFS with a custom
defragmentation process for the filestores, which is highly uncommon among
Ceph users. As I said, Ceph has support for BTRFS CoW, so part of the
snapshot handling process is actually handled by BTRFS.

Lionel


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread mj



On 04/18/2017 11:24 AM, Jogi Hofmüller wrote:

This might have been true for hammer and older versions of ceph. From
what I can tell now, every snapshot taken reduces performance of the
entire cluster :(


Really? Can others confirm this? Is this a 'well-known fact'?
(unknown only to us, perhaps...)

We are still on hammer, but if the result of upgrading to jewel is 
actually a massive performance decrease, I might postpone as long as 
possible...


Most of our VMs have a snapshot or two...

MJ


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Jogi Hofmüller
Hi,

thanks for all your comments so far.

On Thursday, 13.04.2017 at 16:53 +0200, Lionel Bouton wrote:
> Hi,
> 
> On 13/04/2017 at 10:51, Peter Maloney wrote:
> > Ceph snapshots really slow things down.

I can confirm that now :(

> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
> measurable impact on performance... until we tried to remove them. We
> usually have at least one snapshot per VM image, often 3 or 4.

This might have been true for hammer and older versions of ceph. From
what I can tell now, every snapshot taken reduces performance of the
entire cluster :(

So it looks like we were too naive in thinking that snapshots of VMs
done in ceph could be a viable backup solution. Which brings me to the
question, what are others doing for VM backup?

Regards,
-- 
J.Hofmüller

   http://thesix.mur.at/




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-14 Thread mj

ah right: _during_ the actual removal, you mean. :-)

clear now.

mj

On 04/13/2017 05:50 PM, Lionel Bouton wrote:

On 13/04/2017 at 17:47, mj wrote:

Hi,

On 04/13/2017 04:53 PM, Lionel Bouton wrote:

We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
measurable impact on performance... until we tried to remove them.


What exactly do you mean with that?


Just what I said: having snapshots doesn't impact performance, only
removing them (obviously until Ceph is finished cleaning up).

Lionel




Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Lionel Bouton
On 13/04/2017 at 17:47, mj wrote:
> Hi,
>
> On 04/13/2017 04:53 PM, Lionel Bouton wrote:
>> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
>> measurable impact on performance... until we tried to remove them.
>
> What exactly do you mean with that?

Just what I said: having snapshots doesn't impact performance, only
removing them (obviously until Ceph is finished cleaning up).

Lionel


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread mj

Hi,

On 04/13/2017 04:53 PM, Lionel Bouton wrote:

We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
measurable impact on performance... until we tried to remove them.


What exactly do you mean with that?

MJ


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread David Turner
I wouldn't set the default for osd_heartbeat_grace to 5 minutes, but inject
it when you see this happening.  It's good to know what your cluster is
up to.  The fact that you aren't seeing the blocked requests any more tells
me that this was your issue.  It will go through, split everything, run fine
for a while, and then do it again months from now.
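
For reference, a sketch of that injection (the 300s value mirrors the five
minutes suggested earlier, and the default is 20s; depending on the version
the monitors evaluate failure reports with the same option, so it may need
to be injected there as well):

# Raise the grace on all OSDs while subfolder splitting is going on
ceph tell osd.* injectargs '--osd_heartbeat_grace 300'
ceph tell mon.* injectargs '--osd_heartbeat_grace 300'

# ...and drop it back to the default once things have settled
ceph tell osd.* injectargs '--osd_heartbeat_grace 20'
ceph tell mon.* injectargs '--osd_heartbeat_grace 20'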

On Thu, Apr 13, 2017 at 4:43 AM Jogi Hofmüller  wrote:

> Dear David,
>
> On Wednesday, 12.04.2017 at 13:46, David Turner wrote:
> > I can almost guarantee what you're seeing is PG subfolder splitting.
>
> Every day there's something new to learn about ceph ;)
>
> > When the subfolders in a PG get X number of objects, it splits into
> > 16 subfolders.  Every cluster I manage has blocked requests and OSDs
> > that get marked down while this is happening.  To stop the OSDs
> > getting marked down, I increase the osd_heartbeat_grace until the
> > OSDs no longer mark themselves down during this process.
>
> Thanks for the hint. I adjusted the values accordingly and will monitor
> our cluster. This morning there were no troubles at all btw. Still
> wondering what caused yesterday's mayhem ...
>
> Regards,
> --
> J.Hofmüller
>
>Nisiti
>- Abie Nathan, 1927-2008
>


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Lionel Bouton
Hi,

On 13/04/2017 at 10:51, Peter Maloney wrote:
> [...]
> Also more things to consider...
>
> Ceph snapshots really slow things down.

We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
measurable impact on performance... until we tried to remove them. We
usually have at least one snapshot per VM image, often 3 or 4.
Note that we use BTRFS filestores where IIRC the CoW is handled by the
filesystem so it might be faster compared to the default/recommended XFS
filestores.

>  They aren't efficient like on
> zfs and btrfs. Having one might take away some % performance, and having
> 2 snaps takes potentially double, etc. until it is crawling. And it's
> not just the CoW... even just rbd snap rm, rbd diff, etc. starts to take
> many times longer. See http://tracker.ceph.com/issues/10823 for
> explanation of CoW. My goal is just to keep max 1 long term snapshot.[...]

In my experience with BTRFS filestores, snap rm impact is proportional
to the amount of data specific to the snapshot being removed (ie: not
present on any other snapshot) but completely unrelated to the number of
existing snapshots. For example the first one removed can be handled
very fast, and it can be the last one removed that takes the most time
and impacts performance the most.

Best regards,

Lionel


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Peter Maloney
On 04/13/17 10:34, Jogi Hofmüller wrote:
> Dear David,
>
> On Wednesday, 12.04.2017 at 13:46, David Turner wrote:
>> I can almost guarantee what you're seeing is PG subfolder splitting.
> Every day there's something new to learn about ceph ;)
>
>> When the subfolders in a PG get X number of objects, it splits into
>> 16 subfolders.  Every cluster I manage has blocked requests and OSDs
>> that get marked down while this is happening.  To stop the OSDs
>> getting marked down, I increase the osd_heartbeat_grace until the
>> OSDs no longer mark themselves down during this process.
> Thanks for the hint. I adjusted the values accordingly and will monitor
> our cluster. This morning there were no troubles at all btw. Still
> wondering what caused yesterday's mayhem ...
>
> Regards,
Also more things to consider...

Ceph snapshots really slow things down. They aren't efficient like on
zfs and btrfs. Having one might take away some % performance, and having
2 snaps takes potentially double, etc. until it is crawling. And it's
not just the CoW... even just rbd snap rm, rbd diff, etc. starts to take
many times longer. See http://tracker.ceph.com/issues/10823 for
explanation of CoW. My goal is just to keep max 1 long term snapshot.

Also there's snap trimming, which I found to be far worse than directory
splitting. The settings I have for this and splitting are:
osd_pg_max_concurrent_snap_trims=1
osd_snap_trim_sleep=0
filestore_split_multiple=8

osd_snap_trim_sleep is bugged, holding a lock while sleeping, so make
sure it's 0.
filestore_split_multiple makes it split less often, I think... not sure
how much this helps, but I subjectively think it improves it.
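
For completeness, a sketch of applying those values to a running cluster
(the snap trim options can be injected; filestore_split_multiple should
also be persisted in ceph.conf under [osd] and may only take full effect
after an OSD restart):

# Inject the snap trim settings into all running OSDs
ceph tell osd.* injectargs \
    '--osd_pg_max_concurrent_snap_trims 1 --osd_snap_trim_sleep 0'

# filestore_split_multiple belongs in ceph.conf under [osd]:
#   filestore_split_multiple = 8
# and is picked up by each OSD when it (re)starts.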

And I find that bcache makes little metadata operations like that (and
xattrs, leveldb, the xfs journal, etc.) much less of a load on the OSD disk.

I have not changed any timeouts and don't get any OSDs marked down. But
also I didn't before I tried bcache and other settings. I just got
blocked requests (and still do, but less), and hanging librbd client VMs
(disabling exclusive-lock fixes it).
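
For the exclusive-lock part, a hedged example for a Jewel-era image
(pool/image names are made up; features that depend on exclusive-lock have
to be disabled first, and any feature that is not enabled on the image can
simply be skipped):

# See which features the image currently has
rbd info rbd/vm-disk-1 | grep features

# Disable the dependants first, then exclusive-lock itself
rbd feature disable rbd/vm-disk-1 fast-diff
rbd feature disable rbd/vm-disk-1 object-map
rbd feature disable rbd/vm-disk-1 exclusive-lock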


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Jogi Hofmüller
Dear David,

On Wednesday, 12.04.2017 at 13:46, David Turner wrote:
> I can almost guarantee what you're seeing is PG subfolder splitting. 

Every day there's something new to learn about ceph ;)

> When the subfolders in a PG get X number of objects, it splits into
> 16 subfolders.  Every cluster I manage has blocked requests and OSDs
> that get marked down while this is happening.  To stop the OSDs
> getting marked down, I increase the osd_heartbeat_grace until the
> OSDs no longer mark themselves down during this process.

Thanks for the hint. I adjusted the values accordingly and will monitor
our cluster. This morning there were no troubles at all btw. Still
wondering what caused yesterday's mayhem ...

Regards,
-- 
J.Hofmüller

   Nisiti
   - Abie Nathan, 1927-2008





Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread David Turner
I can almost guarantee what you're seeing is PG subfolder splitting.  When
the subfolders in a PG get X number of objects, it splits into 16
subfolders.  Every cluster I manage has blocked requests and OSDs that get
marked down while this is happening.  To stop the OSDs getting marked down,
I increase the osd_heartbeat_grace until the OSDs no longer mark themselves
down during this process.  Based on your email, it looks like starting at 5
minutes would be a good place.  The blocked requests will still persist,
but the OSDs aren't being marked down regularly and adding peering to the
headache.

In 10.2.5 and 0.94.9, there was a way to take an OSD offline and tell it to
split the subfolders of its PGs.  I haven't done this yet, myself, but plan
to figure it out the next time I come across this sort of behavior.

On Wed, Apr 12, 2017 at 8:55 AM Jogi Hofmüller  wrote:

> Dear all,
>
> we run a small cluster [1] that is exclusively used for virtualisation
> (kvm/libvirt). Recently we started to run into performance problems
> (slow requests, failing OSDs) for no *obvious* reason (at least not for
> us).
>
> We do nightly snapshots of VM images and keep the snapshots for 14
> days. Currently we run 8 VMs in the cluster.
>
> At first it looked like the problem was related to snapshotting images
> of VMs that were up and running (respectively deleting the snapshots
> after 14 days). So we changed the procedure to first suspend the VM and
> the snapshot its image(s). Snapshots are made at 4 am.
>
> When we removed *all* the old snapshots (the ones done of running VMs)
> the cluster suddenly behaved 'normal' again, but after two days of
> creating snapshots (not deleting any) of suspended VMs, the slow
> requests started again (although by far not as frequent as before).
>
> This morning we experienced subsequent failures (e.g. osd.2
> IPv4:6800/1621 failed (2 reporters from different host after 49.976472
> >= grace 46.444312)) of 4 of our 6 OSDs, resulting in HEALTH_WARN with
> up to about 20% of PGs active+undersized+degraded or stale+active+clean
> or remapped+peering. No OSD failure lasted longer than 4 minutes. After
> 15 minutes everything was back to normal again. The noise started at
> 6:25 am, a time when cron.daily scripts run here.
>
> We have no clue what could have caused this behavior :( There seems to
> be no shortage of resources (CPU, RAM, network) that would explain what
> happened, but maybe we did not look in the right places. So any hint on
> where to look/what to look for would be greatly appreciated :)
>
> [1]  cluster setup
>
> Three nodes: ceph1, ceph2, ceph3
>
> ceph1 and ceph2
>
> 1x Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
> 32 GB RAM
> RAID1 for OS
> 1x Intel 530 Series SSDs (120GB) for Journals
> 3x WDC WD2500BUCT-63TWBY0 for OSDs (1TB)
> 2x Gbit Ethernet bonded (802.3ad) on HP 2920 Stack
>
> ceph3
>
> virtual machine
> 1 CPU
> 4 GB RAM
>
> Software
>
> Debian GNU/Linux Jessie (8.7)
> Kernel 3.16
> ceph 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)
>
> Ceph Services
>
> 3 Monitors: ceph1, ceph2, ceph3
>
> 6 OSDs: ceph1 (3), ceph2 (3)
>
> Regards,
> --
> J.Hofmüller
>
>Nisiti
>- Abie Nathan, 1927-2008
>


[ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread Jogi Hofmüller
Dear all,

we run a small cluster [1] that is exclusively used for virtualisation
(kvm/libvirt). Recently we started to run into performance problems
(slow requests, failing OSDs) for no *obvious* reason (at least not for
us).

We do nightly snapshots of VM images and keep the snapshots for 14
days. Currently we run 8 VMs in the cluster.

At first it looked like the problem was related to snapshotting images
of VMs that were up and running (respectively deleting the snapshots
after 14 days). So we changed the procedure to first suspend the VM and
then snapshot its image(s). Snapshots are made at 4 am.

When we removed *all* the old snapshots (the ones done of running VMs)
the cluster suddenly behaved 'normal' again, but after two days of
creating snapshots (not deleting any) of suspended VMs, the slow
requests started again (although by far not as frequent as before).

This morning we experienced subsequent failures (e.g. osd.2
IPv4:6800/1621 failed (2 reporters from different host after 49.976472
>= grace 46.444312)) of 4 of our 6 OSDs, resulting in HEALTH_WARN with
up to about 20% of PGs active+undersized+degraded or stale+active+clean
or remapped+peering. No OSD failure lasted longer than 4 minutes. After
15 minutes everything was back to normal again. The noise started at
6:25 am, a time when cron.daily scripts run here.

We have no clue what could have caused this behavior :( There seems to
be no shortage of resources (CPU, RAM, network) that would explain what
happened, but maybe we did not look in the right places. So any hint on
where to look/what to look for would be greatly appreciated :)

[1]  cluster setup

Three nodes: ceph1, ceph2, ceph3

ceph1 and ceph2

1x Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz
32 GB RAM
RAID1 for OS
1x Intel 530 Series SSDs (120GB) for Journals
3x WDC WD2500BUCT-63TWBY0 for OSDs (1TB)
2x Gbit Ethernet bonded (802.3ad) on HP 2920 Stack 

ceph3

virtual machine
1 CPU
4 GB RAM 

Software

Debian GNU/Linux Jessie (8.7)
Kernel 3.16
ceph 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f) 

Ceph Services

3 Monitors: ceph1, ceph2, ceph3

6 OSDs: ceph1 (3), ceph2 (3) 

Regards,
-- 
J.Hofmüller

   Nisiti
   - Abie Nathan, 1927-2008


