On Mon, Dec 1, 2014 at 1:51 AM, Daniel Schneller <[email protected]> wrote:
> I could not find any way to throttle the background deletion activity
> (the command returns almost immediately).

I'm only aware of osd snap trim sleep. I haven't tried this since my Firefly upgrade, though.

I have tested osd scrub sleep under a heavy deep-scrub load, and found that I needed a value of 1.0, which is much higher than the recommended starting point of 0.005. I'll revisit this when #9487 gets backported (thanks, Dan Van Der Ster!).

I used ceph tell osd.\* injectargs and watched my IO graphs. Start with 0.005 and multiply by 10 until you see a change. It took 10-60 seconds to see a change after injecting the args.

> While this is a big issue in itself for us, we would at least try to
> estimate how long the process will take per snapshot / per pool. I
> assume the time needed is a function of the number of objects that were
> modified between two snapshots.

That matches my experience as well. "Big" snapshots take longer, and are much more likely to cause a cluster outage than "small" snapshots.

> 1) Is there any way to control how much such an operation will
> tax the cluster (we would be happy to have it run longer, if that meant
> not utilizing all disks fully during that time)?

On Firefly, osd snap trim sleep and playing with the CFQ scheduler are your only options. They're not great options. If you can upgrade to Giant, the snap trim sleep should solve your problem. There is some work being done in Hammer: https://wiki.ceph.com/Planning/Blueprints/Hammer/osd%3A_Scrub%2F%2FSnapTrim_IO_prioritization

For the time being, I'm letting my snapshots accumulate. I can't recover anything without the database backups, and those are deleted on time, so I can say with a straight face that their data is deleted. I'll collect the garbage later.

> 3) Would SSD journals help here? Or any other hardware configuration
> change for that matter?

Probably, but it's not going to fix it. I added SSD journals.
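The ramp-up I described looks roughly like this. It's only a sketch (untested against a live cluster here), and the actual ceph tell line is commented out so you can review it first; the injectargs flag syntax matches Firefly/Giant-era ceph, so verify it on your version:

```shell
# Ramp osd_snap_trim_sleep up by 10x steps, watching IO graphs between steps.
for sleep_s in 0.005 0.05 0.5 1.0; do
    echo "setting osd_snap_trim_sleep=${sleep_s}"
    # On a real cluster, uncomment the next line, then wait 10-60 seconds
    # and watch your IO graphs before moving to the next step:
    # ceph tell 'osd.*' injectargs "--osd_snap_trim_sleep ${sleep_s}"
done
```

Stop ramping at the first value where your IO graphs visibly change; going higher just slows trimming for no extra relief.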
It's better, but I still had downtime after trimming. I'm glad I added them, though. The cluster overall is much healthier and more responsive; in particular, backfilling doesn't cause massive latency anymore.

> 4) Any other recommendations? We definitely need to remove the data,
> not because of a lack of space (at least not at the moment), but because
> when customers delete stuff / cancel accounts, we are obliged to remove
> their data at least after a reasonable amount of time.

I know it's kind of snarky, but perhaps you can redefine "reasonable" until you have a chance to upgrade to Giant or Hammer?
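For completeness, the "playing with the CFQ scheduler" option I mentioned for Firefly amounts to something like the sketch below. The device name (sdb) and OSD id (0) are placeholders for your cluster, and the commands are printed rather than executed so you can review them before running as root:

```shell
# Hypothetical sketch: deprioritize an OSD's disk IO under CFQ.
DEV=sdb     # OSD data disk (assumption; adjust for your host)
OSD_ID=0    # OSD daemon id (assumption)
cat <<EOF
# Switch the OSD data disk to CFQ so per-process IO priorities apply:
echo cfq > /sys/block/${DEV}/queue/scheduler
# Then lower the ceph-osd daemon's disk priority (best-effort, lowest):
ionice -c 2 -n 7 -p \$(pgrep -f 'ceph-osd.*-i ${OSD_ID}' | head -n1)
EOF
```

Note this penalizes all of that OSD's IO, client traffic included, which is part of why it's not a great option.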
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
