Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-27 Thread David Turner
I had the exact same error when using --bypass-gc.  We too decided to
destroy this realm and start it fresh.  For us, 95% of the data in this
realm is backups for other systems and they're fine rebuilding it.  So our
plan is to migrate the 5% of the data to a temporary s3 location and then
rebuild this realm with brand-new pools, a fresh GC, and new settings. I
can offer this realm up for testing to help figure out options.  It's
running Jewel 10.2.7.

On Fri, Oct 27, 2017 at 11:26 AM Bryan Stillwell <bstillw...@godaddy.com>
wrote:

> On Wed, Oct 25, 2017 at 4:02 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com>
> wrote:
> >
> > On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell <bstillw...@godaddy.com>
> > wrote:
> > > That helps a little bit, but overall the process would take years at
> > > this rate:
> > >
> > > # for i in {1..3600}; do ceph df -f json-pretty |grep -A7
> > > '".rgw.buckets"' |grep objects; sleep 60; done
> > >  "objects": 1660775838
> > >  "objects": 1660775733
> > >  "objects": 1660775548
> > >  "objects": 1660774825
> > >  "objects": 1660774790
> > >  "objects": 1660774735
> > >
> > > This is on a hammer cluster.  Would upgrading to Jewel or Luminous
> > > speed up this process at all?
> >
> > I'm not sure it's going to help much, although the omap performance
> > might improve there. The big problem is that the omaps are just too
> > big, so that every operation on them takes considerable time. I think
> > the best way forward there is to take a list of all the rados objects
> > that need to be removed from the gc omaps, and then get rid of the gc
> > objects themselves (newer ones will be created, this time using the
> > new configurable). Then remove the objects manually (and concurrently)
> > using the rados command line tool.
> > The one problem I see here is that even just removal of objects with
> > large omaps can affect the availability of the osds that hold these
> > objects. I discussed that now with Josh, and we think the best way to
> > deal with that is not to remove the gc objects immediately, but to
> > rename the gc pool, and create a new one (with appropriate number of
> > pgs). This way new gc entries will now go into the new gc pool (with
> > higher number of gc shards), and you don't need to remove the old gc
> > objects (thus no osd availability problem). Then you can start
> > trimming the old gc objects (on the old renamed pool) by using the
> > rados command. It'll take a very very long time, but the process
> > should pick up speed slowly, as the objects shrink.
>
> That's fine for us.  We'll be tearing down this cluster in a few weeks
> and adding the nodes to the new cluster we created.  I just wanted to
> explore other options now that we can use it as a test cluster.
>
> The solution you described with renaming the .rgw.gc pool and creating a
> new one is pretty interesting.  I'll have to give that a try, but until
> then I've been trying to remove some of the other buckets with the
> --bypass-gc option and it keeps dying with output like this:
>
> # radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
> 2017-10-27 08:00:00.865993 7f2b387228c0  0 RGWObjManifest::operator++():
> result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
> 2017-10-27 08:00:04.385875 7f2b387228c0  0 RGWObjManifest::operator++():
> result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
> 2017-10-27 08:00:04.517241 7f2b387228c0  0 RGWObjManifest::operator++():
> result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
> 2017-10-27 08:00:05.791876 7f2b387228c0  0 RGWObjManifest::operator++():
> result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
> 2017-10-27 08:00:26.815081 7f2b387228c0  0 RGWObjManifest::operator++():
> result: ofs=1090645 stripe_ofs=1090645 part_ofs=0 rule->part_size=0
> 2017-10-27 08:00:46.757556 7f2b387228c0  0 RGWObjManifest::operator++():
> result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
> 2017-10-27 08:00:47.093813 7f2b387228c0 -1 ERROR: could not drain handles
> as aio completion returned with -2
>
>
> I can typically make further progress by running it again:
>
> # radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
> 2017-10-27 08:20:57.310859 7fae9c3d48c0  0 RGWObjManifest::operator++():
> result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
> 2017-10-27 08:20:57.406684 7fae9c3d48c0  0 RGWObjManifest::operator++():
> result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
> 2017-10-27 08:20:57.808050 7fae9c3d48c0 -1 ERROR: could not drain handles
> as aio completion returned with -2
>
>
> and again:
>
> # radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
> 2017-10-27 08:22:04.992578 7ff8071038c0  0 RGWObjManifest::operator++():
> result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
> 2017-10-27 08:22:05.726485 7ff8071038c0 -1 ERROR: could not drain handles
> as aio completion returned with -2
>
>
> What does this error mean, and is there any way to keep it from dying
> like this?  This cluster is running 0.94.10, but I can upgrade it to Jewel
> pretty easily if you would like.
>
> Thanks,
> Bryan

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-27 Thread Bryan Stillwell
On Wed, Oct 25, 2017 at 4:02 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com>
wrote:
>
> On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell <bstillw...@godaddy.com>
> wrote:
> > That helps a little bit, but overall the process would take years at this
> > rate:
> >
> > # for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"' 
> > |grep objects; sleep 60; done
> >  "objects": 1660775838
> >  "objects": 1660775733
> >  "objects": 1660775548
> >  "objects": 1660774825
> >  "objects": 1660774790
> >  "objects": 1660774735
> >
> > This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up
> > this process at all?
>
> I'm not sure it's going to help much, although the omap performance
> might improve there. The big problem is that the omaps are just too
> big, so that every operation on them takes considerable time. I think
> the best way forward there is to take a list of all the rados objects
> that need to be removed from the gc omaps, and then get rid of the gc
> objects themselves (newer ones will be created, this time using the
> new configurable). Then remove the objects manually (and concurrently)
> using the rados command line tool.
> The one problem I see here is that even just removal of objects with
> large omaps can affect the availability of the osds that hold these
> objects. I discussed that now with Josh, and we think the best way to
> deal with that is not to remove the gc objects immediately, but to
> rename the gc pool, and create a new one (with appropriate number of
> pgs). This way new gc entries will now go into the new gc pool (with
> higher number of gc shards), and you don't need to remove the old gc
> objects (thus no osd availability problem). Then you can start
> trimming the old gc objects (on the old renamed pool) by using the
> rados command. It'll take a very very long time, but the process
> should pick up speed slowly, as the objects shrink.

That's fine for us.  We'll be tearing down this cluster in a few weeks
and adding the nodes to the new cluster we created.  I just wanted to
explore other options now that we can use it as a test cluster.

The solution you described with renaming the .rgw.gc pool and creating a
new one is pretty interesting.  I'll have to give that a try, but until
then I've been trying to remove some of the other buckets with the
--bypass-gc option and it keeps dying with output like this:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:00:00.865993 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
2017-10-27 08:00:04.385875 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
2017-10-27 08:00:04.517241 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
2017-10-27 08:00:05.791876 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
2017-10-27 08:00:26.815081 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1090645 stripe_ofs=1090645 part_ofs=0 rule->part_size=0
2017-10-27 08:00:46.757556 7f2b387228c0  0 RGWObjManifest::operator++(): 
result: ofs=1488744 stripe_ofs=1488744 part_ofs=0 rule->part_size=0
2017-10-27 08:00:47.093813 7f2b387228c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


I can typically make further progress by running it again:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:20:57.310859 7fae9c3d48c0  0 RGWObjManifest::operator++(): 
result: ofs=673900 stripe_ofs=673900 part_ofs=0 rule->part_size=0
2017-10-27 08:20:57.406684 7fae9c3d48c0  0 RGWObjManifest::operator++(): 
result: ofs=1179224 stripe_ofs=1179224 part_ofs=0 rule->part_size=0
2017-10-27 08:20:57.808050 7fae9c3d48c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


and again:

# radosgw-admin bucket rm --bucket=sg2pl5000 --purge-objects --bypass-gc
2017-10-27 08:22:04.992578 7ff8071038c0  0 RGWObjManifest::operator++(): 
result: ofs=566620 stripe_ofs=566620 part_ofs=0 rule->part_size=0
2017-10-27 08:22:05.726485 7ff8071038c0 -1 ERROR: could not drain handles as 
aio completion returned with -2


What does this error mean, and is there any way to keep it from dying
like this?  This cluster is running 0.94.10, but I can upgrade it to Jewel
pretty easily if you would like.

Thanks,
Bryan




Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Yehuda Sadeh-Weinraub
On Wed, Oct 25, 2017 at 2:32 PM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
> That helps a little bit, but overall the process would take years at this
> rate:
>
>
>
> # for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"'
> |grep objects; sleep 60; done
>
> "objects": 1660775838
>
> "objects": 1660775733
>
> "objects": 1660775548
>
> "objects": 1660774825
>
> "objects": 1660774790
>
> "objects": 1660774735
>
>
>
> This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up
> this process at all?
>

I'm not sure it's going to help much, although the omap performance
might improve there. The big problem is that the omaps are just too
big, so that every operation on them takes considerable time. I think
the best way forward there is to take a list of all the rados objects
that need to be removed from the gc omaps, and then get rid of the gc
objects themselves (newer ones will be created, this time using the
new configurable). Then remove the objects manually (and concurrently)
using the rados command line tool.
The one problem I see here is that even just removal of objects with
large omaps can affect the availability of the osds that hold these
objects. I discussed that now with Josh, and we think the best way to
deal with that is not to remove the gc objects immediately, but to
rename the gc pool, and create a new one (with appropriate number of
pgs). This way new gc entries will now go into the new gc pool (with
higher number of gc shards), and you don't need to remove the old gc
objects (thus no osd availability problem). Then you can start
trimming the old gc objects (on the old renamed pool) by using the
rados command. It'll take a very very long time, but the process
should pick up speed slowly, as the objects shrink.

Yehuda
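
A rough sketch of what that rename-and-recreate approach could look like
(untested; the pool names follow the thread, while the PG count, batch size,
and the gc.0..gc.31 shard names are assumptions, and radosgw presumably needs
'rgw gc max objs' raised plus a restart before it populates the new pool):

# move the overgrown gc pool aside, then recreate it so new gc entries
# land in fresh, better-sharded objects
ceph osd pool rename .rgw.gc .rgw.gc.old
ceph osd pool create .rgw.gc 64 64

# then slowly trim an old shard's omap so it shrinks, instead of deleting
# the whole object (which is what causes the OSD availability problem)
rados -p .rgw.gc.old listomapkeys gc.0 | head -n 1000 |
while read key; do
    rados -p .rgw.gc.old rmomapkey gc.0 "$key"
done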

>
> Bryan
>
> From: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
> Date: Wednesday, October 25, 2017 at 11:32 AM
> To: Bryan Stillwell <bstillw...@godaddy.com>
> Cc: David Turner <drakonst...@gmail.com>, Ben Hines <bhi...@gmail.com>,
> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> Some of the options there won't do much for you as they'll only affect
> newer object removals. I think the default number of gc objects is
> just inadequate for your needs. You can try manually running
> 'radosgw-admin gc process' concurrently (for a start, 2 or 3
> processes), see if it makes any dent there. I think one of the problems
> is that the gc omaps grew so much that operations on them are too
> slow.
>
> Yehuda
>
> On Wed, Oct 25, 2017 at 9:05 AM, Bryan Stillwell <bstillw...@godaddy.com>
> wrote:
>
> We tried various options like the ones Ben mentioned to speed up the
> garbage collection process and were unsuccessful.  Luckily, we had the
> ability to create a new cluster and move all the data that wasn't part of
> the POC which created our problem.
>
> One of the things we ran into was the .rgw.gc pool became too large to
> handle drive failures without taking down the cluster.  We eventually had
> to move that pool to SSDs just to get the cluster healthy.  It was not
> obvious it was getting large though, because this is what it looked like
> in the 'ceph df' output:
>
>  NAME     ID USED  %USED MAX AVAIL OBJECTS
>  .rgw.gc  17     0     0      235G    2647
>
> However, if you look at the SSDs we used (repurposed journal SSDs to get
> out of the disaster) in 'ceph osd df' you can see quite a bit of data is
> being used:
>
> 410 0.2  1.0  181G 23090M   158G 12.44 0.18
> 411 0.2  1.0  181G 29105M   152G 15.68 0.22
> 412 0.2  1.0  181G   110G 72223M 61.08 0.86
> 413 0.2  1.0  181G 42964M   139G 23.15 0.33
> 414 0.2  1.0  181G 33530M   148G 18.07 0.26
> 415 0.2  1.0  181G 38420M   143G 20.70 0.29
> 416 0.2  1.0  181G 92215M 93355M 49.69 0.70
> 417 0.2  1.0  181G 64730M   118G 34.88 0.49
> 418 0.2  1.0  181G 61353M   121G 33.06 0.47
> 419 0.2  1.0  181G 77168M   105G 41.58 0.59
>
> That's ~560G of omap data for the .rgw.gc pool that isn't being reported
> in 'ceph df'.
>
> Right now the cluster is still around while we wait to verify the new
> cluster isn't missing anything.  So if there is anything the RGW
> developers would like to try on it to speed up the gc process, we should
> be able to do that.

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Bryan Stillwell
That helps a little bit, but overall the process would take years at this rate:

# for i in {1..3600}; do ceph df -f json-pretty |grep -A7 '".rgw.buckets"' 
|grep objects; sleep 60; done
"objects": 1660775838
"objects": 1660775733
"objects": 1660775548
"objects": 1660774825
"objects": 1660774790
"objects": 1660774735

This is on a hammer cluster.  Would upgrading to Jewel or Luminous speed up 
this process at all?
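
For reference, a variant of the loop above that prints the per-minute delta
directly (a sketch; assumes the hammer-era JSON layout of 'ceph df' and a
python interpreter on the box):

prev=0
while true; do
    # pull the current object count for .rgw.buckets out of the json output
    cur=$(ceph df -f json | python -c 'import json,sys; d=json.load(sys.stdin); print([p["stats"]["objects"] for p in d["pools"] if p["name"]==".rgw.buckets"][0])')
    # skip the first sample, then report how many objects disappeared
    [ "$prev" -gt 0 ] && echo "removed in the last minute: $((prev - cur))"
    prev=$cur
    sleep 60
done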

Bryan

From: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Date: Wednesday, October 25, 2017 at 11:32 AM
To: Bryan Stillwell <bstillw...@godaddy.com>
Cc: David Turner <drakonst...@gmail.com>, Ben Hines <bhi...@gmail.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Some of the options there won't do much for you as they'll only affect
newer object removals. I think the default number of gc objects is
just inadequate for your needs. You can try manually running
'radosgw-admin gc process' concurrently (for a start, 2 or 3
processes), see if it makes any dent there. I think one of the problems
is that the gc omaps grew so much that operations on them are too
slow.

Yehuda

On Wed, Oct 25, 2017 at 9:05 AM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
We tried various options like the ones Ben mentioned to speed up the garbage
collection process and were unsuccessful.  Luckily, we had the ability to 
create a new cluster and move all the data that wasn't part of the POC which 
created our problem.

One of the things we ran into was the .rgw.gc pool became too large to handle 
drive failures without taking down the cluster.  We eventually had to move that 
pool to SSDs just to get the cluster healthy.  It was not obvious it was 
getting large though, because this is what it looked like in the 'ceph df' 
output:

 NAME     ID USED  %USED MAX AVAIL OBJECTS
 .rgw.gc  17     0     0      235G    2647

However, if you look at the SSDs we used (repurposed journal SSDs to get out of 
the disaster) in 'ceph osd df' you can see quite a bit of data is being used:

410 0.2  1.0  181G 23090M   158G 12.44 0.18
411 0.2  1.0  181G 29105M   152G 15.68 0.22
412 0.2  1.0  181G   110G 72223M 61.08 0.86
413 0.2  1.0  181G 42964M   139G 23.15 0.33
414 0.2  1.0  181G 33530M   148G 18.07 0.26
415 0.2  1.0  181G 38420M   143G 20.70 0.29
416 0.2  1.0  181G 92215M 93355M 49.69 0.70
417 0.2  1.0  181G 64730M   118G 34.88 0.49
418 0.2  1.0  181G 61353M   121G 33.06 0.47
419 0.2  1.0  181G 77168M   105G 41.58 0.59

That's ~560G of omap data for the .rgw.gc pool that isn't being reported in 
'ceph df'.

Right now the cluster is still around while we wait to verify the new cluster 
isn't missing anything.  So if there is anything the RGW developers would like 
to try on it to speed up the gc process, we should be able to do that.

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com>
on behalf of David Turner <drakonst...@gmail.com>
Date: Tuesday, October 24, 2017 at 4:07 PM
To: Ben Hines <bhi...@gmail.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Thank you so much for chiming in, Ben.

Can you explain what each setting value means? I believe I understand min wait, 
that's just how long to wait before allowing the object to be cleaned up.  gc 
max objs is how many will be cleaned up during each period?  gc processor 
period is how often it will kick off gc to clean things up?  And gc processor 
max time is the longest the process can run after the period starts?  Is that 
about right?  I read somewhere that prime numbers are optimal
for gc max objs.  Do you know why that is?  I notice you're using one there.  
What is lc max objs?  I couldn't find a reference for that setting.

Additionally, do you know if the radosgw-admin gc list is ever cleaned up, or 
is it an ever growing list?  I got up to 3.6 Billion objects in the list before 
I killed the gc list command.

On Tue, Oct 24, 2017 at 4:47 PM Ben Hines <bhi...@gmail.com> wrote:
I agree the settings are rather confusing. We also have many millions of 
objects and had this trouble, so I set these rather aggressive gc settings on
our cluster which result in gc almost always running. We also use lifecycles to 
expire objects.

rgw lifecycle work time = 00:01-23:59
rgw gc max objs = 2647
rgw lc max objs = 2647
rgw gc obj min wait = 300
rgw gc processor period = 600
rgw gc processor max time = 600

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Yehuda Sadeh-Weinraub
Some of the options there won't do much for you as they'll only affect
newer object removals. I think the default number of gc objects is
just inadequate for your needs. You can try manually running
'radosgw-admin gc process' concurrently (for a start, 2 or 3
processes), see if it makes any dent there. I think one of the problems
is that the gc omaps grew so much that operations on them are too
slow.

Yehuda
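
Concretely, that might look something like the following (a sketch; the gc
processor takes a lock per gc shard, so concurrent runs should spread across
different shards rather than duplicate work):

# kick off three gc passes in parallel and wait for them all to finish
for i in 1 2 3; do
    radosgw-admin gc process &
done
wait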

On Wed, Oct 25, 2017 at 9:05 AM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
> We tried various options like the ones Ben mentioned to speed up the garbage
> collection process and were unsuccessful.  Luckily, we had the ability to 
> create a new cluster and move all the data that wasn't part of the POC which 
> created our problem.
>
> One of the things we ran into was the .rgw.gc pool became too large to handle 
> drive failures without taking down the cluster.  We eventually had to move 
> that pool to SSDs just to get the cluster healthy.  It was not obvious it was 
> getting large though, because this is what it looked like in the 'ceph df' 
> output:
>
> NAME     ID USED  %USED MAX AVAIL OBJECTS
> .rgw.gc  17     0     0      235G    2647
>
> However, if you look at the SSDs we used (repurposed journal SSDs to get out 
> of the disaster) in 'ceph osd df' you can see quite a bit of data is being 
> used:
>
> 410 0.2  1.0  181G 23090M   158G 12.44 0.18
> 411 0.2  1.0  181G 29105M   152G 15.68 0.22
> 412 0.2  1.0  181G   110G 72223M 61.08 0.86
> 413 0.2  1.0  181G 42964M   139G 23.15 0.33
> 414 0.2  1.0  181G 33530M   148G 18.07 0.26
> 415 0.2  1.0  181G 38420M   143G 20.70 0.29
> 416 0.2  1.0  181G 92215M 93355M 49.69 0.70
> 417 0.2  1.0  181G 64730M   118G 34.88 0.49
> 418 0.2  1.0  181G 61353M   121G 33.06 0.47
> 419 0.2  1.0  181G 77168M   105G 41.58 0.59
>
> That's ~560G of omap data for the .rgw.gc pool that isn't being reported in 
> 'ceph df'.
>
> Right now the cluster is still around while we wait to verify the new cluster 
> isn't missing anything.  So if there is anything the RGW developers would 
> like to try on it to speed up the gc process, we should be able to do that.
>
> Bryan
>
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David 
> Turner <drakonst...@gmail.com>
> Date: Tuesday, October 24, 2017 at 4:07 PM
> To: Ben Hines <bhi...@gmail.com>
> Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> Thank you so much for chiming in, Ben.
>
> Can you explain what each setting value means? I believe I understand min 
> wait, that's just how long to wait before allowing the object to be cleaned 
> up.  gc max objs is how many will be cleaned up during each period?  gc 
> processor period is how often it will kick off gc to clean things up?  And gc 
> processor max time is the longest the process can run after the period 
> starts?  Is that about right?  I read somewhere that prime
> numbers are optimal for gc max objs.  Do you know why that is?  I notice 
> you're using one there.  What is lc max objs?  I couldn't find a reference 
> for that setting.
>
> Additionally, do you know if the radosgw-admin gc list is ever cleaned up, or 
> is it an ever growing list?  I got up to 3.6 Billion objects in the list 
> before I killed the gc list command.
>
> On Tue, Oct 24, 2017 at 4:47 PM Ben Hines <bhi...@gmail.com> wrote:
> I agree the settings are rather confusing. We also have many millions of 
> objects and had this trouble, so I set these rather aggressive gc settings on
> our cluster which result in gc almost always running. We also use lifecycles 
> to expire objects.
>
> rgw lifecycle work time = 00:01-23:59
> rgw gc max objs = 2647
> rgw lc max objs = 2647
> rgw gc obj min wait = 300
> rgw gc processor period = 600
> rgw gc processor max time = 600
>
>
> -Ben
>
> On Tue, Oct 24, 2017 at 9:25 AM, David Turner <drakonst...@gmail.com> wrote:
> As I'm looking into this more and more, I'm realizing how big of a problem 
> garbage collection has been in our clusters.  The biggest cluster has over 1 
> billion objects in its gc list (the command is still running, it just 
> recently passed the 1B mark).  Does anyone have any guidance on what to do
> to optimize the gc settings to hopefully/eventually catch up on this as well 
> as stay caught up once we are?  I'm not expecting an overnight fix, but 
> something that could feasibly be caught up within 6 months would be wonderful.
>
On Mon, Oct 23, 2017 at 11:18 AM David Turner <drakonst...@gmail.com> wrote:

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-25 Thread Bryan Stillwell
We tried various options like the ones Ben mentioned to speed up the garbage
collection process and were unsuccessful.  Luckily, we had the ability to 
create a new cluster and move all the data that wasn't part of the POC which 
created our problem.

One of the things we ran into was the .rgw.gc pool became too large to handle 
drive failures without taking down the cluster.  We eventually had to move that 
pool to SSDs just to get the cluster healthy.  It was not obvious it was 
getting large though, because this is what it looked like in the 'ceph df' 
output:

NAME     ID USED  %USED MAX AVAIL OBJECTS
.rgw.gc  17     0     0      235G    2647

However, if you look at the SSDs we used (repurposed journal SSDs to get out of 
the disaster) in 'ceph osd df' you can see quite a bit of data is being used:

410 0.2  1.0  181G 23090M   158G 12.44 0.18
411 0.2  1.0  181G 29105M   152G 15.68 0.22
412 0.2  1.0  181G   110G 72223M 61.08 0.86
413 0.2  1.0  181G 42964M   139G 23.15 0.33
414 0.2  1.0  181G 33530M   148G 18.07 0.26
415 0.2  1.0  181G 38420M   143G 20.70 0.29
416 0.2  1.0  181G 92215M 93355M 49.69 0.70
417 0.2  1.0  181G 64730M   118G 34.88 0.49
418 0.2  1.0  181G 61353M   121G 33.06 0.47
419 0.2  1.0  181G 77168M   105G 41.58 0.59

That's ~560G of omap data for the .rgw.gc pool that isn't being reported in 
'ceph df'.
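
One way to watch that omap data directly is to count the pending entries per
gc shard (a sketch; gc shards are named gc.0 .. gc.N-1 with N = 'rgw gc max
objs', and listomapkeys can itself be slow on shards this large):

# print each gc shard object followed by its number of pending omap entries
for obj in $(rados -p .rgw.gc ls); do
    echo "$obj: $(rados -p .rgw.gc listomapkeys "$obj" | wc -l)"
done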

Right now the cluster is still around while we wait to verify the new cluster 
isn't missing anything.  So if there is anything the RGW developers would like 
to try on it to speed up the gc process, we should be able to do that.

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner 
<drakonst...@gmail.com>
Date: Tuesday, October 24, 2017 at 4:07 PM
To: Ben Hines <bhi...@gmail.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

Thank you so much for chiming in, Ben.

Can you explain what each setting value means? I believe I understand min wait, 
that's just how long to wait before allowing the object to be cleaned up.  gc 
max objs is how many will be cleaned up during each period?  gc processor 
period is how often it will kick off gc to clean things up?  And gc processor 
max time is the longest the process can run after the period starts?  Is that 
about right?  I read somewhere that prime numbers are optimal
for gc max objs.  Do you know why that is?  I notice you're using one there.  
What is lc max objs?  I couldn't find a reference for that setting.

Additionally, do you know if the radosgw-admin gc list is ever cleaned up, or 
is it an ever growing list?  I got up to 3.6 Billion objects in the list before 
I killed the gc list command.

On Tue, Oct 24, 2017 at 4:47 PM Ben Hines <bhi...@gmail.com> wrote:
I agree the settings are rather confusing. We also have many millions of 
objects and had this trouble, so I set these rather aggressive gc settings on
our cluster which result in gc almost always running. We also use lifecycles to 
expire objects. 

rgw lifecycle work time = 00:01-23:59
rgw gc max objs = 2647
rgw lc max objs = 2647
rgw gc obj min wait = 300
rgw gc processor period = 600
rgw gc processor max time = 600


-Ben

On Tue, Oct 24, 2017 at 9:25 AM, David Turner <drakonst...@gmail.com> wrote:
As I'm looking into this more and more, I'm realizing how big of a problem 
garbage collection has been in our clusters.  The biggest cluster has over 1 
billion objects in its gc list (the command is still running, it just recently 
passed the 1B mark).  Does anyone have any guidance on what to do to
optimize the gc settings to hopefully/eventually catch up on this as well as 
stay caught up once we are?  I'm not expecting an overnight fix, but something 
that could feasibly be caught up within 6 months would be wonderful.

On Mon, Oct 23, 2017 at 11:18 AM David Turner <drakonst...@gmail.com> wrote:
We recently deleted a bucket that was no longer needed that had 400TB of data 
in it to help as our cluster is getting quite full.  That should free up about 
30% of our cluster used space, but in the last week we haven't seen nearly a 
fraction of that free up yet.  I left the cluster with this running over the 
weekend to try to help `radosgw-admin --rgw-realm=local gc process`, but it 
didn't seem to put a dent into it.  Our regular ingestion is faster than how 
fast the garbage collection is cleaning stuff up, but our regular ingestion is 
less than 2% growth at its maximum.

As of yesterday our gc list was over 350GB when dumped into a file (I had to 
stop it as the disk I was redirecting the output to was almost full).  In the 
future I will use the --bypass-gc option to avoid the cleanup, but is there a 
way to speed up the gc once you're in this position?  There were about 8M
objects that were deleted from this bucket.  I've come across a few
references to the rgw-gc settings in the config, but nothing that explained
the times well enough for me to feel comfortable doing anything with them.

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-24 Thread David Turner
Thank you so much for chiming in, Ben.

Can you explain what each setting value means? I believe I understand min
wait, that's just how long to wait before allowing the object to be cleaned
up.  gc max objs is how many will be cleaned up during each period?  gc
processor period is how often it will kick off gc to clean things up?  And
gc processor max time is the longest the process can run after the period
starts?  Is that about right?  I read somewhere that prime
numbers are optimal for gc max objs.  Do you know why that is?  I notice
you're using one there.  What is lc max objs?  I couldn't find a reference
for that setting.

Additionally, do you know if the radosgw-admin gc list is ever cleaned up,
or is it an ever growing list?  I got up to 3.6 Billion objects in the list
before I killed the gc list command.
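
For anyone else following along, settings like Ben's below belong in the rgw
section of ceph.conf and take effect after a radosgw restart (the section
name here is a hypothetical example):

[client.rgw.gateway-1]
# process gc every 10 minutes across 2647 shards instead of hourly over 32
rgw lifecycle work time = 00:01-23:59
rgw gc max objs = 2647
rgw lc max objs = 2647
rgw gc obj min wait = 300
rgw gc processor period = 600
rgw gc processor max time = 600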

On Tue, Oct 24, 2017 at 4:47 PM Ben Hines <bhi...@gmail.com> wrote:

> I agree the settings are rather confusing. We also have many millions of
> objects and had this trouble, so I set these rather aggressive gc settings
> on our cluster which result in gc almost always running. We also use
> lifecycles to expire objects.
>
> rgw lifecycle work time = 00:01-23:59
> rgw gc max objs = 2647
> rgw lc max objs = 2647
> rgw gc obj min wait = 300
> rgw gc processor period = 600
> rgw gc processor max time = 600
>
>
> -Ben
>
> On Tue, Oct 24, 2017 at 9:25 AM, David Turner <drakonst...@gmail.com>
> wrote:
>
>> As I'm looking into this more and more, I'm realizing how big of a
>> problem garbage collection has been in our clusters.  The biggest cluster
>> has over 1 billion objects in its gc list (the command is still running, it
>> just recently passed the 1B mark).  Does anyone have any guidance on
>> what to do to optimize the gc settings to hopefully/eventually catch up on
>> this as well as stay caught up once we are?  I'm not expecting an overnight
>> fix, but something that could feasibly be caught up within 6 months would
>> be wonderful.
>>
>> On Mon, Oct 23, 2017 at 11:18 AM David Turner <drakonst...@gmail.com>
>> wrote:
>>
>>> We recently deleted a bucket that was no longer needed that had 400TB of
>>> data in it to help as our cluster is getting quite full.  That should free
>>> up about 30% of our cluster used space, but in the last week we haven't
>>> seen nearly a fraction of that free up yet.  I left the cluster with this
>>> running over the weekend to try to help `radosgw-admin --rgw-realm=local gc
>>> process`, but it didn't seem to put a dent into it.  Our regular ingestion
>>> is faster than how fast the garbage collection is cleaning stuff up, but
>>> our regular ingestion is less than 2% growth at its maximum.
>>>
>>> As of yesterday our gc list was over 350GB when dumped into a file (I
>>> had to stop it as the disk I was redirecting the output to was almost
>>> full).  In the future I will use the --bypass-gc option to avoid the
>>> cleanup, but is there a way to speed up the gc once you're in this
>>> position?  There were about 8M objects that were deleted from this bucket.
>>> I've come across a few references to the rgw-gc settings in the config, but
>>> nothing that explained the times well enough for me to feel comfortable
>>> doing anything with them.
>>>
>>> On Tue, Jul 25, 2017 at 4:01 PM Bryan Stillwell <bstillw...@godaddy.com>
>>> wrote:
>>>
>>>> Excellent, thank you!  It does exist in 0.94.10!  :)
>>>>
>>>> Bryan
>>>>
>>>> From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
>>>> Date: Tuesday, July 25, 2017 at 11:21 AM
>>>> To: Bryan Stillwell <bstillw...@godaddy.com>,
>>>> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>>>>
>>>> I’ve just realized that the option is present in Hammer (0.94.10) as
>>>> well, you should try that.
>>>>
>>>> From: Bryan Stillwell <bstillw...@godaddy.com>
>>>> Date: Tuesday, 25 July 2017 at 9:45 PM
>>>> To: Pavan Rallabhandi <prallabha...@walmartlabs.com>,
>>>> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>>> Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW

Re: [ceph-users] Speeding up garbage collection in RGW

2017-10-23 Thread David Turner
We recently deleted a bucket that was no longer needed that had 400TB of
data in it to help as our cluster is getting quite full.  That should free
up about 30% of our cluster used space, but in the last week we haven't
seen nearly a fraction of that free up yet.  I left the cluster with this
running over the weekend to try to help `radosgw-admin --rgw-realm=local gc
process`, but it didn't seem to put a dent into it.  Our regular ingestion
is faster than how fast the garbage collection is cleaning stuff up, but
our regular ingestion is less than 2% growth at its maximum.

As of yesterday our gc list was over 350GB when dumped into a file (I had
to stop it as the disk I was redirecting the output to was almost full).
In the future I will use the --bypass-gc option to avoid the cleanup, but
is there a way to speed up the gc once you're in this position?  There were
about 8M objects that were deleted from this bucket.  I've come across a
few references to the rgw-gc settings in the config, but nothing that
explained the times well enough for me to feel comfortable doing anything
with them.
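
In case it helps others, the pending entries can at least be counted without
dumping the whole list to disk (a sketch; --include-all also counts entries
still inside their min-wait window):

# count pending gc tail objects instead of redirecting the 350GB list to a file
radosgw-admin --rgw-realm=local gc list --include-all | grep -c '"oid"'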

On Tue, Jul 25, 2017 at 4:01 PM Bryan Stillwell <bstillw...@godaddy.com>
wrote:

> Excellent, thank you!  It does exist in 0.94.10!  :)
>
> Bryan
>
> From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
> Date: Tuesday, July 25, 2017 at 11:21 AM
> To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com"
> <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> I’ve just realized that the option is present in Hammer (0.94.10) as well,
> you should try that.
>
> From: Bryan Stillwell <bstillw...@godaddy.com>
> Date: Tuesday, 25 July 2017 at 9:45 PM
> To: Pavan Rallabhandi <prallabha...@walmartlabs.com>,
> "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW
>
> Unfortunately, we're on hammer still (0.94.10).  That option looks like it
> would work better, so maybe it's time to move the upgrade up in the
> schedule.
>
> I've been playing with the various gc options and I haven't seen any
> speedups like we would need to remove them in a reasonable amount of time.
>
> Thanks,
> Bryan
>
> From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
> Date: Tuesday, July 25, 2017 at 3:00 AM
> To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com"
> <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in
> radosgw-admin, which would remove the tail objects as well without marking
> them to be GCed.
>
> Thanks,
>
> On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" <
> ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com>
> wrote:
>
> I'm in the process of cleaning up a test that an internal customer did
> on our production cluster that produced over a billion objects spread
> across 6000 buckets.  So far I've been removing the buckets like this:
>
> printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin
> bucket rm --bucket={} --purge-objects
>
> However, the disk usage doesn't seem to be getting reduced at the same
> rate the objects are being removed.  From what I can tell a large number
> of the objects are waiting for garbage collection.
>
> When I first read the docs it sounded like the garbage collector would
> only remove 32 objects every hour, but after looking through the logs I'm
> seeing about 55,000 objects removed every hour.  That's about 1.3 million
> a day, so at this rate it'll take a couple years to clean up the rest!
> For comparison, the purge-objects command above is removing (but not
> GC'ing) about 30 million objects a day, so a much more manageable 33 days
> to finish.
>
> I've done some digging and it appears like I should be changing these
> configuration options:
>
> rgw gc max objs (default: 32)
> rgw gc obj min wait (default: 7200)
> rgw gc processor max time (default: 3600)
> rgw gc processor period (default: 3600)
>
> A few questions I have though are:
>
> Should 'rgw gc processor max time' and 'rgw gc processor period' always
> be set to the same value?
>
> Which would be better, increasing 'rgw gc max objs' to something like
> 1024, or reducing the 'rgw gc processor' times to something like 60
> seconds?
>
> Any other guidance on the best way to adjust these values?

Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Bryan Stillwell
Excellent, thank you!  It does exist in 0.94.10!  :)

Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 11:21 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.

From: Bryan Stillwell <bstillw...@godaddy.com>
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW

Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well without marking
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan




Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Pavan Rallabhandi
I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.

From: Bryan Stillwell <bstillw...@godaddy.com>
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW

Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well without marking
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan




Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Bryan Stillwell
Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well without marking
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan




Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Pavan Rallabhandi
If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well without marking
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com on behalf of bstillw...@godaddy.com> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan




Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Z Will
I think if you want to delete through gc, increase this:
OPTION(rgw_gc_processor_max_time, OPT_INT, 3600)  // total run time
for a single gc processor work
and decrease this:
OPTION(rgw_gc_processor_period, OPT_INT, 3600)  // gc processor cycle time

Or, there may be some option to bypass the gc.


On Tue, Jul 25, 2017 at 5:05 AM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
> Wouldn't doing it that way cause problems since references to the objects 
> wouldn't be getting removed from .rgw.buckets.index?
>
> Bryan
>
> From: Roger Brown <rogerpbr...@gmail.com>
> Date: Monday, July 24, 2017 at 2:43 PM
> To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
> <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Speeding up garbage collection in RGW
>
> I hope someone else can answer your question better, but in my case I found 
> something like this helpful to delete objects faster than I could through the 
> gateway:
>
> rados -p default.rgw.buckets.data ls | grep 'replace this with pattern 
> matching files you want to delete' | xargs -d '\n' -n 200 rados -p 
> default.rgw.buckets.data rm
>
>
> On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell <bstillw...@godaddy.com> 
> wrote:
> I'm in the process of cleaning up a test that an internal customer did on our 
> production cluster that produced over a billion objects spread across 6000 
> buckets.  So far I've been removing the buckets like this:
>
> printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
> --bucket={} --purge-objects
>
> However, the disk usage doesn't seem to be getting reduced at the same rate 
> the objects are being removed.  From what I can tell a large number of the 
> objects are waiting for garbage collection.
>
> When I first read the docs it sounded like the garbage collector would only 
> remove 32 objects every hour, but after looking through the logs I'm seeing 
> about 55,000 objects removed every hour.  That's about 1.3 million a day, so 
> at this rate it'll take a couple years to clean up the rest!  For comparison, 
> the purge-objects command above is removing (but not GC'ing) about 30 million 
> objects a day, so a much more manageable 33 days to finish.
>
> I've done some digging and it appears like I should be changing these 
> configuration options:
>
> rgw gc max objs (default: 32)
> rgw gc obj min wait (default: 7200)
> rgw gc processor max time (default: 3600)
> rgw gc processor period (default: 3600)
>
> A few questions I have though are:
>
> Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
> set to the same value?
>
> Which would be better, increasing 'rgw gc max objs' to something like 1024, 
> or reducing the 'rgw gc processor' times to something like 60 seconds?
>
> Any other guidance on the best way to adjust these values?
>
> Thanks,
> Bryan
>
>


Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Bryan Stillwell
Wouldn't doing it that way cause problems since references to the objects 
wouldn't be getting removed from .rgw.buckets.index?

Bryan
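
For completeness: index entries orphaned by out-of-band rados deletes can in
principle be reconciled afterwards with the bucket check machinery (a sketch;
the bucket name is a placeholder, and this can itself be very slow on huge
buckets):

# check the bucket index against the actual objects and repair what it finds
radosgw-admin bucket check --bucket=bucket1 --check-objects --fix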

From: Roger Brown <rogerpbr...@gmail.com>
Date: Monday, July 24, 2017 at 2:43 PM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

I hope someone else can answer your question better, but in my case I found 
something like this helpful to delete objects faster than I could through the 
gateway: 

rados -p default.rgw.buckets.data ls | grep 'replace this with pattern matching 
files you want to delete' | xargs -d '\n' -n 200 rados -p 
default.rgw.buckets.data rm


On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:
I'm in the process of cleaning up a test that an internal customer did on our 
production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
--bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate the 
objects are being removed.  From what I can tell a large number of the objects 
are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be set 
to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, or 
reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan




Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Roger Brown
I hope someone else can answer your question better, but in my case I found
something like this helpful to delete objects faster than I could through
the gateway:

rados -p default.rgw.buckets.data ls | grep 'replace this with pattern
matching files you want to delete' | xargs -d '\n' -n 200 rados -p
default.rgw.buckets.data rm


On Mon, Jul 24, 2017 at 2:02 PM Bryan Stillwell <bstillw...@godaddy.com>
wrote:

> I'm in the process of cleaning up a test that an internal customer did on
> our production cluster that produced over a billion objects spread across
> 6000 buckets.  So far I've been removing the buckets like this:
>
> printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket
> rm --bucket={} --purge-objects
>
> However, the disk usage doesn't seem to be getting reduced at the same
> rate the objects are being removed.  From what I can tell a large number of
> the objects are waiting for garbage collection.
>
> When I first read the docs it sounded like the garbage collector would
> only remove 32 objects every hour, but after looking through the logs I'm
> seeing about 55,000 objects removed every hour.  That's about 1.3 million a
> day, so at this rate it'll take a couple years to clean up the rest!  For
> comparison, the purge-objects command above is removing (but not GC'ing)
> about 30 million objects a day, so a much more manageable 33 days to finish.
>
> I've done some digging and it appears like I should be changing these
> configuration options:
>
> rgw gc max objs (default: 32)
> rgw gc obj min wait (default: 7200)
> rgw gc processor max time (default: 3600)
> rgw gc processor period (default: 3600)
>
> A few questions I have though are:
>
> Should 'rgw gc processor max time' and 'rgw gc processor period' always be
> set to the same value?
>
> Which would be better, increasing 'rgw gc max objs' to something like
> 1024, or reducing the 'rgw gc processor' times to something like 60 seconds?
>
> Any other guidance on the best way to adjust these values?
>
> Thanks,
> Bryan
>
>


[ceph-users] Speeding up garbage collection in RGW

2017-07-24 Thread Bryan Stillwell
I'm in the process of cleaning up a test that an internal customer did on our 
production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm 
--bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate the 
objects are being removed.  From what I can tell a large number of the objects 
are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be set 
to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, or 
reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan

