Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On Thu, Jun 22, 2017 at 5:31 PM, Casey Bodley wrote:
>
> On 06/22/2017 10:40 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley wrote:
>>>
>>> On 06/22/2017 04:00 AM, Dan van der Ster wrote:
>>>>
>>>> I'm now running the three relevant OSDs with that patch. (Recompiled,
>>>> replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
>>>> then restarted the osds.)
>>>>
>>>> It's working quite well, trimming 10 entries at a time instead of
>>>> 1000, and no more timeouts.
>>>>
>>>> Do you think it would be worth decreasing this hardcoded value in
>>>> ceph proper?
>>>>
>>>> -- Dan
>>>
>>> I do, yeah. At least, the trim operation should be able to pass in its
>>> own value for that. I opened a ticket for that at
>>> http://tracker.ceph.com/issues/20382.
>>>
>>> I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
>>> operation to trim a range of keys in a single osd op, instead of
>>> generating a different op for each key. I have a PR that does this at
>>> https://github.com/ceph/ceph/pull/15183. But it's still hard to
>>> guarantee that leveldb can process the entire range inside of the
>>> suicide timeout.
>>
>> I wonder if that would help. Here's what I've learned today:
>>
>> * Two of the three relevant OSDs have something screwy with their
>>   leveldb. The primary and 3rd replica are quick at trimming only for
>>   the first few hundred keys, whilst the 2nd OSD is always very fast.
>> * After manually compacting the two slow OSDs, they are fast again for
>>   just a few hundred trims. So I'm compacting, trimming, ..., in a
>>   loop now.
>> * I moved the omaps to SSDs -- it doesn't help. (iostat confirms this
>>   is not IO bound.)
>> * CPU util on the slow OSDs gets quite high during the slow trimming.
>> * perf top is below [1]. leveldb::Block::Iter::Prev and
>>   leveldb::InternalKeyComparator::Compare are notable.
>> * The always-fast OSD shows no leveldb functions in perf top while
>>   trimming.
>>
>> I've tried bigger leveldb cache and block sizes, compression on and
>> off, and played with the bloom size up to 14 bits -- none of these
>> changes makes any difference.
>>
>> At this point I'm not confident this trimming will ever complete --
>> there are ~20 million records to remove at maybe 1 Hz.
>>
>> How about I just delete the meta.log object? Would this use a
>> different, perhaps quicker, code path to remove those omap keys?
>>
>> Thanks!
>>
>> Dan
>>
>> [1]
>>    4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e8d
>>    4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
>>    4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x000273bb
>>    3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
>>    3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
>>    2.84%  [kernel]                                 [k] port_inb
>>    2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
>>    2.75%  libstdc++.so.6.0.19                      [.] std::string::append
>>    2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
>>    1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e77
>>    0.85%  [kernel]                                 [k] _raw_spin_lock
>>    0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
>>    0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a05
>>    0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
>>    0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a09
>>    0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
>>    0.51%  [kernel]                                 [k] __schedule
>>    0.48%  libruby.so.2.1.0                         [.] ruby_yyparse
>
> Hi Dan,
>
> Removing an object will try to delete all of its keys at once, which
> should be much faster.
> It's also very likely to hit your suicide timeout, so you'll have to
> keep retrying until it stops killing your osd.

Well, that was quick. The object delete took around 30s. I then
restarted the osd to compact it, and now the leveldb is ~100MB. Phew!

In summary, if someone finds themselves in this predicament (a huge
mdlog on a single-region rgw cluster), I'd advise turning the mdlog
off, then just deleting the meta.log objects.

Thanks!

Dan
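For anyone else doing this cleanup, here is a rough sketch with the
rados CLI. The pool name is an assumption -- check 'radosgw-admin zone
get' for the actual metadata log pool on your cluster -- and, per
Casey's warning above, each delete can still run long enough to trouble
the osd serving it:

    # Hedged sketch, not a tested procedure: enumerate the mdlog shard
    # objects and delete each one (dropping all of its omap keys at once).
    pool=.log    # assumed log pool name; verify before running
    rados -p "$pool" ls | grep '^meta\.log\.' | while read -r obj; do
        echo "removing $obj"
        rados -p "$pool" rm "$obj"
    done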
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On 06/22/2017 10:40 AM, Dan van der Ster wrote:

On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley wrote:

On 06/22/2017 04:00 AM, Dan van der Ster wrote:

I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds.)

It's working quite well, trimming 10 entries at a time instead of 1000,
and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph
proper?

-- Dan

I do, yeah. At least, the trim operation should be able to pass in its
own value for that. I opened a ticket for that at
http://tracker.ceph.com/issues/20382.

I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
operation to trim a range of keys in a single osd op, instead of
generating a different op for each key. I have a PR that does this at
https://github.com/ceph/ceph/pull/15183. But it's still hard to
guarantee that leveldb can process the entire range inside of the
suicide timeout.

I wonder if that would help. Here's what I've learned today:

* Two of the three relevant OSDs have something screwy with their
leveldb. The primary and 3rd replica are quick at trimming only for the
first few hundred keys, whilst the 2nd OSD is always very fast.
* After manually compacting the two slow OSDs, they are fast again for
just a few hundred trims. So I'm compacting, trimming, ..., in a loop
now.
* I moved the omaps to SSDs -- it doesn't help. (iostat confirms this
is not IO bound.)
* CPU util on the slow OSDs gets quite high during the slow trimming.
* perf top is below [1]. leveldb::Block::Iter::Prev and
leveldb::InternalKeyComparator::Compare are notable.
* The always-fast OSD shows no leveldb functions in perf top while
trimming.

I've tried bigger leveldb cache and block sizes, compression on and
off, and played with the bloom size up to 14 bits -- none of these
changes makes any difference.

At this point I'm not confident this trimming will ever complete --
there are ~20 million records to remove at maybe 1 Hz.

How about I just delete the meta.log object? Would this use a
different, perhaps quicker, code path to remove those omap keys?

Thanks!

Dan

[1]
   4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e8d
   4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
   4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x000273bb
   3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
   3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
   2.84%  [kernel]                                 [k] port_inb
   2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
   2.75%  libstdc++.so.6.0.19                      [.] std::string::append
   2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
   1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e77
   0.85%  [kernel]                                 [k] _raw_spin_lock
   0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
   0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a05
   0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
   0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a09
   0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
   0.51%  [kernel]                                 [k] __schedule
   0.48%  libruby.so.2.1.0                         [.] ruby_yyparse

Hi Dan,

Removing an object will try to delete all of its keys at once, which
should be much faster. It's also very likely to hit your suicide
timeout, so you'll have to keep retrying until it stops killing your
osd.
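Casey's "keep retrying" advice, sketched as a shell loop. The pool name
is an assumption (verify with 'radosgw-admin zone get'), the shard-54
object name is taken from the log excerpts elsewhere in this thread,
and the sleep is a crude allowance for the osd to restart after a
suicide:

    # Hedged sketch: retry the object delete until it finally succeeds.
    pool=.log    # assumed log pool name; verify before running
    obj=meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54
    until rados -p "$pool" rm "$obj"; do
        echo "delete interrupted (osd suicided?); retrying in 60s"
        sleep 60
    done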
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley wrote:
>
> On 06/22/2017 04:00 AM, Dan van der Ster wrote:
>>
>> I'm now running the three relevant OSDs with that patch. (Recompiled,
>> replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
>> then restarted the osds.)
>>
>> It's working quite well, trimming 10 entries at a time instead of
>> 1000, and no more timeouts.
>>
>> Do you think it would be worth decreasing this hardcoded value in ceph
>> proper?
>>
>> -- Dan
>
> I do, yeah. At least, the trim operation should be able to pass in its
> own value for that. I opened a ticket for that at
> http://tracker.ceph.com/issues/20382.
>
> I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
> operation to trim a range of keys in a single osd op, instead of
> generating a different op for each key. I have a PR that does this at
> https://github.com/ceph/ceph/pull/15183. But it's still hard to
> guarantee that leveldb can process the entire range inside of the
> suicide timeout.

I wonder if that would help. Here's what I've learned today:

* Two of the three relevant OSDs have something screwy with their
leveldb. The primary and 3rd replica are quick at trimming only for the
first few hundred keys, whilst the 2nd OSD is always very fast.
* After manually compacting the two slow OSDs, they are fast again for
just a few hundred trims. So I'm compacting, trimming, ..., in a loop
now.
* I moved the omaps to SSDs -- it doesn't help. (iostat confirms this
is not IO bound.)
* CPU util on the slow OSDs gets quite high during the slow trimming.
* perf top is below [1]. leveldb::Block::Iter::Prev and
leveldb::InternalKeyComparator::Compare are notable.
* The always-fast OSD shows no leveldb functions in perf top while
trimming.

I've tried bigger leveldb cache and block sizes, compression on and
off, and played with the bloom size up to 14 bits -- none of these
changes makes any difference.

At this point I'm not confident this trimming will ever complete --
there are ~20 million records to remove at maybe 1 Hz.

How about I just delete the meta.log object? Would this use a
different, perhaps quicker, code path to remove those omap keys?

Thanks!

Dan

[1]
   4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e8d
   4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
   4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x000273bb
   3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
   3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
   2.84%  [kernel]                                 [k] port_inb
   2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
   2.75%  libstdc++.so.6.0.19                      [.] std::string::append
   2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
   1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e77
   0.85%  [kernel]                                 [k] _raw_spin_lock
   0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
   0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a05
   0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
   0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a09
   0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
   0.51%  [kernel]                                 [k] __schedule
   0.48%  libruby.so.2.1.0                         [.] ruby_yyparse
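Dan's compact-trim loop can be scripted; a rough sketch, using the
offline compaction Casey suggests later in this thread ('leveldb
compact on mount = true' already set in the osd's config section). The
osd id, shard and period are this thread's values, not a general
recipe:

    # Hedged sketch of the "compact, trim, repeat" loop.
    period=8d4fcb63-c314-4f9a-b3b3-0e61719ec258
    while true; do
        systemctl restart ceph-osd@155      # compacts leveldb on mount
        sleep 60                            # crude wait for the osd to rejoin
        radosgw-admin mdlog trim --shard-id 54 \
            --period "$period" --end-time 2020-1-1
    done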
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On 06/22/2017 04:00 AM, Dan van der Ster wrote:

I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds.)

It's working quite well, trimming 10 entries at a time instead of 1000,
and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph
proper?

-- Dan

I do, yeah. At least, the trim operation should be able to pass in its
own value for that. I opened a ticket for that at
http://tracker.ceph.com/issues/20382.

I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
operation to trim a range of keys in a single osd op, instead of
generating a different op for each key. I have a PR that does this at
https://github.com/ceph/ceph/pull/15183. But it's still hard to
guarantee that leveldb can process the entire range inside of the
suicide timeout.

Casey

On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley wrote:

That patch looks reasonable. You could also try raising the values of
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout
on that osd in order to trim more at a time.

On 06/21/2017 09:27 AM, Dan van der Ster wrote:

Hi Casey,

I managed to trim up all shards except for that big #54. The others all
trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any way
to speed up the trimming. It's now going at ~1-2 Hz, so 1000 trims per
op locks things up for quite a while.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
     to_index = op.to_marker;
   }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
   size_t max_entries = MAX_TRIM_ENTRIES;

   int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);

What do you think?

-- Dan

On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley wrote:

Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this
osd (you would see a lot of 'leveldb:' output in the log), consider
running offline compaction by adding 'leveldb compact on mount = true'
to the osd config and restarting.

Casey

On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
...
2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we set up to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a few
questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?

It's a good idea to turn that off, yes.

First, make sure that you have configured a default
realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm <realm name> (you can
determine the realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take
care with future radosgw-admin commands on the zone/zonegroup, as they
may revert log_meta back to true [1].

Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On Wed, Jun 21, 2017 at 4:16 PM, Peter Maloney wrote:
> On 06/14/17 11:59, Dan van der Ster wrote:
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok
>
> This looks just like my problem in my thread on ceph-devel, "another
> scrub bug? blocked for > 10240.948831 secs", except that your scrub
> eventually finished (mine ran for hours before I stopped it manually),
> and I'm not using rgw.
>
> Sage commented that it is likely related to snaps being removed at some
> point and interacting with scrub.
>
> Restarting the osd that is mentioned there (osd.155 in your case) will
> fix it for now. And tuning scrub changes the way it behaves (the
> defaults make it happen more rarely than what I had before).

In my case it's not related to snaps -- there are no snaps (or
trimming) in a (normal) set of rgw pools. My problem is about the
cls_log class, which tries to do a lot of work in one op, timing out
the osds.

Well, the *real* problem in my case is this rgw mdlog, which can grow
unboundedly and eventually become un-scrubbable, leading to this huge
amount of cleanup to be done.

-- dan
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds.)

It's working quite well, trimming 10 entries at a time instead of 1000,
and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph
proper?

-- Dan

On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley wrote:
> That patch looks reasonable. You could also try raising the values of
> osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout
> on that osd in order to trim more at a time.
>
> On 06/21/2017 09:27 AM, Dan van der Ster wrote:
>>
>> Hi Casey,
>>
>> I managed to trim up all shards except for that big #54. The others
>> all trimmed within a few seconds.
>>
>> But 54 is proving difficult. It's still going after several days, and
>> now I see that the 1000-key trim is indeed causing osd timeouts. I've
>> manually compacted the relevant osd leveldbs, but haven't found any
>> way to speed up the trimming. It's now going at ~1-2 Hz, so 1000
>> trims per op locks things up for quite a while.
>>
>> I'm thinking of running those ceph-osd's with this patch:
>>
>> # git diff
>> diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
>> index 89745bb..7dcd933 100644
>> --- a/src/cls/log/cls_log.cc
>> +++ b/src/cls/log/cls_log.cc
>> @@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
>>      to_index = op.to_marker;
>>    }
>>
>> -#define MAX_TRIM_ENTRIES 1000
>> +#define MAX_TRIM_ENTRIES 10
>>    size_t max_entries = MAX_TRIM_ENTRIES;
>>
>>    int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);
>>
>> What do you think?
>>
>> -- Dan
>>
>> On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley wrote:
>>>
>>> Hi Dan,
>>>
>>> That's good news that it can remove 1000 keys at a time without
>>> hitting timeouts. The output of 'du' will depend on when the leveldb
>>> compaction runs. If you do find that compaction leads to suicide
>>> timeouts on this osd (you would see a lot of 'leveldb:' output in
>>> the log), consider running offline compaction by adding 'leveldb
>>> compact on mount = true' to the osd config and restarting.
>>>
>>> Casey
>>>
>>> On 06/19/2017 11:01 AM, Dan van der Ster wrote:
>>>>
>>>> On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote:
>>>>>
>>>>> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>>>>>
>>>>>> Dear ceph users,
>>>>>>
>>>>>> Today we had O(100) slow requests which were caused by
>>>>>> deep-scrubbing of the metadata log:
>>>>>>
>>>>>> 2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
>>>>>> ...
>>>>>> 2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
>>>>>> ...
>>>>>> 2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok
>>>>>>
>>>>>> We have log_meta: true, log_data: false on this (our only) region
>>>>>> [1], which IIRC we set up to enable indexless buckets.
>>>>>>
>>>>>> I'm obviously unfamiliar with rgw meta and data logging, and have
>>>>>> a few questions:
>>>>>>
>>>>>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to
>>>>>> turn it off when not using multisite?
>>>>>
>>>>> It's a good idea to turn that off, yes.
>>>>>
>>>>> First, make sure that you have configured a default
>>>>> realm/zonegroup/zone:
>>>>>
>>>>> $ radosgw-admin realm default --rgw-realm <realm name> (you can
>>>>> determine the realm name from 'radosgw-admin realm list')
>>>>> $ radosgw-admin zonegroup default --rgw-zonegroup default
>>>>> $ radosgw-admin zone default --rgw-zone default
>>>>
>>>> Thanks. This had already been done, as confirmed with radosgw-admin
>>>> realm get-default.
>>>>
>>>>> Then you can modify the zonegroup (aka region):
>>>>>
>>>>> $ radosgw-admin zonegroup get > zonegroup.json
>>>>> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
>>>>> $ radosgw-admin zonegroup set < zonegroup.json
>>>>>
>>>>> Then commit the updated period configuration:
>>>>>
>>>>> $ radosgw-admin period update --commit
>>>>>
>>>>> Verify that the resulting period contains "log_meta": "false".
>>>>> Take care with future radosgw-admin commands on the
>>>>> zone/zonegroup, as they may revert log_meta back to true [1].
>>>>
>>>> Great, this worked. FYI (and for others trying this in future), the
>>>> period update --commit blocks all rgws for ~30s while they reload
>>>> the realm.
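For completeness, the hot-swap described at the top of this message
looks roughly like the following. This is a sketch under assumptions:
the patched libcls_log.so has already been built from the matching ceph
source tree, the rados-classes path is the RHEL/CentOS default from the
thread, and the osd ids are placeholders for the three OSDs serving the
mdlog pg:

    # Hedged sketch of swapping in a rebuilt cls_log object class.
    cp libcls_log.so /usr/lib64/rados-classes/libcls_log.so
    for id in 155 156 157; do      # hypothetical osd ids
        systemctl restart ceph-osd@$id
    done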
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On 06/14/17 11:59, Dan van der Ster wrote:
> Dear ceph users,
>
> Today we had O(100) slow requests which were caused by deep-scrubbing
> of the metadata log:
>
> 2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
> ...
> 2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
> ...
> 2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok

This looks just like my problem in my thread on ceph-devel, "another
scrub bug? blocked for > 10240.948831 secs", except that your scrub
eventually finished (mine ran for hours before I stopped it manually),
and I'm not using rgw.

Sage commented that it is likely related to snaps being removed at some
point and interacting with scrub.

Restarting the osd that is mentioned there (osd.155 in your case) will
fix it for now. And tuning scrub changes the way it behaves (the
defaults make it happen more rarely than what I had before).

--
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de
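A stopgap along the lines Peter describes -- bounce the stuck osd, and
keep deep-scrubs from immediately re-triggering while you investigate.
Note that the nodeep-scrub flag here is my addition, not Peter's
suggestion:

    # Hedged sketch: pause deep-scrubs cluster-wide, restart the osd
    # holding the stuck pg, and remember to unset the flag afterwards.
    ceph osd set nodeep-scrub
    systemctl restart ceph-osd@155
    # ... later, once the underlying omap problem is resolved:
    ceph osd unset nodeep-scrub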
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
That patch looks reasonable. You could also try raising the values of
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout
on that osd in order to trim more at a time.

On 06/21/2017 09:27 AM, Dan van der Ster wrote:

Hi Casey,

I managed to trim up all shards except for that big #54. The others all
trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any way
to speed up the trimming. It's now going at ~1-2 Hz, so 1000 trims per
op locks things up for quite a while.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
     to_index = op.to_marker;
   }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
   size_t max_entries = MAX_TRIM_ENTRIES;

   int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);

What do you think?

-- Dan

On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley wrote:

Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this
osd (you would see a lot of 'leveldb:' output in the log), consider
running offline compaction by adding 'leveldb compact on mount = true'
to the osd config and restarting.

Casey

On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
...
2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we set up to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a few
questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?

It's a good idea to turn that off, yes.

First, make sure that you have configured a default
realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm <realm name> (you can
determine the realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take
care with future radosgw-admin commands on the zone/zonegroup, as they
may revert log_meta back to true [1].

Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.

2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json and
I don't know how much more it would have written. Is something supposed
to be trimming the mdlog automatically?

There is automated mdlog trimming logic in master, but not
jewel/kraken. And this logic won't be triggered if there is only one
zone [2].

3. ceph df doesn't show the space occupied by omap objects -- is there
an indirect way to see how much space these are using?

You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap

Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.

4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?

There is a 'radosgw-admin mdlog trim' command that can be used to trim
them one --shard-id (from 0 to 63) at a time.
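Casey's suggestion at the top of this message, expressed as a config
sketch. The timeout values are arbitrary illustrations, not tuning
advice, and 'leveldb compact on mount' is the offline-compaction knob
from later in this thread:

    # Hedged sketch: raise the suicide timeouts on the affected osd and
    # enable offline compaction at mount; restart for the options to apply.
    cat >> /etc/ceph/ceph.conf <<'EOF'
    [osd.155]
    osd op thread suicide timeout = 600
    filestore op thread suicide timeout = 600
    leveldb compact on mount = true
    EOF
    systemctl restart ceph-osd@155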
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
Hi Casey,

I managed to trim up all shards except for that big #54. The others all
trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any way
to speed up the trimming. It's now going at ~1-2 Hz, so 1000 trims per
op locks things up for quite a while.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
     to_index = op.to_marker;
   }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
   size_t max_entries = MAX_TRIM_ENTRIES;

   int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);

What do you think?

-- Dan

On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley wrote:
> Hi Dan,
>
> That's good news that it can remove 1000 keys at a time without
> hitting timeouts. The output of 'du' will depend on when the leveldb
> compaction runs. If you do find that compaction leads to suicide
> timeouts on this osd (you would see a lot of 'leveldb:' output in the
> log), consider running offline compaction by adding 'leveldb compact
> on mount = true' to the osd config and restarting.
>
> Casey
>
> On 06/19/2017 11:01 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote:
>>>
>>> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>>>
>>>> Dear ceph users,
>>>>
>>>> Today we had O(100) slow requests which were caused by
>>>> deep-scrubbing of the metadata log:
>>>>
>>>> 2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
>>>> ...
>>>> 2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
>>>> ...
>>>> 2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok
>>>>
>>>> We have log_meta: true, log_data: false on this (our only) region
>>>> [1], which IIRC we set up to enable indexless buckets.
>>>>
>>>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>>>> few questions:
>>>>
>>>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to
>>>> turn it off when not using multisite?
>>>
>>> It's a good idea to turn that off, yes.
>>>
>>> First, make sure that you have configured a default
>>> realm/zonegroup/zone:
>>>
>>> $ radosgw-admin realm default --rgw-realm <realm name> (you can
>>> determine the realm name from 'radosgw-admin realm list')
>>> $ radosgw-admin zonegroup default --rgw-zonegroup default
>>> $ radosgw-admin zone default --rgw-zone default
>>
>> Thanks. This had already been done, as confirmed with radosgw-admin
>> realm get-default.
>>
>>> Then you can modify the zonegroup (aka region):
>>>
>>> $ radosgw-admin zonegroup get > zonegroup.json
>>> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
>>> $ radosgw-admin zonegroup set < zonegroup.json
>>>
>>> Then commit the updated period configuration:
>>>
>>> $ radosgw-admin period update --commit
>>>
>>> Verify that the resulting period contains "log_meta": "false". Take
>>> care with future radosgw-admin commands on the zone/zonegroup, as
>>> they may revert log_meta back to true [1].
>>
>> Great, this worked. FYI (and for others trying this in future), the
>> period update --commit blocks all rgws for ~30s while they reload the
>> realm.
>>
>>>> 2. I started dumping the output of radosgw-admin mdlog list, and
>>>> cancelled it after a few minutes. It had already dumped 3GB of json
>>>> and I don't know how much more it would have written. Is something
>>>> supposed to be trimming the mdlog automatically?
>>>
>>> There is automated mdlog trimming logic in master, but not
>>> jewel/kraken. And this logic won't be triggered if there is only one
>>> zone [2].
>>>
>>>> 3. ceph df doesn't show the space occupied by omap objects -- is
>>>> there an indirect way to see how much space these are using?
>>>
>>> You can inspect the osd's omap directory: du -sh
>>> /var/lib/ceph/osd/osd0/current/omap
>>
>> Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
>> ~100-300MB on other OSDs.
>>
>>>> 4. mdlog status has markers going back to 2016-10, see [2]. I
>>>> suppose we're not using this feature correctly? :-/
>>>>
>>>> 5. Suppose I were to set log_meta: false -- how would I delete
>>>> these log entries now that they are not needed?
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this
osd (you would see a lot of 'leveldb:' output in the log), consider
running offline compaction by adding 'leveldb compact on mount = true'
to the osd config and restarting.

Casey

On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
...
2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we set up to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a few
questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?

It's a good idea to turn that off, yes.

First, make sure that you have configured a default
realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm <realm name> (you can
determine the realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take
care with future radosgw-admin commands on the zone/zonegroup, as they
may revert log_meta back to true [1].

Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.

2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json and
I don't know how much more it would have written. Is something supposed
to be trimming the mdlog automatically?

There is automated mdlog trimming logic in master, but not
jewel/kraken. And this logic won't be triggered if there is only one
zone [2].

3. ceph df doesn't show the space occupied by omap objects -- is there
an indirect way to see how much space these are using?

You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap

Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.

4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?

There is a 'radosgw-admin mdlog trim' command that can be used to trim
them one --shard-id (from 0 to 63) at a time. An entire log shard can
be trimmed with:

$ radosgw-admin mdlog trim --shard-id 0 --period
8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1

*However*, there is a risk that bulk operations on large omaps will
affect cluster health by taking down OSDs. Not only can this bulk
deletion take long enough to trigger the osd/filestore suicide
timeouts, the resulting leveldb compaction after deletion is likely to
block other omap operations and hit the timeouts as well. This seems
likely in your case, based on the fact that you're already having
issues with scrub.

We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:

2017-06-19 16:57:34.347222 7fc602640700 15 filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid: #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head# not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore  > header.spos 0.0.0
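To compare omap sizes across OSDs the way Dan does above, a small
sketch to run on each OSD host (filestore layout, matching the
/var/lib/ceph/osd/ceph-N paths in this thread):

    # Hedged sketch: report the leveldb omap size of every osd on this host.
    for dir in /var/lib/ceph/osd/ceph-*; do
        printf '%s\t%s\n' "$(du -sh "$dir/current/omap" | cut -f1)" "$dir"
    done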
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote:
>
> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok
>>
>> We have log_meta: true, log_data: false on this (our only) region [1],
>> which IIRC we set up to enable indexless buckets.
>>
>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>> few questions:
>>
>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
>> it off when not using multisite?
>
> It's a good idea to turn that off, yes.
>
> First, make sure that you have configured a default
> realm/zonegroup/zone:
>
> $ radosgw-admin realm default --rgw-realm <realm name> (you can
> determine the realm name from 'radosgw-admin realm list')
> $ radosgw-admin zonegroup default --rgw-zonegroup default
> $ radosgw-admin zone default --rgw-zone default

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

> Then you can modify the zonegroup (aka region):
>
> $ radosgw-admin zonegroup get > zonegroup.json
> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
> $ radosgw-admin zonegroup set < zonegroup.json
>
> Then commit the updated period configuration:
>
> $ radosgw-admin period update --commit
>
> Verify that the resulting period contains "log_meta": "false". Take
> care with future radosgw-admin commands on the zone/zonegroup, as they
> may revert log_meta back to true [1].

Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.

>> 2. I started dumping the output of radosgw-admin mdlog list, and
>> cancelled it after a few minutes. It had already dumped 3GB of json
>> and I don't know how much more it would have written. Is something
>> supposed to be trimming the mdlog automatically?
>
> There is automated mdlog trimming logic in master, but not
> jewel/kraken. And this logic won't be triggered if there is only one
> zone [2].
>
>> 3. ceph df doesn't show the space occupied by omap objects -- is
>> there an indirect way to see how much space these are using?
>
> You can inspect the osd's omap directory: du -sh
> /var/lib/ceph/osd/osd0/current/omap

Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.

>> 4. mdlog status has markers going back to 2016-10, see [2]. I suppose
>> we're not using this feature correctly? :-/
>>
>> 5. Suppose I were to set log_meta: false -- how would I delete these
>> log entries now that they are not needed?
>
> There is a 'radosgw-admin mdlog trim' command that can be used to trim
> them one --shard-id (from 0 to 63) at a time. An entire log shard can
> be trimmed with:
>
> $ radosgw-admin mdlog trim --shard-id 0 --period
> 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
>
> *However*, there is a risk that bulk operations on large omaps will
> affect cluster health by taking down OSDs. Not only can this bulk
> deletion take long enough to trigger the osd/filestore suicide
> timeouts, the resulting leveldb compaction after deletion is likely to
> block other omap operations and hit the timeouts as well. This seems
> likely in your case, based on the fact that you're already having
> issues with scrub.

We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:

2017-06-19 16:57:34.347222 7fc602640700 15 filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid: #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head# not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore  > header.spos 0.0.0
2017-06-19 16:57:34.347351 7fc602640700 15 filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid: #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
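Casey's "verify that the resulting period contains log_meta: false"
step above is easy to script; a sketch:

    # Hedged sketch: confirm the committed period really has log_meta off.
    radosgw-admin period get | grep '"log_meta"'
    # expected output, per the thread:    "log_meta": "false",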
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub
...
2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we set up to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a few
questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?

It's a good idea to turn that off, yes.

First, make sure that you have configured a default
realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm <realm name> (you can
determine the realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default

Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take
care with future radosgw-admin commands on the zone/zonegroup, as they
may revert log_meta back to true [1].

2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json and
I don't know how much more it would have written. Is something supposed
to be trimming the mdlog automatically?

There is automated mdlog trimming logic in master, but not
jewel/kraken. And this logic won't be triggered if there is only one
zone [2].

3. ceph df doesn't show the space occupied by omap objects -- is there
an indirect way to see how much space these are using?

You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap

4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?

There is a 'radosgw-admin mdlog trim' command that can be used to trim
them one --shard-id (from 0 to 63) at a time. An entire log shard can
be trimmed with:

$ radosgw-admin mdlog trim --shard-id 0 --period
8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1

*However*, there is a risk that bulk operations on large omaps will
affect cluster health by taking down OSDs. Not only can this bulk
deletion take long enough to trigger the osd/filestore suicide
timeouts, the resulting leveldb compaction after deletion is likely to
block other omap operations and hit the timeouts as well. This seems
likely in your case, based on the fact that you're already having
issues with scrub.

Apologies if there are already good docs about this, which eluded my
googling.

Best Regards,

Dan

[1] region get:

{
    "id": "61c0ff1a-4330-405a-9eb1-bb494d4daf82",
    "name": "default",
    "api_name": "default",
    "is_master": "true",
    "endpoints": [],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "61c59385-085d-4caa-9070-63a3868dccb6",
    "zones": [
        {
            "id": "61c59385-085d-4caa-9070-63a3868dccb6",
            "name": "default",
            "endpoints": [],
            "log_meta": "true",
            "log_data": "false",
            "bucket_index_max_shards": 32,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        },
        {
            "name": "indexless",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "552868ad-8898-4afb-a775-911297961cee"
}

[2] mdlog status:

No --period given, using current period=8d4fcb63-c314-4f9a-b3b3-0e61719ec258
[
    ...
    {
        "marker": "1_1475568296.712634_3.1",
        "last_update": "2016-10-04 08:04:56.712634Z"
    },
    ...

[1] http://tracker.ceph.com/issues/20320
[2] http://tracker.ceph.com/issues/20319
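Finally, Casey's per-shard trim generalizes to a loop over all 64
shards. A sketch using the period id from this thread -- and mind the
*However* caveat above, since each trim can stall the osd holding that
shard:

    # Hedged sketch: trim every mdlog shard of the named period.
    period=8d4fcb63-c314-4f9a-b3b3-0e61719ec258
    for shard in $(seq 0 63); do
        radosgw-admin mdlog trim --shard-id "$shard" \
            --period "$period" --end-time 2020-1-1
    done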