Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-23 Thread Dan van der Ster
On Thu, Jun 22, 2017 at 5:31 PM, Casey Bodley  wrote:
>
> On 06/22/2017 10:40 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley  wrote:
>>>
>>> On 06/22/2017 04:00 AM, Dan van der Ster wrote:

 I'm now running the three relevant OSDs with that patch. (Recompiled,
 replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
 then restarted the osds).

 It's working quite well, trimming 10 entries at a time instead of
 1000, and no more timeouts.

 Do you think it would be worth decreasing this hardcoded value in ceph
 proper?

 -- Dan
>>>
>>>
>>> I do, yeah. At least, the trim operation should be able to pass in its
>>> own
>>> value for that. I opened a ticket for that at
>>> http://tracker.ceph.com/issues/20382.
>>>
>>> I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
>>> operation to trim a range of keys in a single osd op, instead of
>>> generating
>>> a different op for each key. I have a PR that does this at
>>> https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee
>>> that leveldb can process the entire range inside of the suicide timeout.
>>
>> I wonder if that would help. Here's what I've learned today:
>>
>>* two of the 3 relevant OSDs have something screwy with their
>> leveldb. The primary and 3rd replica are ~quick at trimming for only a
>> few hundred keys, whilst the 2nd OSD is always very fast.
>>* After manually compacting the two slow OSDs, they are fast again
>> for just a few hundred trims. So I'm compacting, trimming, ..., in a
>> loop now.
>>* I moved the omaps to SSDs -- doesn't help. (iostat confirms this
>> is not IO bound).
>>* CPU util on the slow OSDs gets quite high during the slow trimming.
>>* perf top is below [1]. leveldb::Block::Iter::Prev and
>> leveldb::InternalKeyComparator::Compare are notable.
>>* The always fast OSD shows no leveldb functions in perf top while
>> trimming.
>>
>> I've tried bigger leveldb cache and block sizes, compression on and
>> off, and played with the bloom size up to 14 bits -- none of these
>> changes make any difference.
>>
>> At this point I'm not confident this trimming will ever complete --
>> there are ~20 million records to remove at maybe 1Hz.
>>
>> How about I just delete the meta.log object? Would this use a
>> different, perhaps quicker, code path to remove those omap keys?
>>
>> Thanks!
>>
>> Dan
>>
>> [1]
>>
>> 4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e8d
>> 4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
>> 4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x000273bb
>> 3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
>> 3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
>> 2.84%  [kernel]                                 [k] port_inb
>> 2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
>> 2.75%  libstdc++.so.6.0.19                      [.] std::string::append
>> 2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
>> 1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e77
>> 0.85%  [kernel]                                 [k] _raw_spin_lock
>> 0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
>> 0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a05
>> 0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
>> 0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a09
>> 0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
>> 0.51%  [kernel]                                 [k] __schedule
>> 0.48%  libruby.so.2.1.0                         [.] ruby_yyparse
>
>
> Hi Dan,
>
> Removing an object will try to delete all of its keys at once, which should
> be much faster. It's also very likely to hit your suicide timeout, so you'll
> have to keep retrying until it stops killing your osd.

Well, that was quick. The object delete took around 30s. I then
restarted the osd to compact it, and now the leveldb is ~100MB. Phew!

In summary, if someone finds themselves in this predicament (a huge
mdlog on a single-region rgw cluster), I'd advise turning off log_meta
and then just deleting the meta.log objects.
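
For anyone following the same route, a rough sketch of that last step (the
log pool name here is an assumption -- check "log_pool" in 'radosgw-admin
zone get'; on older single-zone setups it is often just ".log"):

$ rados -p .log ls | grep '^meta.log\.'   # the 64 mdlog shard objects
$ rados -p .log rm meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54   # repeat per shard

Do this only after log_meta is disabled, otherwise rgw will start
repopulating the shards.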

Thanks!

Dan


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-22 Thread Casey Bodley


On 06/22/2017 10:40 AM, Dan van der Ster wrote:

On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley  wrote:

On 06/22/2017 04:00 AM, Dan van der Ster wrote:

I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds).

It's working quite well, trimming 10 entries at a time instead of
1000, and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph
proper?

-- Dan


I do, yeah. At least, the trim operation should be able to pass in its own
value for that. I opened a ticket for that at
http://tracker.ceph.com/issues/20382.

I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
operation to trim a range of keys in a single osd op, instead of generating
a different op for each key. I have a PR that does this at
https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee
that leveldb can process the entire range inside of the suicide timeout.

I wonder if that would help. Here's what I've learned today:

   * two of the 3 relevant OSDs have something screwy with their
leveldb. The primary and 3rd replica are ~quick at trimming for only a
few hundred keys, whilst the 2nd OSD is always very fast.
   * After manually compacting the two slow OSDs, they are fast again
for just a few hundred trims. So I'm compacting, trimming, ..., in a
loop now.
   * I moved the omaps to SSDs -- doesn't help. (iostat confirms this
is not IO bound).
   * CPU util on the slow OSDs gets quite high during the slow trimming.
   * perf top is below [1]. leveldb::Block::Iter::Prev and
leveldb::InternalKeyComparator::Compare are notable.
   * The always fast OSD shows no leveldb functions in perf top while trimming.

I've tried bigger leveldb cache and block sizes, compression on and
off, and played with the bloom size up to 14 bits -- none of these
changes make any difference.

At this point I'm not confident this trimming will ever complete --
there are ~20 million records to remove at maybe 1Hz.

How about I just delete the meta.log object? Would this use a
different, perhaps quicker, code path to remove those omap keys?

Thanks!

Dan

[1]

   4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e8d
   4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
   4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x000273bb
   3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
   3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
   2.84%  [kernel]                                 [k] port_inb
   2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
   2.75%  libstdc++.so.6.0.19                      [.] std::string::append
   2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
   1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e77
   0.85%  [kernel]                                 [k] _raw_spin_lock
   0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
   0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a05
   0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
   0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a09
   0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
   0.51%  [kernel]                                 [k] __schedule
   0.48%  libruby.so.2.1.0                         [.] ruby_yyparse


Hi Dan,

Removing an object will try to delete all of its keys at once, which 
should be much faster. It's also very likely to hit your suicide 
timeout, so you'll have to keep retrying until it stops killing your osd.
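
A minimal sketch of that retry loop (the pool and object names are taken
from the slow-request log earlier in the thread and are assumptions for any
other cluster; the rm may also simply block and be resent while the osd
restarts):

$ until rados -p .log rm meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54; do
      sleep 60   # osd likely hit its suicide timeout; wait and retry
  done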



Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-22 Thread Dan van der Ster
On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley  wrote:
>
> On 06/22/2017 04:00 AM, Dan van der Ster wrote:
>>
>> I'm now running the three relevant OSDs with that patch. (Recompiled,
>> replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
>> then restarted the osds).
>>
>> It's working quite well, trimming 10 entries at a time instead of
>> 1000, and no more timeouts.
>>
>> Do you think it would be worth decreasing this hardcoded value in ceph
>> proper?
>>
>> -- Dan
>
>
> I do, yeah. At least, the trim operation should be able to pass in its own
> value for that. I opened a ticket for that at
> http://tracker.ceph.com/issues/20382.
>
> I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
> operation to trim a range of keys in a single osd op, instead of generating
> a different op for each key. I have a PR that does this at
> https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee
> that leveldb can process the entire range inside of the suicide timeout.

I wonder if that would help. Here's what I've learned today:

  * two of the 3 relevant OSDs have something screwy with their
leveldb. The primary and 3rd replica are ~quick at trimming for only a
few hundred keys, whilst the 2nd OSD is always very fast.
  * After manually compacting the two slow OSDs, they are fast again
for just a few hundred trims. So I'm compacting, trimming, ..., in a
loop now.
  * I moved the omaps to SSDs -- doesn't help. (iostat confirms this
is not IO bound).
  * CPU util on the slow OSDs gets quite high during the slow trimming.
  * perf top is below [1]. leveldb::Block::Iter::Prev and
leveldb::InternalKeyComparator::Compare are notable.
  * The always fast OSD shows no leveldb functions in perf top while trimming.

I've tried bigger leveldb cache and block sizes, compression on and
off, and played with the bloom size up to 14 bits -- none of these
changes make any difference.
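
For reference, those settings were of this general form in ceph.conf (the
values below are illustrative, not the exact ones tried, and each needs an
osd restart to take effect):

[osd]
    leveldb cache size = 536870912
    leveldb block size = 65536
    leveldb compression = false
    leveldb bloom size = 14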

At this point I'm not confident this trimming will ever complete --
there are ~20 million records to remove at maybe 1Hz.

How about I just delete the meta.log object? Would this use a
different, perhaps quicker, code path to remove those omap keys?

Thanks!

Dan

[1]

   4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e8d
   4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
   4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x000273bb
   3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
   3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
   2.84%  [kernel]                                 [k] port_inb
   2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
   2.75%  libstdc++.so.6.0.19                      [.] std::string::append
   2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
   1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023e77
   0.85%  [kernel]                                 [k] _raw_spin_lock
   0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
   0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a05
   0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
   0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00023a09
   0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
   0.51%  [kernel]                                 [k] __schedule
   0.48%  libruby.so.2.1.0                         [.] ruby_yyparse


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-22 Thread Casey Bodley


On 06/22/2017 04:00 AM, Dan van der Ster wrote:

I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds).

It's working quite well, trimming 10 entries at a time instead of
1000, and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph proper?

-- Dan


I do, yeah. At least, the trim operation should be able to pass in its 
own value for that. I opened a ticket for that at 
http://tracker.ceph.com/issues/20382.


I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE 
operation to trim a range of keys in a single osd op, instead of 
generating a different op for each key. I have a PR that does this at 
https://github.com/ceph/ceph/pull/15183. But it's still hard to 
guarantee that leveldb can process the entire range inside of the 
suicide timeout.


Casey




On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley  wrote:

That patch looks reasonable. You could also try raising the values of
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on
that osd in order to trim more at a time.


On 06/21/2017 09:27 AM, Dan van der Ster wrote:

Hi Casey,

I managed to trim up all shards except for that big #54. The others
all trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any
way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
per op locks things up for quite awhile.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx,
bufferlist *in, bufferlist *o
   to_index = op.to_marker;
 }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
 size_t max_entries = MAX_TRIM_ENTRIES;

 int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix,
max_entries, );


What do you think?

-- Dan




On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley  wrote:

Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this
osd
(you would see a lot of 'leveldb:' output in the log), consider running
offline compaction by adding 'leveldb compact on mount = true' to the osd
config and restarting.

Casey


On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley 
wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we setup to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:

 1. AFAIU, it is used by the rgw multisite feature. Is it safe to
turn
it off when not using multisite?


It's a good idea to turn that off, yes.

First, make sure that you have configured a default
realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm   (you can
determine
realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default


Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.


Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take
care
with future radosgw-admin commands on the zone/zonegroup, as they may
revert
log_meta back to true [1].


Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.


 2. I 

Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-22 Thread Dan van der Ster
On Wed, Jun 21, 2017 at 4:16 PM, Peter Maloney
 wrote:
> On 06/14/17 11:59, Dan van der Ster wrote:
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>> deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>> request 480.140904 seconds old, received at 2017-06-14
>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>> scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>> deep-scrub ok
>
> This looks just like my problem in my thread on ceph-devel "another
> scrub bug? blocked for > 10240.948831 secs" except that your scrub
> eventually finished (mine ran hours before I stopped it manually), and
> I'm not using rgw.
>
> Sage commented that it is likely related to snaps being removed at some
> point and interacting with scrub.
>
> Restarting the osd that is mentioned there (osd.155 in  your case) will
> fix it for now. And tuning scrub changes the way it behaves (defaults
> make it happen more rarely than what I had before).

In my case it's not related to snaps -- there are no snaps (or
trimming) in a (normal) set of rgw pools.

My problem is about the cls_log class, which tries to do a lot of work
in one op, timing out the osds.

Well, the *real* problem in my case is about this rgw mdlog, which can
grow unboundedly, then eventually become un-scrubbable, leading to
this huge amount of cleanup to be done.
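
For scale, one way to count the entries left in a shard (assuming the log
pool is ".log"; the object name is the one from the slow-request logs):

$ rados -p .log listomapkeys meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 | wc -l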

-- dan


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-22 Thread Dan van der Ster
I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds).
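
Roughly, on each of the three OSD hosts that was something like the
following (the build-tree path and osd id are placeholders; the
rados-classes path is the real one):

# cp /path/to/build/lib/libcls_log.so /usr/lib64/rados-classes/libcls_log.so
# systemctl restart ceph-osd@155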

It's working quite well, trimming 10 entries at a time instead of
1000, and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph proper?

-- Dan


On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley  wrote:
> That patch looks reasonable. You could also try raising the values of
> osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on
> that osd in order to trim more at a time.
>
>
> On 06/21/2017 09:27 AM, Dan van der Ster wrote:
>>
>> Hi Casey,
>>
>> I managed to trim up all shards except for that big #54. The others
>> all trimmed within a few seconds.
>>
>> But 54 is proving difficult. It's still going after several days, and
>> now I see that the 1000-key trim is indeed causing osd timeouts. I've
>> manually compacted the relevant osd leveldbs, but haven't found any
>> way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
>> per op locks things up for quite awhile.
>>
>> I'm thinking of running those ceph-osd's with this patch:
>>
>> # git diff
>> diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
>> index 89745bb..7dcd933 100644
>> --- a/src/cls/log/cls_log.cc
>> +++ b/src/cls/log/cls_log.cc
>> @@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx,
>> bufferlist *in, bufferlist *o
>>   to_index = op.to_marker;
>> }
>>
>> -#define MAX_TRIM_ENTRIES 1000
>> +#define MAX_TRIM_ENTRIES 10
>> size_t max_entries = MAX_TRIM_ENTRIES;
>>
>> int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix,
>> max_entries, );
>>
>>
>> What do you think?
>>
>> -- Dan
>>
>>
>>
>>
>> On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley  wrote:
>>>
>>> Hi Dan,
>>>
>>> That's good news that it can remove 1000 keys at a time without hitting
>>> timeouts. The output of 'du' will depend on when the leveldb compaction
>>> runs. If you do find that compaction leads to suicide timeouts on this
>>> osd
>>> (you would see a lot of 'leveldb:' output in the log), consider running
>>> offline compaction by adding 'leveldb compact on mount = true' to the osd
>>> config and restarting.
>>>
>>> Casey
>>>
>>>
>>> On 06/19/2017 11:01 AM, Dan van der Ster wrote:

 On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley 
 wrote:
>
> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>> deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>> request 480.140904 seconds old, received at 2017-06-14
>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>> scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>> deep-scrub ok
>>
>> We have log_meta: true, log_data: false on this (our only) region [1],
>> which IIRC we setup to enable indexless buckets.
>>
>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>> few questions:
>>
>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to
>> turn
>> it off when not using multisite?
>
>
> It's a good idea to turn that off, yes.
>
> First, make sure that you have configured a default
> realm/zonegroup/zone:
>
> $ radosgw-admin realm default --rgw-realm   (you can
> determine
> realm name from 'radosgw-admin realm list')
> $ radosgw-admin zonegroup default --rgw-zonegroup default
> $ radosgw-admin zone default --rgw-zone default
>
 Thanks. This had already been done, as confirmed with radosgw-admin
 realm get-default.

> Then you can modify the zonegroup (aka region):
>
> $ radosgw-admin zonegroup get > zonegroup.json
> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
> $ radosgw-admin zonegroup set < zonegroup.json
>
> Then commit the updated period configuration:
>
> $ radosgw-admin period update --commit
>
> Verify that the resulting period contains "log_meta": "false". Take
> care
> with future radosgw-admin commands on the zone/zonegroup, as they may
> revert
> log_meta back to true [1].
>
 Great, this worked. FYI (and for others trying this in future), the
 period update --commit blocks all rgws for ~30s while they reload the
 realm.


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Peter Maloney
On 06/14/17 11:59, Dan van der Ster wrote:
> Dear ceph users,
>
> Today we had O(100) slow requests which were caused by deep-scrubbing
> of the metadata log:
>
> 2017-06-14 11:07:55.373184 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
> deep-scrub starts
> ...
> 2017-06-14 11:22:04.143903 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
> request 480.140904 seconds old, received at 2017-06-14
> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
> scrub
> ...
> 2017-06-14 11:22:06.729306 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
> deep-scrub ok

This looks just like my problem in my thread on ceph-devel "another
scrub bug? blocked for > 10240.948831 secs" except that your scrub
eventually finished (mine ran hours before I stopped it manually), and
I'm not using rgw.

Sage commented that it is likely related to snaps being removed at some
point and interacting with scrub.

Restarting the osd that is mentioned there (osd.155 in  your case) will
fix it for now. And tuning scrub changes the way it behaves (defaults
make it happen more rarely than what I had before).
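
For example (hedged -- the values are only illustrative, not a
recommendation):

# systemctl restart ceph-osd@155
$ ceph tell 'osd.*' injectargs '--osd-scrub-sleep 0.1 --osd-deep-scrub-interval 2419200'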


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de




Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Casey Bodley
That patch looks reasonable. You could also try raising the values of 
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on 
that osd in order to trim more at a time.
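
For example (the values are arbitrary; if the change doesn't apply at
runtime, set it in ceph.conf and restart the osd instead):

$ ceph tell osd.155 injectargs \
      '--osd-op-thread-suicide-timeout 600 --filestore-op-thread-suicide-timeout 600'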


On 06/21/2017 09:27 AM, Dan van der Ster wrote:

Hi Casey,

I managed to trim up all shards except for that big #54. The others
all trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any
way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
per op locks things up for quite awhile.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx,
bufferlist *in, bufferlist *o
  to_index = op.to_marker;
}

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
size_t max_entries = MAX_TRIM_ENTRIES;

int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix,
max_entries, );


What do you think?

-- Dan




On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley  wrote:

Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this osd
(you would see a lot of 'leveldb:' output in the log), consider running
offline compaction by adding 'leveldb compact on mount = true' to the osd
config and restarting.

Casey


On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley  wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we setup to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?


It's a good idea to turn that off, yes.

First, make sure that you have configured a default realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm   (you can
determine
realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default


Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.


Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take care
with future radosgw-admin commands on the zone/zonegroup, as they may
revert
log_meta back to true [1].


Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.


2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json
and I don't know how much more it would have written. Is something
supposed to be trimming the mdlog automatically?


There is automated mdlog trimming logic in master, but not jewel/kraken.
And
this logic won't be triggered if there is only one zone [2].


3. ceph df doesn't show the space occupied by omap objects -- is
there an indirect way to see how much space these are using?


You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap


Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.


4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?


There is a 'radosgw-admin mdlog trim' command that can be used to trim
them
one 

Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Dan van der Ster
Hi Casey,

I managed to trim up all shards except for that big #54. The others
all trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any
way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
per op locks things up for quite awhile.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx,
bufferlist *in, bufferlist *o
 to_index = op.to_marker;
   }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
   size_t max_entries = MAX_TRIM_ENTRIES;

   int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix,
max_entries, );


What do you think?

-- Dan




On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley  wrote:
> Hi Dan,
>
> That's good news that it can remove 1000 keys at a time without hitting
> timeouts. The output of 'du' will depend on when the leveldb compaction
> runs. If you do find that compaction leads to suicide timeouts on this osd
> (you would see a lot of 'leveldb:' output in the log), consider running
> offline compaction by adding 'leveldb compact on mount = true' to the osd
> config and restarting.
>
> Casey
>
>
> On 06/19/2017 11:01 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley  wrote:
>>>
>>> On 06/14/2017 05:59 AM, Dan van der Ster wrote:

 Dear ceph users,

 Today we had O(100) slow requests which were caused by deep-scrubbing
 of the metadata log:

 2017-06-14 11:07:55.373184 osd.155
 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
 deep-scrub starts
 ...
 2017-06-14 11:22:04.143903 osd.155
 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
 request 480.140904 seconds old, received at 2017-06-14
 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
 meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
 scrub
 ...
 2017-06-14 11:22:06.729306 osd.155
 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
 deep-scrub ok

 We have log_meta: true, log_data: false on this (our only) region [1],
 which IIRC we setup to enable indexless buckets.

 I'm obviously unfamiliar with rgw meta and data logging, and have a
 few questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
 it off when not using multisite?
>>>
>>>
>>> It's a good idea to turn that off, yes.
>>>
>>> First, make sure that you have configured a default realm/zonegroup/zone:
>>>
>>> $ radosgw-admin realm default --rgw-realm   (you can
>>> determine
>>> realm name from 'radosgw-admin realm list')
>>> $ radosgw-admin zonegroup default --rgw-zonegroup default
>>> $ radosgw-admin zone default --rgw-zone default
>>>
>> Thanks. This had already been done, as confirmed with radosgw-admin
>> realm get-default.
>>
>>> Then you can modify the zonegroup (aka region):
>>>
>>> $ radosgw-admin zonegroup get > zonegroup.json
>>> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
>>> $ radosgw-admin zonegroup set < zonegroup.json
>>>
>>> Then commit the updated period configuration:
>>>
>>> $ radosgw-admin period update --commit
>>>
>>> Verify that the resulting period contains "log_meta": "false". Take care
>>> with future radosgw-admin commands on the zone/zonegroup, as they may
>>> revert
>>> log_meta back to true [1].
>>>
>> Great, this worked. FYI (and for others trying this in future), the
>> period update --commit blocks all rgws for ~30s while they reload the
>> realm.
>>
2. I started dumping the output of radosgw-admin mdlog list, and
 cancelled it after a few minutes. It had already dumped 3GB of json
 and I don't know how much more it would have written. Is something
 supposed to be trimming the mdlog automatically?
>>>
>>>
>>> There is automated mdlog trimming logic in master, but not jewel/kraken.
>>> And
>>> this logic won't be triggered if there is only one zone [2].
>>>
3. ceph df doesn't show the space occupied by omap objects -- is
 there an indirect way to see how much space these are using?
>>>
>>>
>>> You can inspect the osd's omap directory: du -sh
>>> /var/lib/ceph/osd/osd0/current/omap
>>>
>> Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
>> ~100-300MB on other OSDs.
>>
4. mdlog status has markers going back to 2016-10, see [2]. I suppose
 we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I 

Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-19 Thread Casey Bodley

Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting 
timeouts. The output of 'du' will depend on when the leveldb compaction 
runs. If you do find that compaction leads to suicide timeouts on this 
osd (you would see a lot of 'leveldb:' output in the log), consider 
running offline compaction by adding 'leveldb compact on mount = true' 
to the osd config and restarting.
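
That is, roughly (scoped here to the one osd; an [osd] section would apply
it everywhere):

[osd.155]
    leveldb compact on mount = true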


Casey

On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley  wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we setup to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:

   1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?


It's a good idea to turn that off, yes.

First, make sure that you have configured a default realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm   (you can determine
realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default


Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.


Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take care
with future radosgw-admin commands on the zone/zonegroup, as they may revert
log_meta back to true [1].


Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.


   2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json
and I don't know how much more it would have written. Is something
supposed to be trimming the mdlog automatically?


There is automated mdlog trimming logic in master, but not jewel/kraken. And
this logic won't be triggered if there is only one zone [2].


   3. ceph df doesn't show the space occupied by omap objects -- is
there an indirect way to see how much space these are using?


You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap


Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.


   4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

   5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?


There is a 'radosgw-admin mdlog trim' command that can be used to trim them
one --shard-id (from 0 to 63) at a time. An entire log shard can be trimmed
with:

$ radosgw-admin mdlog trim --shard-id 0 --period
8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1

*However*, there is a risk that bulk operations on large omaps will affect
cluster health by taking down OSDs. Not only can this bulk deletion take
long enough to trigger the osd/filestore suicide timeouts, the resulting
leveldb compaction after deletion is likely to block other omap operations
and hit the timeouts as well. This seems likely in your case, based on the
fact that you're already having issues with scrub.

We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:

2017-06-19 16:57:34.347222 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::met
a.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:h
ead# not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore  > header.spos 0.0.0
2017-06-19 16:57:34.347351 7fc602640700 

Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-19 Thread Dan van der Ster
On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley  wrote:
>
> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>> deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>> request 480.140904 seconds old, received at 2017-06-14
>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>> scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>> deep-scrub ok
>>
>> We have log_meta: true, log_data: false on this (our only) region [1],
>> which IIRC we setup to enable indexless buckets.
>>
>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>> few questions:
>>
>>   1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
>> it off when not using multisite?
>
>
> It's a good idea to turn that off, yes.
>
> First, make sure that you have configured a default realm/zonegroup/zone:
>
> $ radosgw-admin realm default --rgw-realm   (you can determine
> realm name from 'radosgw-admin realm list')
> $ radosgw-admin zonegroup default --rgw-zonegroup default
> $ radosgw-admin zone default --rgw-zone default
>

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

> Then you can modify the zonegroup (aka region):
>
> $ radosgw-admin zonegroup get > zonegroup.json
> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
> $ radosgw-admin zonegroup set < zonegroup.json
>
> Then commit the updated period configuration:
>
> $ radosgw-admin period update --commit
>
> Verify that the resulting period contains "log_meta": "false". Take care
> with future radosgw-admin commands on the zone/zonegroup, as they may revert
> log_meta back to true [1].
>

Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.

>>
>>   2. I started dumping the output of radosgw-admin mdlog list, and
>> cancelled it after a few minutes. It had already dumped 3GB of json
>> and I don't know how much more it would have written. Is something
>> supposed to be trimming the mdlog automatically?
>
>
> There is automated mdlog trimming logic in master, but not jewel/kraken. And
> this logic won't be triggered if there is only one zone [2].
>
>>
>>   3. ceph df doesn't show the space occupied by omap objects -- is
>> there an indirect way to see how much space these are using?
>
>
> You can inspect the osd's omap directory: du -sh
> /var/lib/ceph/osd/osd0/current/omap
>

Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.

>>   4. mdlog status has markers going back to 2016-10, see [2]. I suppose
>> we're not using this feature correctly? :-/
>>
>>   5. Suppose I were to set log_meta: false -- how would I delete these
>> log entries now that they are not needed?
>
>
> There is a 'radosgw-admin mdlog trim' command that can be used to trim them
> one --shard-id (from 0 to 63) at a time. An entire log shard can be trimmed
> with:
>
> $ radosgw-admin mdlog trim --shard-id 0 --period
> 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
>
> *However*, there is a risk that bulk operations on large omaps will affect
> cluster health by taking down OSDs. Not only can this bulk deletion take
> long enough to trigger the osd/filestore suicide timeouts, the resulting
> leveldb compaction after deletion is likely to block other omap operations
> and hit the timeouts as well. This seems likely in your case, based on the
> fact that you're already having issues with scrub.

We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:

2017-06-19 16:57:34.347222 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::met
a.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:h
ead# not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore  > header.spos 0.0.0
2017-06-19 16:57:34.347351 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::met
a.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:h
ead# 

Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-15 Thread Casey Bodley


On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we setup to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:

  1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?


It's a good idea to turn that off, yes.

First, make sure that you have configured a default realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm   (you can 
determine realm name from 'radosgw-admin realm list')

$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default

Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take care 
with future radosgw-admin commands on the zone/zonegroup, as they may 
revert log_meta back to true [1].
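
A quick way to check, for example:

$ radosgw-admin period get | grep log_meta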




  2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json
and I don't know how much more it would have written. Is something
supposed to be trimming the mdlog automatically?


There is automated mdlog trimming logic in master, but not jewel/kraken. 
And this logic won't be triggered if there is only one zone [2].




  3. ceph df doesn't show the space occupied by omap objects -- is
there an indirect way to see how much space these are using?


You can inspect the osd's omap directory: du -sh 
/var/lib/ceph/osd/osd0/current/omap
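
or, to compare all osds on a host at once, for example:

$ du -sh /var/lib/ceph/osd/ceph-*/current/omap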




  4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

  5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?


There is a 'radosgw-admin mdlog trim' command that can be used to trim 
them one --shard-id (from 0 to 63) at a time. An entire log shard can be 
trimmed with:


$ radosgw-admin mdlog trim --shard-id 0 --period 
8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1


*However*, there is a risk that bulk operations on large omaps will 
affect cluster health by taking down OSDs. Not only can this bulk 
deletion take long enough to trigger the osd/filestore suicide timeouts, 
the resulting leveldb compaction after deletion is likely to block other 
omap operations and hit the timeouts as well. This seems likely in your 
case, based on the fact that you're already having issues with scrub.




Apologies if there are already good docs about this, which eluded my googling.

Best Regards,
Dan


[1] region get:

{
    "id": "61c0ff1a-4330-405a-9eb1-bb494d4daf82",
    "name": "default",
    "api_name": "default",
    "is_master": "true",
    "endpoints": [],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "61c59385-085d-4caa-9070-63a3868dccb6",
    "zones": [
        {
            "id": "61c59385-085d-4caa-9070-63a3868dccb6",
            "name": "default",
            "endpoints": [],
            "log_meta": "true",
            "log_data": "false",
            "bucket_index_max_shards": 32,
            "read_only": "false"
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        },
        {
            "name": "indexless",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "552868ad-8898-4afb-a775-911297961cee"
}

[2] mdlog status:

No --period given, using current period=8d4fcb63-c314-4f9a-b3b3-0e61719ec258
[
...
    {
        "marker": "1_1475568296.712634_3.1",
        "last_update": "2016-10-04 08:04:56.712634Z"
    },
...


[1] http://tracker.ceph.com/issues/20320
[2] http://tracker.ceph.com/issues/20319