On Thu, Jun 22, 2017 at 5:31 PM, Casey Bodley <cbod...@redhat.com> wrote:
>
> On 06/22/2017 10:40 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley <cbod...@redhat.com> wrote:
>>>
>>> On 06/22/2017 04:00 AM, Dan van der Ster wrote:
>>>>
>>>> I'm now running the three relevant OSDs with that patch. (Recompiled,
>>>> replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
>>>> then restarted the osds).
>>>>
>>>> It's working quite well, trimming 10 entries at a time instead of
>>>> 1000, and no more timeouts.
>>>>
>>>> Do you think it would be worth decreasing this hardcoded value in ceph
>>>> proper?
>>>>
>>>> -- Dan
>>>
>>>
>>> I do, yeah. At least, the trim operation should be able to pass in its
>>> own
>>> value for that. I opened a ticket for that at
>>> http://tracker.ceph.com/issues/20382.
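For reference, a rough sketch of what a caller-supplied trim limit could look
like on the objclass side (this is not the actual cls_log code -- the op
struct, the log_trim name, and the exact cls_cxx_map_get_keys() /
cls_cxx_map_remove_key() signatures are assumptions):

    #include <cerrno>
    #include <cstdint>
    #include <set>
    #include <string>
    #include "objclass/objclass.h"

    // Hypothetical trim request: the caller picks the batch size instead of
    // relying on a compile-time constant (historically 1000 entries).
    struct cls_log_trim_op {
      std::string from_marker;
      std::string to_marker;
      uint64_t max_entries = 1000;
    };

    static int log_trim(cls_method_context_t hctx, const cls_log_trim_op& op)
    {
      std::set<std::string> keys;
      // Fetch at most max_entries omap keys starting after from_marker.
      int r = cls_cxx_map_get_keys(hctx, op.from_marker, op.max_entries, &keys);
      if (r < 0)
        return r;

      bool removed = false;
      for (const auto& key : keys) {
        if (!op.to_marker.empty() && key > op.to_marker)
          break;
        r = cls_cxx_map_remove_key(hctx, key);  // one omap key removal per entry
        if (r < 0)
          return r;
        removed = true;
      }
      return removed ? 0 : -ENODATA;
    }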
>>>
>>> I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE
>>> operation to trim a range of keys in a single osd op, instead of
>>> generating
>>> a different op for each key. I have a PR that does this at
>>> https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee
>>> that leveldb can process the entire range inside of the suicide timeout.
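And a sketch of the range-based variant (it assumes an objclass helper along
the lines of the cls_cxx_map_remove_range() proposed in that PR; the names
here are illustrative):

    #include <string>
    #include "objclass/objclass.h"

    // Instead of issuing one OMAP_RMKEY per entry, hand the whole
    // [from_marker, to_marker) range to the ObjectStore as a single
    // OP_OMAP_RMKEYRANGE. One osd op, no per-key round trips -- but leveldb
    // still has to walk every key in the range, hence the suicide-timeout
    // concern above.
    static int log_trim_range(cls_method_context_t hctx,
                              const std::string& from_marker,
                              const std::string& to_marker)
    {
      return cls_cxx_map_remove_range(hctx, from_marker, to_marker);
    }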
>>
>> I wonder if that would help. Here's what I've learned today:
>>
>>    * Two of the 3 relevant OSDs have something screwy with their leveldb. The primary and 3rd replica are only ~quick at trimming for a few hundred keys, while the 2nd OSD is always very fast.
>>    * After manually compacting the two slow OSDs, they are fast again for just a few hundred trims. So I'm compacting, trimming, ..., in a loop now.
>>    * I moved the omaps to SSDs -- it doesn't help (iostat confirms this is not IO bound).
>>    * CPU util on the slow OSDs gets quite high during the slow trimming.
>>    * perf top is below [1]. leveldb::Block::Iter::Prev and leveldb::InternalKeyComparator::Compare are notable.
>>    * The always-fast OSD shows no leveldb functions in perf top while trimming.
>>
>> I've tried bigger leveldb cache and block sizes, compression on and
>> off, and played with the bloom size up to 14 bits -- none of these
>> changes make any difference.
>>
>> At this point I'm not confident this trimming will ever complete --
>> there are ~20 million records to remove at maybe 1Hz.
>>
>> How about I just delete the meta.log object? Would this use a
>> different, perhaps quicker, code path to remove those omap keys?
>>
>> Thanks!
>>
>> Dan
>>
>> [1]
>>
>>     4.92%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023e8d
>>     4.47%  libc-2.17.so                             [.] __memcmp_sse4_1
>>     4.13%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x00000000000273bb
>>     3.81%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Prev
>>     3.07%  libc-2.17.so                             [.] __memcpy_ssse3_back
>>     2.84%  [kernel]                                 [k] port_inb
>>     2.77%  libstdc++.so.6.0.19                      [.] std::string::_M_mutate
>>     2.75%  libstdc++.so.6.0.19                      [.] std::string::append
>>     2.53%  libleveldb.so.1.0.7                      [.] leveldb::InternalKeyComparator::Compare
>>     1.32%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023e77
>>     0.85%  [kernel]                                 [k] _raw_spin_lock
>>     0.80%  libleveldb.so.1.0.7                      [.] leveldb::Block::Iter::Next
>>     0.77%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023a05
>>     0.67%  libleveldb.so.1.0.7                      [.] leveldb::MemTable::KeyComparator::operator()
>>     0.61%  libtcmalloc.so.4.2.6;5873e42b (deleted)  [.] 0x0000000000023a09
>>     0.58%  libleveldb.so.1.0.7                      [.] leveldb::MemTableIterator::Prev
>>     0.51%  [kernel]                                 [k] __schedule
>>     0.48%  libruby.so.2.1.0                         [.] ruby_yyparse
>
>
> Hi Dan,
>
> Removing an object will try to delete all of its keys at once, which should
> be much faster. It's also very likely to hit your suicide timeout, so you'll
> have to keep retrying until it stops killing your osd.

Well, that was quick. The object delete took around 30s. I then
restarted the osd to compact it, and now the leveldb is ~100MB. Phew!

In summary, if someone finds themselves in this predicament (a huge mdlog on a
single-region rgw cluster), I'd advise turning it off, then just deleting the
meta.log objects.
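
For anyone following the same route, here is a minimal librados sketch of the
deletion step. The ".log" pool name and the "meta.log." object prefix are
assumptions for an older single-region layout (on Jewel+ the pool is typically
"<zone>.rgw.log"); check with rados lspools and rados -p <pool> ls first.

    // build with: g++ -std=c++11 rm_mdlog.cc -lrados
    #include <iostream>
    #include <string>
    #include <rados/librados.hpp>

    int main()
    {
      librados::Rados rados;
      if (rados.init("admin") < 0)            // connect as client.admin
        return 1;
      rados.conf_read_file(nullptr);          // use the default ceph.conf
      if (rados.connect() < 0)
        return 1;

      librados::IoCtx log_pool;
      if (rados.ioctx_create(".log", log_pool) < 0)   // pool name is an assumption
        return 1;

      const std::string prefix = "meta.log.";
      for (auto it = log_pool.nobjects_begin(); it != log_pool.nobjects_end(); ++it) {
        const std::string oid = it->get_oid();
        if (oid.compare(0, prefix.size(), prefix) != 0)
          continue;
        std::cout << "removing " << oid << std::endl;
        // remove() deletes the object together with all of its omap keys in
        // one osd op -- the fast path described above.
        int r = log_pool.remove(oid);
        if (r < 0)
          std::cerr << "remove " << oid << " failed: " << r << std::endl;
      }
      rados.shutdown();
      return 0;
    }

A single remove per object replaces millions of per-key trims, which is why the
delete finished in ~30 seconds here; expect it to hammer leveldb just as hard
while it runs, so compact the OSDs afterwards as described above.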

Thanks!

Dan