With a bit of profiling, it seems all the memory is allocated by
ceph::logging::Log::create_entry (see below).
Should this be considered normal? Is it because some OSDs are down and it
logs the results of its osd_ping?
The OSD's debug levels are also shown below.
Thanks,
Adrien
$ pprof /usr/bin/ceph-osd osd.36.profile.0042-first.heap
Using local file /usr/bin/ceph-osd.
Using local file osd.36.profile.0042-first.heap.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 2468.0 MB
519.5 21.0% 21.0% 1000.4 40.5% std::_Rb_tree::_M_copy
481.0 19.5% 40.5% 481.0 19.5% std::_Rb_tree::_M_create_node
384.3 15.6% 56.1% 384.3 15.6% std::_Rb_tree::_M_emplace_hint_unique
374.0 15.2% 71.3% 374.0 15.2% ceph::buffer::create_aligned_in_mempool
305.6 12.4% 83.6% 305.6 12.4% ceph::logging::Log::create_entry
217.3 8.8% 92.4% 220.2 8.9% std::vector::_M_emplace_back_aux
114.4 4.6% 97.1% 115.6 4.7% ceph::buffer::list::append@a51900
21.4 0.9% 98.0% 210.0 8.5% OSD::heartbeat
21.2 0.9% 98.8% 21.2 0.9% std::string::_Rep::_S_create
4.7 0.2% 99.0% 266.2 10.8% AsyncConnection::send_message
(pprof)
$ pprof /usr/bin/ceph-osd --alloc_space osd.36.profile.0042-first.heap
Using local file /usr/bin/ceph-osd.
Using local file osd.36.profile.0042.heap.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 16519.3 MB
11915.7 72.1% 72.1% 11915.7 72.1% ceph::logging::Log::create_entry
745.0 4.5% 76.6% 745.0 4.5% std::string::_Rep::_S_create
716.9 4.3% 81.0% 716.9 4.3% ceph::buffer::create_aligned_in_mempool
700.2 4.2% 85.2% 703.8 4.3% std::vector::_M_emplace_back_aux
671.9 4.1% 89.3% 671.9 4.1% std::_Rb_tree::_M_emplace_hint_unique
557.8 3.4% 92.7% 1075.2 6.5% std::_Rb_tree::_M_copy
517.4 3.1% 95.8% 517.4 3.1% std::_Rb_tree::_M_create_node
468.2 2.8% 98.6% 469.6 2.8% ceph::buffer::list::append@a51900
122.8 0.7% 99.4% 122.8 0.7% ceph::buffer::list::append@a51d20
33.3 0.2% 99.6% 7042.7 42.6% OSD::heartbeat
(pprof)
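For reference, this is roughly how heap profiles like the ones above can be collected, using the tcmalloc heap profiler built into the OSD (a sketch; the exact dump filenames depend on the run, and `osd.36` is just our example daemon):

```shell
# Start the tcmalloc heap profiler on the OSD
# (requires ceph built against tcmalloc, which is the default)
ceph tell osd.36 heap start_profiler

# Let the OSD run for a while, then write a profile dump
# into the OSD's log directory
ceph tell osd.36 heap dump

# Stop profiling once enough samples have been collected
ceph tell osd.36 heap stop_profiler

# Analyze a dump: the default view shows memory currently in use,
# while --alloc_space shows cumulative allocations since profiling started
pprof /usr/bin/ceph-osd osd.36.profile.0042-first.heap
pprof /usr/bin/ceph-osd --alloc_space osd.36.profile.0042-first.heap
```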
$ ceph --admin-daemon /var/run/ceph/ceph-osd.36.asok config show | grep debug
"debug_none": "0/5",
"debug_lockdep": "0/1",
"debug_context": "0/1",
"debug_crush": "1/1",
"debug_mds": "1/5",
"debug_mds_balancer": "1/5",
"debug_mds_locker": "1/5",
"debug_mds_log": "1/5",
"debug_mds_log_expire": "1/5",
"debug_mds_migrator": "1/5",
"debug_buffer": "0/1",
"debug_timer": "0/1",
"debug_filer": "0/1",
"debug_striper": "0/1",
"debug_objecter": "0/1",
"debug_rados": "0/5",
"debug_rbd": "0/5",
"debug_rbd_mirror": "0/5",
"debug_rbd_replay": "0/5",
"debug_journaler": "0/5",
"debug_objectcacher": "0/5",
"debug_client": "0/5",
"debug_osd": "0/0",
"debug_optracker": "0/5",
"debug_objclass": "0/5",
"debug_filestore": "0/0",
"debug_journal": "0/0",
"debug_ms": "0/5",
"debug_mon": "1/5",
"debug_monc": "0/10",
"debug_paxos": "1/5",
"debug_tp": "0/5",
"debug_auth": "1/5",
"debug_crypto": "1/5",
"debug_finisher": "1/1",
"debug_reserver": "1/1",
"debug_heartbeatmap": "1/5",
"debug_perfcounter": "1/5",
"debug_rgw": "1/5",
"debug_civetweb": "1/10",
"debug_javaclient": "1/5",
"debug_asok": "1/5",
"debug_throttle": "1/1",
"debug_refs": "0/0",
"debug_xio": "1/5",
"debug_compressor": "1/5",
"debug_bluestore": "1/5",
"debug_bluefs": "1/5",
"debug_bdev": "1/3",
"debug_kstore": "1/5",
"debug_rocksdb": "4/5",
"debug_leveldb": "4/5",
"debug_memdb": "4/5",
"debug_kinetic": "1/5",
"debug_fuse": "1/5",
"debug_mgr": "1/5",
"debug_mgrc": "1/5",
"debug_dpdk": "1/5",
"debug_eventtrace": "1/5",
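If Log::create_entry really is the sink, one thing we might try (a sketch; my understanding is that the second number in each debug setting is the in-memory gather level, and log entries are still allocated at that level even when nothing is written to the log file) is zeroing the gather levels for the subsystems still gathering at 5 or 10:

```shell
# Zero both the file level and the in-memory gather level (x/y -> 0/0)
# for chatty subsystems; applies at runtime, no restart needed
ceph tell osd.36 injectargs '--debug_ms 0/0 --debug_monc 0/0 --debug_tp 0/0'

# The same change via the daemon's admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.36.asok config set debug_ms 0/0
```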
On Thu, Aug 23, 2018 at 4:27 PM Adrien Gillard <[email protected]>
wrote:
> After upgrading to Luminous, we see the exact same behaviour, with OSDs
> eating as much as 80-90 GB of memory.
>
> We'll try some memory profiling but at this point we're a bit lost. Are
> there any specific logs that could help us?
>
> On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard <[email protected]>
> wrote:
>
>> Well, after a few hours, still nothing new in the behaviour. With half of
>> the OSDs (so 6 per host) up and peering, and the nodown flag set to limit
>> the creation of new maps, all the memory is consumed and OSDs get killed by
>> the OOM killer.
>> We observe a lot of threads being created for each OSD (roughly between
>> 500 and 900). As there are around 200 PGs per OSD, does this seem like a
>> lot to you? We see tons being created each second, most probably to replace
>> the threads that time out or disappear with the OSD processes.
>>
>> We are preparing to upgrade the OSDs to Luminous.
>>
>> On Thu, Aug 23, 2018 at 4:05 AM Gregory Farnum <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard <[email protected]>
>>> wrote:
>>>
>>>> We 'paused' the cluster early in our investigation to avoid unnecessary
>>>> IO.
>>>> We also set the nodown flag but the OOM rate was really sustained and
>>>> we got servers that stopped responding from time to time, so we decided to
>>>> lower the number of OSDs up and let them peer.
>>>>
>>>
>>> Right but this often also means that they need to deal with changing
>>> intervals or trying to set up recovery and things, plus the change in which
>>> OSDs are up creates new maps that need to be processed. If you set “nodown”
>>> you can incrementally bring up OSDs while doing a better job of throttling
>>> the map creation process and preventing a bunch of new IO when/if they OOM.
>>>
>>>
>>>> I don't know if it is the best solution but it is the middle of the
>>>> night here, so we'll see tomorrow morning and ponder the completion of the
>>>> upgrade.
>>>>
>>>> I agree with you Greg, it does not make sense. The cluster was running
>>>> great for around 6 hours after the mons upgrade...
>>>>
>>>> Finally, regarding the cache-tiering, I know it is not used a lot nor
>>>> will it be maintained over time. One of the main goals of the Luminous
>>>> upgrade is to remove it completely and switch to EC RBD.
>>>>
>>>> Thank you for your inputs !
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 11:57 PM Gregory Farnum <[email protected]>
>>>> wrote:
>>>>
>>>>> On Wed, Aug 22, 2018 at 6:35 AM Adrien Gillard <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We are having a hard time figuring out a behaviour encountered after
>>>>>> upgrading the monitors of one of our clusters from Jewel to Luminous
>>>>>> yesterday.
>>>>>>
>>>>>> The cluster is composed of 14 OSD hosts (2xE5-2640 v3 and 64 GB of
>>>>>> RAM), each containing 12x4TB OSDs with journals on DC-grade SSDs. The
>>>>>> cluster was installed on Infernalis and upgraded to Jewel. We mainly
>>>>>> use it for RBD with a cache tier in front of an EC pool.
>>>>>>
>>>>>> We started the upgrade from Jewel 10.2.10 to Luminous 12.2.7
>>>>>> yesterday afternoon with the monitors and the provisioning of the MGR.
>>>>>> We were planning on upgrading the OSD servers today.
>>>>>> All went well until the cluster received the following commands, a
>>>>>> few hours after the upgrade:
>>>>>>
>>>>>> ceph osd pool set pool-cache-01 target_max_bytes 0
>>>>>> ceph osd pool set pool-cache-02 target_max_bytes 0
>>>>>>
>>>>>> (The context here is that we use this cluster for backups and set
>>>>>> target_max_bytes to 0 to stop the flush and give maximum resources to
>>>>>> the backup IOs.)
>>>>>>
>>>>>> Below is an extract from the logs of one of the monitors.
>>>>>> The cluster started exchanging a lot over the cluster network,
>>>>>> apparently backfill/recovery at multiple Gbit/s, in an amount which
>>>>>> we've never seen before. Due to some undersized network uplinks, the
>>>>>> traffic could not be handled well and this led to OSDs crashing.
>>>>>>
>>>>>
>>>>> This really doesn't make a ton of sense — you're saying that by
>>>>> telling the cluster to *stop* doing flushing IO, everything went bad? In
>>>>> terms of root causing the issue you'll obviously want to focus on that,
>>>>> although it may or may not be helpful in recovering things at this point.
>>>>>
>>>>> Do keep in mind that users running cache-tiering are increasingly
>>>>> uncommon, and while we still test it we aren't expanding that coverage and
>>>>> would really like to remove and replace that functionality, so you're out
>>>>> on the far edge of things here. :/
>>>>>
>>>>> Finally, if you're still experiencing OOM issues, you may want to set
>>>>> the "nodown" flag — that will prevent OSDs being marked down, which
>>>>> obviously will stall IO and peering if they go down anyway, but will
>>>>> prevent spamming map updates that cause the cluster to OOM again.
>>>>> -Greg
>>>>>
>>>>>
>>>>>>
>>>>>> At the moment we are still investigating the logs to try to
>>>>>> understand what happened exactly. The cluster is down. When we restart
>>>>>> the OSDs, PGs start activating and peering, but at some point all the
>>>>>> RAM and swap of the servers is consumed and the OSDs crash.
>>>>>> I tried to set the obvious cluster parameters
>>>>>> (osd_recovery_max_active=1, osd_recovery_threads=1, osd_max_backfills=1)
>>>>>> and norecover, nobackfill, but when the OSDs start, we still see 4 or 5
>>>>>> Gbit/s over the cluster network.
>>>>>>
>>>>>> We are currently reconfiguring the network to try to remove the
>>>>>> bandwidth problem and are preparing to add more RAM to the servers.
>>>>>>
>>>>>> But still, I have a couple of questions:
>>>>>>
>>>>>> - Has anyone encountered the issue we have with the cache? Or have a
>>>>>> possible explanation?
>>>>>> - Is it normal to observe so much traffic on the cluster network
>>>>>> when the PGs are activating/peering with norecover and nobackfill set?
>>>>>>
>>>>>> Thank you for reading and for your input !
>>>>>>
>>>>>>
>>>>>> 2018-08-21 21:05:02.766969 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 298 MB/s rd, 212 MB/s wr, 456 op/s
>>>>>> 2018-08-21 21:05:03.805124 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 296 MB/s rd, 193 MB/s wr, 433 op/s
>>>>>> 2018-08-21 21:05:03.870690 7f10d83cc700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader)
>>>>>> e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd
>>>>>> pool set", "pool": "pool-cache-02", "val": "0"} v 0
>>>>>> ) v1
>>>>>> 2018-08-21 21:05:03.870749 7f10d83cc700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146529236 -' entity='client.admin' cmd=[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02
>>>>>> ", "val": "0"}]: dispatch
>>>>>> 2018-08-21 21:05:03.913663 7f10d83cc700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader)
>>>>>> e6 handle_command mon_command({"prefix": "osd pool stats", "format":
>>>>>> "json"} v 0) v1
>>>>>> 2018-08-21 21:05:03.913745 7f10d83cc700 0 log_channel(audit) log
>>>>>> [DBG] : from='client.? 10.25.25.3:0/4101432770'
>>>>>> entity='client.admin' cmd=[{"prefix": "osd pool stats", "format":
>>>>>> "json"}]:
>>>>>> dispatch
>>>>>> 2018-08-21 21:05:04.042835 7f10d83cc700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader)
>>>>>> e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd
>>>>>> pool set", "pool": "pool-cache-01", "val": "0"} v 0
>>>>>> ) v1
>>>>>> 2018-08-21 21:05:04.042894 7f10d83cc700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146565481 -' entity='client.admin' cmd=[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01
>>>>>> ", "val": "0"}]: dispatch
>>>>>> 2018-08-21 21:05:04.813217 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 271 MB/s rd, 181 MB/s wr, 415 op/s
>>>>>> 2018-08-21 21:05:04.821486 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114693 e114693: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:04.821711 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114693 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:04.823032 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114693 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:04.823156 7f10de377700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146529236 -' entity='client.admin' cmd='[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>>>>>> 2", "val": "0"}]': finished
>>>>>> 2018-08-21 21:05:04.823280 7f10de377700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146565481 -' entity='client.admin' cmd='[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>>>>>> 1", "val": "0"}]': finished
>>>>>> 2018-08-21 21:05:04.825528 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114693: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:04.844932 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 275 MB/s rd, 192 MB/s wr, 429 op/s
>>>>>> 2018-08-21 21:05:05.873682 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 187 peering, 6021 active+clean; 261 TB data, 385
>>>>>> TB
>>>>>> used, 225 TB / 611 TB avail; 220 MB/s rd, 188 MB/s wr, 383
>>>>>> op/s
>>>>>> 2018-08-21 21:05:05.895414 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114694 e114694: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:05.895721 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114694 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:05.905473 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114694 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:05.940729 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114694: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:05.947921 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 204 peering, 6004 active+clean; 261 TB data, 385
>>>>>> TB
>>>>>> used, 225 TB / 611 TB avail; 201 MB/s rd, 200 MB/s wr, 355
>>>>>> op/s
>>>>>> 2018-08-21 21:05:06.912752 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114695 e114695: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:06.913104 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114695 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:06.928521 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114695 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:06.968793 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114695: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:07.001112 7f10d93ce700 1 heartbeat_map
>>>>>> reset_timeout 'Monitor::cpu_tp thread 0x7f10d93ce700' had timed out
>>>>>> after 0
>>>>>> 2018-08-21 21:05:07.004274 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 12 unknown, 114 remapped+peering, 5 activating,
>>>>>> 481
>>>>>> peering, 5596 active+clean; 261 TB data, 385 TB used, 225 T
>>>>>> B / 611 TB avail; 177 MB/s rd, 173 MB/s wr, 317 op/s
>>>>>> 2018-08-21 21:05:07.025200 7f10d8bcd700 1 heartbeat_map
>>>>>> reset_timeout 'Monitor::cpu_tp thread 0x7f10d8bcd700' had timed out
>>>>>> after 0
>>>>>> 2018-08-21 21:05:07.948138 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114696 e114696: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:07.948487 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114696 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:07.965671 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114696 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:08.007585 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114696: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:08.052153 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 46 activating+remapped, 15 unknown, 305
>>>>>> remapped+peering, 11 activating, 697 peering, 8 active+remapped+backfil
>>>>>> ling, 5126 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB
>>>>>> avail; 115 MB/s rd, 182 MB/s wr, 292 op/s; 26681/519327262 objects
>>>>>> misplaced (0.005%); 4856 kB/s, 3 objects/s recovering
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> [email protected]
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>