With a bit of profiling, it seems all the memory is allocated by
ceph::logging::Log::create_entry (see below).
Should this be considered normal? Is it because some OSDs are down and it
logs the results of its osd_ping?
The OSD's debug levels are also shown below.
Thanks,
Adrien
$ pprof /usr/bin/ceph-osd osd.36.profile.0042-first.heap
Using local file /usr/bin/ceph-osd.
Using local file osd.36.profile.0042-first.heap.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 2468.0 MB
519.5 21.0% 21.0% 1000.4 40.5% std::_Rb_tree::_M_copy
481.0 19.5% 40.5% 481.0 19.5% std::_Rb_tree::_M_create_node
384.3 15.6% 56.1% 384.3 15.6% std::_Rb_tree::_M_emplace_hint_unique
374.0 15.2% 71.3% 374.0 15.2% ceph::buffer::create_aligned_in_mempool
305.6 12.4% 83.6% 305.6 12.4% ceph::logging::Log::create_entry
217.3 8.8% 92.4% 220.2 8.9% std::vector::_M_emplace_back_aux
114.4 4.6% 97.1% 115.6 4.7% ceph::buffer::list::append@a51900
21.4 0.9% 98.0% 210.0 8.5% OSD::heartbeat
21.2 0.9% 98.8% 21.2 0.9% std::string::_Rep::_S_create
4.7 0.2% 99.0% 266.2 10.8% AsyncConnection::send_message
(pprof)
$ pprof /usr/bin/ceph-osd --alloc_space osd.36.profile.0042-first.heap
Using local file /usr/bin/ceph-osd.
Using local file osd.36.profile.0042.heap.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 16519.3 MB
11915.7 72.1% 72.1% 11915.7 72.1% ceph::logging::Log::create_entry
745.0 4.5% 76.6% 745.0 4.5% std::string::_Rep::_S_create
716.9 4.3% 81.0% 716.9 4.3% ceph::buffer::create_aligned_in_mempool
700.2 4.2% 85.2% 703.8 4.3% std::vector::_M_emplace_back_aux
671.9 4.1% 89.3% 671.9 4.1% std::_Rb_tree::_M_emplace_hint_unique
557.8 3.4% 92.7% 1075.2 6.5% std::_Rb_tree::_M_copy
517.4 3.1% 95.8% 517.4 3.1% std::_Rb_tree::_M_create_node
468.2 2.8% 98.6% 469.6 2.8% ceph::buffer::list::append@a51900
122.8 0.7% 99.4% 122.8 0.7% ceph::buffer::list::append@a51d20
33.3 0.2% 99.6% 7042.7 42.6% OSD::heartbeat
(pprof)
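For reference, this is roughly how heap profiles like the ones above can be collected, using the tcmalloc heap profiler built into the OSD (a sketch; the exact dump filenames depend on the run, and `osd.36` is just our example daemon):

```shell
# Start the tcmalloc heap profiler on the OSD
# (requires ceph built against tcmalloc, which is the default)
ceph tell osd.36 heap start_profiler

# Let the OSD run for a while, then write a profile dump
# into the OSD's log directory
ceph tell osd.36 heap dump

# Stop profiling once enough samples have been collected
ceph tell osd.36 heap stop_profiler

# Analyze a dump: the default view shows memory currently in use,
# while --alloc_space shows cumulative allocations since profiling started
pprof /usr/bin/ceph-osd osd.36.profile.0042-first.heap
pprof /usr/bin/ceph-osd --alloc_space osd.36.profile.0042-first.heap
```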
$ ceph --admin-daemon /var/run/ceph/ceph-osd.36.asok config show | grep debug
"debug_none": "0/5",
"debug_lockdep": "0/1",
"debug_context": "0/1",
"debug_crush": "1/1",
"debug_mds": "1/5",
"debug_mds_balancer": "1/5",
"debug_mds_locker": "1/5",
"debug_mds_log": "1/5",
"debug_mds_log_expire": "1/5",
"debug_mds_migrator": "1/5",
"debug_buffer": "0/1",
"debug_timer": "0/1",
"debug_filer": "0/1",
"debug_striper": "0/1",
"debug_objecter": "0/1",
"debug_rados": "0/5",
"debug_rbd": "0/5",
"debug_rbd_mirror": "0/5",
"debug_rbd_replay": "0/5",
"debug_journaler": "0/5",
"debug_objectcacher": "0/5",
"debug_client": "0/5",
"debug_osd": "0/0",
"debug_optracker": "0/5",
"debug_objclass": "0/5",
"debug_filestore": "0/0",
"debug_journal": "0/0",
"debug_ms": "0/5",
"debug_mon": "1/5",
"debug_monc": "0/10",
"debug_paxos": "1/5",
"debug_tp": "0/5",
"debug_auth": "1/5",
"debug_crypto": "1/5",
"debug_finisher": "1/1",
"debug_reserver": "1/1",
"debug_heartbeatmap": "1/5",
"debug_perfcounter": "1/5",
"debug_rgw": "1/5",
"debug_civetweb": "1/10",
"debug_javaclient": "1/5",
"debug_asok": "1/5",
"debug_throttle": "1/1",
"debug_refs": "0/0",
"debug_xio": "1/5",
"debug_compressor": "1/5",
"debug_bluestore": "1/5",
"debug_bluefs": "1/5",
"debug_bdev": "1/3",
"debug_kstore": "1/5",
"debug_rocksdb": "4/5",
"debug_leveldb": "4/5",
"debug_memdb": "4/5",
"debug_kinetic": "1/5",
"debug_fuse": "1/5",
"debug_mgr": "1/5",
"debug_mgrc": "1/5",
"debug_dpdk": "1/5",
"debug_eventtrace": "1/5",
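If Log::create_entry really is the sink, one thing we might try (a sketch; my understanding is that the second number in each debug setting is the in-memory gather level, and log entries are still allocated at that level even when nothing is written to the log file) is zeroing the gather levels for the subsystems still gathering at 5 or 10:

```shell
# Zero both the file level and the in-memory gather level (x/y -> 0/0)
# for chatty subsystems; applies at runtime, no restart needed
ceph tell osd.36 injectargs '--debug_ms 0/0 --debug_monc 0/0 --debug_tp 0/0'

# The same change via the daemon's admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.36.asok config set debug_ms 0/0
```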
On Thu, Aug 23, 2018 at 4:27 PM Adrien Gillard <[email protected]>
wrote:
> After upgrading to Luminous, we see the exact same behaviour, with OSDs
> eating as much as 80-90 GB of memory.
>
> We'll try some memory profiling but at this point we're a bit lost. Are
> there any specific logs that could help us?
>
> On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard <[email protected]>
> wrote:
>
>> Well, after a few hours, still nothing new in the behaviour. With half of
>> the OSDs (so 6 per host) up and peering, and the nodown flag set to limit
>> the creation of new maps, all the memory is consumed and OSDs get killed by
>> the OOM killer.
>> We observe a lot of threads being created for each OSD (roughly between
>> 500 and 900). As there are around 200 PGs per OSD, does this seem like a
>> lot to you? We see tons being created each second, most probably to replace
>> the threads that time out or disappear with the OSD processes.
>>
>> We are preparing to upgrade the OSDs to Luminous.
>>
>> On Thu, Aug 23, 2018 at 4:05 AM Gregory Farnum <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard <[email protected]>
>>> wrote:
>>>
>>>> We 'paused' the cluster early in our investigation to avoid unnecessary
>>>> IO.
>>>> We also set the nodown flag but the OOM rate was really sustained and
>>>> we got servers that stopped responding from time to time, so we decided to
>>>> lower the number of OSDs up and let them peer.
>>>>
>>>
>>> Right but this often also means that they need to deal with changing
>>> intervals or trying to set up recovery and things, plus the change in which
>>> OSDs are up creates new maps that need to be processed. If you set “nodown”
>>> you can incrementally bring up OSDs while doing a better job of throttling
>>> the map creation process and preventing a bunch of new IO when/if they OOM.
>>>
>>>
>>>> I don't know if it is the best solution but it is the middle of the
>>>> night here, so we'll see tomorrow morning and ponder the completion of the
>>>> upgrade.
>>>>
>>>> I agree with you Greg, it does not make sense. The cluster was running
>>>> great for around 6 hours after the mons upgrade...
>>>>
>>>> Finally, regarding the cache-tiering, I know it is not used a lot nor
>>>> will it be maintained over time. One of the main goals of the Luminous
>>>> upgrade is to remove it completely and switch to EC RBD.
>>>>
>>>> Thank you for your inputs !
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 11:57 PM Gregory Farnum <[email protected]>
>>>> wrote:
>>>>
>>>>> On Wed, Aug 22, 2018 at 6:35 AM Adrien Gillard <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We are having a hard time figuring out a behaviour encountered after
>>>>>> upgrading the monitors of one of our clusters from Jewel to Luminous
>>>>>> yesterday.
>>>>>>
>>>>>> The cluster is composed of 14 OSD hosts (2xE5-2640 v3 and 64 GB of
>>>>>> RAM), each containing 12x4TB OSDs with journals on DC-grade SSDs. The
>>>>>> cluster was installed on Infernalis and upgraded to Jewel. We mainly
>>>>>> use it for RBD with a cache tier in front of an EC pool.
>>>>>>
>>>>>> We started the upgrade from Jewel 10.2.10 to Luminous 12.2.7
>>>>>> yesterday afternoon with the monitors and the provisioning of the MGR.
>>>>>> We were planning on upgrading the OSD servers today.
>>>>>> All went well until the cluster received the following commands, a
>>>>>> few hours after the upgrade:
>>>>>>
>>>>>> ceph osd pool set pool-cache-01 target_max_bytes 0
>>>>>> ceph osd pool set pool-cache-02 target_max_bytes 0
>>>>>>
>>>>>> (The context here is that we use this cluster for backups and set
>>>>>> target_max_bytes to 0 to stop the flush and give maximum resources to
>>>>>> the backup IOs.)
>>>>>>
>>>>>> Below is an extract from the logs of one of the monitors.
>>>>>> The cluster started exchanging a lot over the cluster network,
>>>>>> apparently backfill/recovery at multiple Gbit/s, in an amount which
>>>>>> we've never seen before. Due to some undersized network uplinks, the
>>>>>> traffic could not be handled well and this led to OSDs crashing.
>>>>>>
>>>>>
>>>>> This really doesn't make a ton of sense — you're saying that by
>>>>> telling the cluster to *stop* doing flushing IO, everything went bad? In
>>>>> terms of root causing the issue you'll obviously want to focus on that,
>>>>> although it may or may not be helpful in recovering things at this point.
>>>>>
>>>>> Do keep in mind that users running cache-tiering are increasingly
>>>>> uncommon, and while we still test it we aren't expanding that coverage and
>>>>> would really like to remove and replace that functionality, so you're out
>>>>> on the far edge of things here. :/
>>>>>
>>>>> Finally, if you're still experiencing OOM issues, you may want to set
>>>>> the "nodown" flag — that will prevent OSDs being marked down, which
>>>>> obviously will stall IO and peering if they go down anyway, but will
>>>>> prevent spamming map updates that cause the cluster to OOM again.
>>>>> -Greg
>>>>>
>>>>>
>>>>>>
>>>>>> At the moment we are still investigating the logs to try to
>>>>>> understand what happened exactly. The cluster is down. When we restart
>>>>>> the OSDs, PGs start activating and peering, but at some point all the
>>>>>> RAM and swap of the servers is consumed and the OSDs crash.
>>>>>> I tried to set the obvious cluster parameters
>>>>>> (osd_recovery_max_active=1, osd_recovery_threads=1, osd_max_backfills=1)
>>>>>> and norecover, nobackfill, but when the OSDs start, we still see 4 or 5
>>>>>> Gbit/s over the cluster network.
>>>>>>
>>>>>> We are currently reconfiguring the network to try to remove the
>>>>>> bandwidth problem and are preparing to add more RAM to the servers.
>>>>>>
>>>>>> But still, I have a couple of questions:
>>>>>>
>>>>>> - Has anyone encountered the issue we have with the cache? Or have a
>>>>>> possible explanation?
>>>>>> - Is it normal to observe so much traffic on the cluster network
>>>>>> when the PGs are activating/peering with norecover and nobackfill set?
>>>>>>
>>>>>> Thank you for reading and for your input !
>>>>>>
>>>>>>
>>>>>> 2018-08-21 21:05:02.766969 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 298 MB/s rd, 212 MB/s wr, 456 op/s
>>>>>> 2018-08-21 21:05:03.805124 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 296 MB/s rd, 193 MB/s wr, 433 op/s
>>>>>> 2018-08-21 21:05:03.870690 7f10d83cc700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader)
>>>>>> e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd
>>>>>> pool set", "pool": "pool-cache-02", "val": "0"} v 0
>>>>>> ) v1
>>>>>> 2018-08-21 21:05:03.870749 7f10d83cc700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146529236 -' entity='client.admin' cmd=[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02
>>>>>> ", "val": "0"}]: dispatch
>>>>>> 2018-08-21 21:05:03.913663 7f10d83cc700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader)
>>>>>> e6 handle_command mon_command({"prefix": "osd pool stats", "format":
>>>>>> "json"} v 0) v1
>>>>>> 2018-08-21 21:05:03.913745 7f10d83cc700 0 log_channel(audit) log
>>>>>> [DBG] : from='client.? 10.25.25.3:0/4101432770'
>>>>>> entity='client.admin' cmd=[{"prefix": "osd pool stats", "format":
>>>>>> "json"}]:
>>>>>> dispatch
>>>>>> 2018-08-21 21:05:04.042835 7f10d83cc700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader)
>>>>>> e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd
>>>>>> pool set", "pool": "pool-cache-01", "val": "0"} v 0
>>>>>> ) v1
>>>>>> 2018-08-21 21:05:04.042894 7f10d83cc700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146565481 -' entity='client.admin' cmd=[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01
>>>>>> ", "val": "0"}]: dispatch
>>>>>> 2018-08-21 21:05:04.813217 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 271 MB/s rd, 181 MB/s wr, 415 op/s
>>>>>> 2018-08-21 21:05:04.821486 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114693 e114693: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:04.821711 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114693 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:04.823032 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114693 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:04.823156 7f10de377700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146529236 -' entity='client.admin' cmd='[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>>>>>> 2", "val": "0"}]': finished
>>>>>> 2018-08-21 21:05:04.823280 7f10de377700 0 log_channel(audit) log
>>>>>> [INF] : from='client.146565481 -' entity='client.admin' cmd='[{"var":
>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>>>>>> 1", "val": "0"}]': finished
>>>>>> 2018-08-21 21:05:04.825528 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114693: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:04.844932 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225
>>>>>> TB
>>>>>> / 611 TB avail; 275 MB/s rd, 192 MB/s wr, 429 op/s
>>>>>> 2018-08-21 21:05:05.873682 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 187 peering, 6021 active+clean; 261 TB data, 385
>>>>>> TB
>>>>>> used, 225 TB / 611 TB avail; 220 MB/s rd, 188 MB/s wr, 383
>>>>>> op/s
>>>>>> 2018-08-21 21:05:05.895414 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114694 e114694: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:05.895721 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114694 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:05.905473 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114694 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:05.940729 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114694: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:05.947921 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 204 peering, 6004 active+clean; 261 TB data, 385
>>>>>> TB
>>>>>> used, 225 TB / 611 TB avail; 201 MB/s rd, 200 MB/s wr, 355
>>>>>> op/s
>>>>>> 2018-08-21 21:05:06.912752 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114695 e114695: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:06.913104 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114695 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:06.928521 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114695 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:06.968793 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114695: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:07.001112 7f10d93ce700 1 heartbeat_map
>>>>>> reset_timeout 'Monitor::cpu_tp thread 0x7f10d93ce700' had timed out
>>>>>> after 0
>>>>>> 2018-08-21 21:05:07.004274 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 12 unknown, 114 remapped+peering, 5 activating,
>>>>>> 481
>>>>>> peering, 5596 active+clean; 261 TB data, 385 TB used, 225 T
>>>>>> B / 611 TB avail; 177 MB/s rd, 173 MB/s wr, 317 op/s
>>>>>> 2018-08-21 21:05:07.025200 7f10d8bcd700 1 heartbeat_map
>>>>>> reset_timeout 'Monitor::cpu_tp thread 0x7f10d8bcd700' had timed out
>>>>>> after 0
>>>>>> 2018-08-21 21:05:07.948138 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114696 e114696: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:07.948487 7f10de377700 0
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114696 crush map has features 432365734501548032, adjusting msgr
>>>>>> requires
>>>>>> 2018-08-21 21:05:07.965671 7f10de377700 1
>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd
>>>>>> e114696 create_pending setting backfillfull_ratio = 0.9
>>>>>> 2018-08-21 21:05:08.007585 7f10de377700 0 log_channel(cluster) log
>>>>>> [DBG] : osdmap e114696: 168 total, 168 up, 168 in
>>>>>> 2018-08-21 21:05:08.052153 7f10de377700 0 log_channel(cluster) log
>>>>>> [INF] : pgmap 6208 pgs: 46 activating+remapped, 15 unknown, 305
>>>>>> remapped+peering, 11 activating, 697 peering, 8 active+remapped+backfil
>>>>>> ling, 5126 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB
>>>>>> avail; 115 MB/s rd, 182 MB/s wr, 292 op/s; 26681/519327262 objects
>>>>>> misplaced (0.005%); 4856 kB/s, 3 objects/s recovering
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> [email protected]
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>