After upgrading the OSDs to Luminous, we see the exact same behaviour, with OSDs eating as much as 80-90 GB of memory each.
We'll try some memory profiling, but at this point we're a bit lost. Are there any specific logs that could help us?
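Unless someone has a better idea, the plan for the profiling is to use the built-in tcmalloc heap profiler on one or two of the affected OSDs and compare successive dumps with pprof/google-pprof as the memory grows, roughly along these lines (osd.0 is just an example; if I read the docs right, the dump files end up in the OSD's log directory):

  ceph tell osd.0 heap start_profiler
  # let the process grow for a while, then take a snapshot and the allocator stats
  ceph tell osd.0 heap dump
  ceph tell osd.0 heap stats
  ceph tell osd.0 heap stop_profiler

Now that the OSDs are on Luminous, we should also be able to run "ceph daemon osd.0 dump_mempools" on the OSD host to see which pool the memory is attributed to.

(For anyone catching up later, I have also put a recap of the flags and settings we have applied so far at the bottom of this mail, below the quoted thread.)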
On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard <[email protected]> wrote:

> Well, after a few hours, still nothing new in the behaviour. With half of the OSDs (so 6 per host) up and peering and the nodown flag set to limit the creation of new maps, all the memory is consumed and OSDs get killed by the OOM killer.
> We observe a lot of threads being created for each OSD (roughly between 500 and 900). As there are around 200 PGs per OSD, does this seem like a lot to you? We see tons being created each second, most probably to replace the threads that time out or disappear with the OSD processes.
>
> We are preparing to upgrade the OSDs to Luminous.
>
> On Thu, Aug 23, 2018 at 4:05 AM Gregory Farnum <[email protected]> wrote:
>
>> On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard <[email protected]> wrote:
>>
>>> We 'paused' the cluster early in our investigation to avoid unnecessary IO.
>>> We also set the nodown flag, but the OOM rate was really sustained and we got servers that stopped responding from time to time, so we decided to lower the number of OSDs up and let them peer.
>>
>> Right, but this often also means that they need to deal with changing intervals or trying to set up recovery and things, plus the change in which OSDs are up creates new maps that need to be processed. If you set “nodown” you can incrementally bring up OSDs while doing a better job of throttling the map creation process and preventing a bunch of new IO when/if they OOM.
>>
>>> I don't know if it is the best solution, but it is the middle of the night here, so we'll see tomorrow morning and ponder the completion of the upgrade.
>>>
>>> I agree with you Greg, it does not make sense. The cluster was running great for around 6 hours after the mons upgrade...
>>>
>>> Finally, regarding the cache tiering, I know it is not used a lot nor will it be maintained over time. One of the main goals of the Luminous upgrade is to remove it completely and switch to EC RBD.
>>>
>>> Thank you for your inputs!
>>>
>>> On Wed, Aug 22, 2018 at 11:57 PM Gregory Farnum <[email protected]> wrote:
>>>
>>>> On Wed, Aug 22, 2018 at 6:35 AM Adrien Gillard <[email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We have a hard time figuring out a behaviour encountered after upgrading the monitors of one of our clusters from Jewel to Luminous yesterday.
>>>>>
>>>>> The cluster is composed of 14 OSD hosts (2x E5-2640 v3 and 64 GB of RAM), each containing 12x 4 TB OSDs with journals on DC-grade SSDs. The cluster was installed on Infernalis and upgraded to Jewel. We mainly use it for RBD with a cache tier in front of an EC pool.
>>>>>
>>>>> We started the upgrade from Jewel 10.2.10 to Luminous 12.2.7 yesterday afternoon with the monitors and the provisioning of the MGR. We were planning on upgrading the OSD servers today.
>>>>> All went well until the cluster received the following commands, a few hours after the upgrade:
>>>>>
>>>>> ceph osd pool set pool-cache-01 target_max_bytes 0
>>>>> ceph osd pool set pool-cache-02 target_max_bytes 0
>>>>>
>>>>> (The context here is that we use this cluster for backups and set target_max_bytes to 0 to stop the flush and give maximum resources to the backup IOs.)
>>>>>
>>>>> Below is an extract from the logs of one of the monitors.
>>>>> The cluster started exchanging a lot over the cluster network, apparently backfill/recovery at multiple Gbit/s, in an amount which we've never seen before. Due to some network uplinks being undersized, the traffic could not be handled well and this led to OSDs crashing.
>>>>
>>>> This really doesn't make a ton of sense — you're saying that by telling the cluster to *stop* doing flushing IO, everything went bad? In terms of root causing the issue you'll obviously want to focus on that, although it may or may not be helpful in recovering things at this point.
>>>>
>>>> Do keep in mind that users running cache-tiering are increasingly uncommon, and while we still test it we aren't expanding that coverage and would really like to remove and replace that functionality, so you're out on the far edge of things here. :/
>>>>
>>>> Finally, if you're still experiencing OOM issues, you may want to set the "nodown" flag — that will prevent OSDs being marked down, which obviously will stall IO and peering if they go down anyway, but will prevent spamming map updates that cause the cluster to OOM again.
>>>> -Greg
>>>>
>>>>> At the moment we are still investigating the logs to try to understand what happened exactly. The cluster is down. When we restart the OSDs, PGs start activating and peering, but at some point all the RAM and swap of the servers is consumed and the OSDs crash.
>>>>> I tried to set the obvious cluster parameters (osd_recovery_max_active=1, osd_recovery_threads=1, osd_max_backfills=1) and norecover, nobackfill, but when the OSDs start, we still see 4 or 5 Gbit/s over the cluster network.
>>>>>
>>>>> We are currently reconfiguring the network to try to remove the bandwidth problem and are preparing to add more RAM to the servers.
>>>>>
>>>>> But still, I have a couple of questions:
>>>>>
>>>>> - Has anyone encountered the issue we have with the cache? Or have a possible explanation?
>>>>> - Is it normal to observe so much traffic on the cluster network when the PGs are activating/peering with norecover and nobackfill set?
>>>>>
>>>>> Thank you for reading and for your input!
>>>>>
>>>>> 2018-08-21 21:05:02.766969 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 298 MB/s rd, 212 MB/s wr, 456 op/s
>>>>> 2018-08-21 21:05:03.805124 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 296 MB/s rd, 193 MB/s wr, 433 op/s
>>>>> 2018-08-21 21:05:03.870690 7f10d83cc700 0 mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02", "val": "0"} v 0) v1
>>>>> 2018-08-21 21:05:03.870749 7f10d83cc700 0 log_channel(audit) log [INF] : from='client.146529236 -' entity='client.admin' cmd=[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02", "val": "0"}]: dispatch
>>>>> 2018-08-21 21:05:03.913663 7f10d83cc700 0 mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command mon_command({"prefix": "osd pool stats", "format": "json"} v 0) v1
>>>>> 2018-08-21 21:05:03.913745 7f10d83cc700 0 log_channel(audit) log [DBG] : from='client.? 10.25.25.3:0/4101432770' entity='client.admin' cmd=[{"prefix": "osd pool stats", "format": "json"}]: dispatch
>>>>> 2018-08-21 21:05:04.042835 7f10d83cc700 0 mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01", "val": "0"} v 0) v1
>>>>> 2018-08-21 21:05:04.042894 7f10d83cc700 0 log_channel(audit) log [INF] : from='client.146565481 -' entity='client.admin' cmd=[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01", "val": "0"}]: dispatch
>>>>> 2018-08-21 21:05:04.813217 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 271 MB/s rd, 181 MB/s wr, 415 op/s
>>>>> 2018-08-21 21:05:04.821486 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 e114693: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:04.821711 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:04.823032 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:04.823156 7f10de377700 0 log_channel(audit) log [INF] : from='client.146529236 -' entity='client.admin' cmd='[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02", "val": "0"}]': finished
>>>>> 2018-08-21 21:05:04.823280 7f10de377700 0 log_channel(audit) log [INF] : from='client.146565481 -' entity='client.admin' cmd='[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01", "val": "0"}]': finished
>>>>> 2018-08-21 21:05:04.825528 7f10de377700 0 log_channel(cluster) log [DBG] : osdmap e114693: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:04.844932 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 275 MB/s rd, 192 MB/s wr, 429 op/s
>>>>> 2018-08-21 21:05:05.873682 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 187 peering, 6021 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 220 MB/s rd, 188 MB/s wr, 383 op/s
>>>>> 2018-08-21 21:05:05.895414 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 e114694: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:05.895721 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:05.905473 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:05.940729 7f10de377700 0 log_channel(cluster) log [DBG] : osdmap e114694: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:05.947921 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 204 peering, 6004 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 201 MB/s rd, 200 MB/s wr, 355 op/s
>>>>> 2018-08-21 21:05:06.912752 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 e114695: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:06.913104 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:06.928521 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:06.968793 7f10de377700 0 log_channel(cluster) log [DBG] : osdmap e114695: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:07.001112 7f10d93ce700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f10d93ce700' had timed out after 0
>>>>> 2018-08-21 21:05:07.004274 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 12 unknown, 114 remapped+peering, 5 activating, 481 peering, 5596 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 177 MB/s rd, 173 MB/s wr, 317 op/s
>>>>> 2018-08-21 21:05:07.025200 7f10d8bcd700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f10d8bcd700' had timed out after 0
>>>>> 2018-08-21 21:05:07.948138 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 e114696: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:07.948487 7f10de377700 0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:07.965671 7f10de377700 1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:08.007585 7f10de377700 0 log_channel(cluster) log [DBG] : osdmap e114696: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:08.052153 7f10de377700 0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 46 activating+remapped, 15 unknown, 305 remapped+peering, 11 activating, 697 peering, 8 active+remapped+backfilling, 5126 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 115 MB/s rd, 182 MB/s wr, 292 op/s; 26681/519327262 objects misplaced (0.005%); 4856 kB/s, 3 objects/s recovering
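PS: as promised above, here is a recap of the flags and settings we have applied so far while trying to bring the OSDs back up (same values as mentioned earlier in the thread, nothing new):

  ceph osd set pause
  ceph osd set nodown
  ceph osd set norecover
  ceph osd set nobackfill
  ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_recovery_threads 1 --osd_max_backfills 1'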
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
