After upgrading to Luminous, we see the exact same behaviour, with OSDs
eating as much as 80-90 GB of memory.

We'll try some memory profiling, but at this point we're a bit lost. Are
there any specific logs that could help us?
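
For what it's worth, the built-in tcmalloc heap profiler and the mempool stats
are the obvious starting points, along these lines (assuming the OSDs are
linked against tcmalloc; osd.0 is just an example id):

ceph tell osd.0 heap start_profiler      # start collecting allocation samples
ceph tell osd.0 heap stats               # quick summary of heap usage
ceph tell osd.0 heap dump                # write a profile dump for pprof
ceph tell osd.0 heap stop_profiler
ceph daemon osd.0 dump_mempools          # per-subsystem memory, via the admin socket on the OSD host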

On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard <[email protected]>
wrote:

> Well, after a few hours, still nothing new in the behaviour. With half of
> the OSDs (so 6 per host) up and peering and the nodown flag set to limit
> the creation of new maps, all the memory is consumed and OSDs get killed by
> the OOM killer.
> We observe a lot of threads being created for each OSD (roughly between
> 500 and 900). As there are around 200 PGs per OSD, does this seem like a
> lot to you? We see tons being created each second, most probably to replace
> the threads that time out or disappear with the OSD processes.
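>
> (For reference, per-OSD thread counts can be checked with plain Linux
> tools, nothing Ceph-specific, e.g. by looping over the running ceph-osd
> processes:
>
> for pid in $(pidof ceph-osd); do
>   echo "$pid: $(ls /proc/$pid/task | wc -l) threads"
> done
> )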
>
> We are preparing to upgrade the OSDs to Luminous.
>
> On Thu, Aug 23, 2018 at 4:05 AM Gregory Farnum <[email protected]> wrote:
>
>>
>>
>> On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard <[email protected]>
>> wrote:
>>
>>> We 'paused' the cluster early in our investigation to avoid unnecessary
>>> IO.
>>> We also set the nodown flag, but the OOM rate was really sustained and
>>> some servers stopped responding from time to time, so we decided to lower
>>> the number of OSDs up and let them peer.
>>>
>>
>> Right but this often also means that they need to deal with changing
>> intervals or trying to set up recovery and things, plus the change in which
>> OSDs are up creates new maps that need to be processed. If you set “nodown”
>> you can incrementally bring up OSDs while doing a better job of throttling
>> the map creation process and preventing a bunch of new IO when/if they OOM.
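>>
>> For illustration, that sequence boils down to something like this (the OSD
>> unit names are only examples, assuming systemd-managed OSDs):
>>
>> ceph osd set nodown
>> ceph osd set norecover
>> systemctl start ceph-osd@0 ceph-osd@1    # bring up a small batch per host
>> ceph -s                                  # wait for peering to settle before the next batch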
>>
>>
>>> I don't know if it is the best solution, but it is the middle of the
>>> night here, so we'll see tomorrow morning and decide whether to complete
>>> the upgrade.
>>>
>>> I agree with you Greg, it does not make sense. The cluster was running
>>> great for around 6 hours after the mons upgrade...
>>>
>>> Finally, regarding the cache tiering, I know it is not used a lot nor
>>> will it be maintained over time. One of the main goals of the Luminous
>>> upgrade is to remove it completely and switch to EC RBD.
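>>>
>>> (In Luminous that would look something like the following; the pool and
>>> image names are placeholders, and EC overwrites require BlueStore OSDs:
>>>
>>> ceph osd pool set ec-data-pool allow_ec_overwrites true
>>> rbd create --size 100G --data-pool ec-data-pool rbd/backup-image
>>> )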
>>>
>>> Thank you for your input!
>>>
>>>
>>> On Wed, Aug 22, 2018 at 11:57 PM Gregory Farnum <[email protected]>
>>> wrote:
>>>
>>>> On Wed, Aug 22, 2018 at 6:35 AM Adrien Gillard <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We are having a hard time figuring out a behaviour encountered after
>>>>> upgrading the monitors of one of our clusters from Jewel to Luminous
>>>>> yesterday.
>>>>>
>>>>> The cluster is composed of 14 OSD hosts (2x E5-2640 v3 and 64 GB of
>>>>> RAM), each containing 12x 4TB OSDs with journals on DC-grade SSDs. The
>>>>> cluster was installed with Infernalis and upgraded to Jewel. We mainly
>>>>> use it for RBD with a cache tier in front of an EC pool.
>>>>>
>>>>> We started the upgrade from Jewel 10.2.10 to Luminous 12.2.7 yesterday
>>>>> afternoon with the monitors and the provisioning of the MGR. We were
>>>>> planning on upgrading the OSD servers today.
>>>>> All went well until the cluster received the following commands, a few
>>>>> hours after the upgrade:
>>>>>
>>>>> ceph osd pool set pool-cache-01 target_max_bytes 0
>>>>> ceph osd pool set pool-cache-02 target_max_bytes 0
>>>>>
>>>>> (The context here is that we use this cluster for backups and set
>>>>> target_max_bytes to 0 to stop the flushing and give maximum resources to
>>>>> the backup IOs.)
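>>>>>
>>>>> (For completeness, the previous value can be checked and restored along
>>>>> these lines; the 1 TiB figure is only an example, not our actual setting:
>>>>>
>>>>> ceph osd pool get pool-cache-01 target_max_bytes
>>>>> ceph osd pool set pool-cache-01 target_max_bytes 1099511627776
>>>>> )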
>>>>>
>>>>> Below is an extract from the logs of one of the monitors.
>>>>> The cluster started exchanging a lot of data over the cluster network,
>>>>> apparently backfill/recovery at multiple Gbit/s, in an amount we have
>>>>> never seen before. Because some network uplinks are undersized, the
>>>>> traffic could not be handled well and this led to OSDs crashing.
>>>>>
>>>>
>>>> This really doesn't make a ton of sense — you're saying that by telling
>>>> the cluster to *stop* doing flushing IO, everything went bad? In terms of
>>>> root causing the issue you'll obviously want to focus on that, although it
>>>> may or may not be helpful in recovering things at this point.
>>>>
>>>> Do keep in mind that users running cache-tiering are increasingly
>>>> uncommon, and while we still test it we aren't expanding that coverage and
>>>> would really like to remove and replace that functionality, so you're out
>>>> on the far edge of things here. :/
>>>>
>>>> Finally, if you're still experiencing OOM issues, you may want to set
>>>> the "nodown" flag — that will prevent OSDs being marked down, which
>>>> obviously will stall IO and peering if they go down anyway, but will
>>>> prevent spamming map updates that cause the cluster to OOM again.
>>>> -Greg
>>>>
>>>>
>>>>>
>>>>> At the moment we are still investigating the logs to try to understand
>>>>> what happened exactly. The cluster is down. When we restart the OSDs,
>>>>> PGs start activating and peering, but at some point all the RAM and swap
>>>>> of the servers is consumed and the OSDs crash.
>>>>> I tried to set the obvious cluster parameters
>>>>> (osd_recovery_max_active=1, osd_recovery_threads=1, osd_max_backfills=1)
>>>>> and norecover, nobackfill, but when the OSDs start, we still see 4 or 5
>>>>> Gbit/s over the cluster network.
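>>>>>
>>>>> (For reference, flags and settings of that kind are applied along these
>>>>> lines; shown for completeness rather than as the exact commands we ran:
>>>>>
>>>>> ceph osd set norecover
>>>>> ceph osd set nobackfill
>>>>> ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
>>>>> )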
>>>>>
>>>>> We are currently reconfiguring the network to try to remove the
>>>>> bandwidth problem and are preparing to add more RAM to the servers.
>>>>>
>>>>> But still, I have a couple of questions:
>>>>>
>>>>>  - Has anyone encountered the issue we have with the cache? Or does
>>>>> anyone have a possible explanation?
>>>>>  - Is it normal to observe so much traffic on the cluster network when
>>>>> the PGs are activating/peering with norecover and nobackfill set?
>>>>>
>>>>> Thank you for reading and for your input!
>>>>>
>>>>>
>>>>> 2018-08-21 21:05:02.766969 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 298 MB/s rd, 212 MB/s wr, 456 op/s
>>>>> 2018-08-21 21:05:03.805124 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 296 MB/s rd, 193 MB/s wr, 433 op/s
>>>>> 2018-08-21 21:05:03.870690 7f10d83cc700  0 mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02", "val": "0"} v 0) v1
>>>>> 2018-08-21 21:05:03.870749 7f10d83cc700  0 log_channel(audit) log [INF] : from='client.146529236 -' entity='client.admin' cmd=[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02", "val": "0"}]: dispatch
>>>>> 2018-08-21 21:05:03.913663 7f10d83cc700  0 mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command mon_command({"prefix": "osd pool stats", "format": "json"} v 0) v1
>>>>> 2018-08-21 21:05:03.913745 7f10d83cc700  0 log_channel(audit) log [DBG] : from='client.? 10.25.25.3:0/4101432770' entity='client.admin' cmd=[{"prefix": "osd pool stats", "format": "json"}]: dispatch
>>>>> 2018-08-21 21:05:04.042835 7f10d83cc700  0 mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command mon_command({"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01", "val": "0"} v 0) v1
>>>>> 2018-08-21 21:05:04.042894 7f10d83cc700  0 log_channel(audit) log [INF] : from='client.146565481 -' entity='client.admin' cmd=[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01", "val": "0"}]: dispatch
>>>>> 2018-08-21 21:05:04.813217 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 271 MB/s rd, 181 MB/s wr, 415 op/s
>>>>> 2018-08-21 21:05:04.821486 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 e114693: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:04.821711 7f10de377700  0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:04.823032 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:04.823156 7f10de377700  0 log_channel(audit) log [INF] : from='client.146529236 -' entity='client.admin' cmd='[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02", "val": "0"}]': finished
>>>>> 2018-08-21 21:05:04.823280 7f10de377700  0 log_channel(audit) log [INF] : from='client.146565481 -' entity='client.admin' cmd='[{"var": "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01", "val": "0"}]': finished
>>>>> 2018-08-21 21:05:04.825528 7f10de377700  0 log_channel(cluster) log [DBG] : osdmap e114693: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:04.844932 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 275 MB/s rd, 192 MB/s wr, 429 op/s
>>>>> 2018-08-21 21:05:05.873682 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 187 peering, 6021 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 220 MB/s rd, 188 MB/s wr, 383 op/s
>>>>> 2018-08-21 21:05:05.895414 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 e114694: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:05.895721 7f10de377700  0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:05.905473 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:05.940729 7f10de377700  0 log_channel(cluster) log [DBG] : osdmap e114694: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:05.947921 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 204 peering, 6004 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 201 MB/s rd, 200 MB/s wr, 355 op/s
>>>>> 2018-08-21 21:05:06.912752 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 e114695: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:06.913104 7f10de377700  0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:06.928521 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:06.968793 7f10de377700  0 log_channel(cluster) log [DBG] : osdmap e114695: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:07.001112 7f10d93ce700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f10d93ce700' had timed out after 0
>>>>> 2018-08-21 21:05:07.004274 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 12 unknown, 114 remapped+peering, 5 activating, 481 peering, 5596 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 177 MB/s rd, 173 MB/s wr, 317 op/s
>>>>> 2018-08-21 21:05:07.025200 7f10d8bcd700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f10d8bcd700' had timed out after 0
>>>>> 2018-08-21 21:05:07.948138 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 e114696: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:07.948487 7f10de377700  0 mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 crush map has features 432365734501548032, adjusting msgr requires
>>>>> 2018-08-21 21:05:07.965671 7f10de377700  1 mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 create_pending setting backfillfull_ratio = 0.9
>>>>> 2018-08-21 21:05:08.007585 7f10de377700  0 log_channel(cluster) log [DBG] : osdmap e114696: 168 total, 168 up, 168 in
>>>>> 2018-08-21 21:05:08.052153 7f10de377700  0 log_channel(cluster) log [INF] : pgmap 6208 pgs: 46 activating+remapped, 15 unknown, 305 remapped+peering, 11 activating, 697 peering, 8 active+remapped+backfilling, 5126 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB avail; 115 MB/s rd, 182 MB/s wr, 292 op/s; 26681/519327262 objects misplaced (0.005%); 4856 kB/s, 3 objects/s recovering
>>>>>
>>>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
