Quick turnaround:

Injecting osd_recovery_sleep_hdd = 0 into the running bluestore SSD OSDs 
opened the floodgates.
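
For anyone searching the archives, the runtime injection was along these lines 
(a sketch, not a transcript; "osd.*" targets every OSD, so narrow it if you 
only want the SSD ones):

```shell
# Zero the hdd (and hybrid, for good measure) recovery sleeps on all running
# OSDs. Runtime-only: the values revert when the daemons restart.
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_hybrid 0'

# Spot-check one daemon (run on the host carrying osd.24):
ceph daemon osd.24 config show | grep recovery_sleep
```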

> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr

Graph of the performance jump; the change is extremely marked:
https://imgur.com/a/LZR9R

So at least we now have the gun to go with the smoke.
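
For the archives, the mechanism Greg described downthread boils down to this: 
the OSD picks its recovery sleep from the rotational flags of the main store 
and the journal/db device, so a bluestore SSD OSD whose journal_rotational is 
wrongly 1 lands in the hybrid case. A hedged sketch of that selection (not 
the actual OSD code; defaults taken from our config show output):

```shell
# Sketch of recovery-sleep selection, assuming the behavior described in this
# thread: hdd+hdd -> hdd sleep, ssd+ssd -> ssd sleep, mixed -> hybrid sleep.
pick_recovery_sleep() {
  store_rot=$1     # rotational flag of the main data device (0 or 1)
  journal_rot=$2   # rotational flag of the journal/block.db device
  if [ "$store_rot" = 1 ] && [ "$journal_rot" = 1 ]; then
    echo 0.1       # osd_recovery_sleep_hdd default
  elif [ "$store_rot" = 0 ] && [ "$journal_rot" = 0 ]; then
    echo 0         # osd_recovery_sleep_ssd default
  else
    echo 0.025     # osd_recovery_sleep_hybrid default
  fi
}

# osd.24: bluestore_bdev_rotational=0 but journal_rotational=1 -> hybrid,
# i.e. a 25 ms sleep per recovery op despite being all-flash.
pick_recovery_sleep 0 1   # prints 0.025
```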

Thanks for the help, and I appreciate you pointing me in directions that I 
was able to use to figure out the issue.
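
One more note for posterity: Greg's suggestion downthread to fix the 
rotational flag at the OS layer can be checked per device; the device names 
here are just examples:

```shell
# What the kernel reports for a given block device; 0 = non-rotational (SSD).
# A RAID/HBA card between the SSD and the host can make this wrongly read 1.
cat /sys/block/sda/queue/rotational

# Or for all disks at once:
lsblk -d -o NAME,ROTA
```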

Adding this to ceph.conf for future OSD conversions.
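
Concretely, the lines going into ceph.conf look something like this (placing 
them in the [osd] section is my assumption; adjust to your layout):

```ini
[osd]
# Workaround for bluestore SSD OSDs whose journal_rotational is mis-reported
# as 1: disable the hdd and hybrid recovery sleeps so backfill/recovery is
# not throttled. Revert once the metadata fix lands in a Luminous release.
osd_recovery_sleep_hdd = 0
osd_recovery_sleep_hybrid = 0
```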

Thanks,

Reed


> On Feb 26, 2018, at 4:12 PM, Reed Dier <[email protected]> wrote:
> 
> For the record, I am not seeing a demonstrable fix from injecting the value 
> of 0 into the running OSDs.
>> osd_recovery_sleep_hybrid = '0.000000' (not observed, change may require 
>> restart)
> 
> If it does indeed need a restart, I will need to wait for the current 
> backfills to finish, as restarting an OSD would bring me under min_size.
> 
> However, running config show against the OSD daemon shows that the value of 
> 0 was accepted.
> 
>> ceph daemon osd.24 config show | grep recovery_sleep
>>     "osd_recovery_sleep": "0.000000",
>>     "osd_recovery_sleep_hdd": "0.100000",
>>     "osd_recovery_sleep_hybrid": "0.000000",
>>     "osd_recovery_sleep_ssd": "0.000000",
> 
> 
> I may take the restart as an opportunity to also move to 12.2.3 at the same 
> time, since that is not expected to affect this issue.
> 
> I could also change osd_recovery_sleep_hdd as well; since these are SSD 
> OSDs it shouldn’t make a difference, but it’s a free move.
> 
> Thanks,
> 
> Reed
> 
>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum <[email protected]> wrote:
>> 
>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier <[email protected]> wrote:
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
>> solution to getting the metadata configured correctly.
>> 
>> Yes, that's a good workaround as long as you don't have any actual hybrid 
>> OSDs (or aren't worried about them sleeping...I'm not sure if that setting 
>> came from experience or not).
>>  
>> 
>> For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
>> with NVMe block.db.
>> 
>>> {
>>>         "id": 24,
>>>         "arch": "x86_64",
>>>         "back_addr": "",
>>>         "back_iface": "bond0",
>>>         "bluefs": "1",
>>>         "bluefs_db_access_mode": "blk",
>>>         "bluefs_db_block_size": "4096",
>>>         "bluefs_db_dev": "259:0",
>>>         "bluefs_db_dev_node": "nvme0n1",
>>>         "bluefs_db_driver": "KernelDevice",
>>>         "bluefs_db_model": "INTEL SSDPEDMD400G4                     ",
>>>         "bluefs_db_partition_path": "/dev/nvme0n1p4",
>>>         "bluefs_db_rotational": "0",
>>>         "bluefs_db_serial": " ",
>>>         "bluefs_db_size": "16000221184",
>>>         "bluefs_db_type": "nvme",
>>>         "bluefs_single_shared_device": "0",
>>>         "bluefs_slow_access_mode": "blk",
>>>         "bluefs_slow_block_size": "4096",
>>>         "bluefs_slow_dev": "253:8",
>>>         "bluefs_slow_dev_node": "dm-8",
>>>         "bluefs_slow_driver": "KernelDevice",
>>>         "bluefs_slow_model": "",
>>>         "bluefs_slow_partition_path": "/dev/dm-8",
>>>         "bluefs_slow_rotational": "0",
>>>         "bluefs_slow_size": "1920378863616",
>>>         "bluefs_slow_type": "ssd",
>>>         "bluestore_bdev_access_mode": "blk",
>>>         "bluestore_bdev_block_size": "4096",
>>>         "bluestore_bdev_dev": "253:8",
>>>         "bluestore_bdev_dev_node": "dm-8",
>>>         "bluestore_bdev_driver": "KernelDevice",
>>>         "bluestore_bdev_model": "",
>>>         "bluestore_bdev_partition_path": "/dev/dm-8",
>>>         "bluestore_bdev_rotational": "0",
>>>         "bluestore_bdev_size": "1920378863616",
>>>         "bluestore_bdev_type": "ssd",
>>>         "ceph_version": "ceph version 12.2.2 
>>> (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>>>         "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>>>         "default_device_class": "ssd",
>>>         "distro": "ubuntu",
>>>         "distro_description": "Ubuntu 16.04.3 LTS",
>>>         "distro_version": "16.04",
>>>         "front_addr": "",
>>>         "front_iface": "bond0",
>>>         "hb_back_addr": "",
>>>         "hb_front_addr": "",
>>>         "hostname": "host00",
>>>         "journal_rotational": "1",
>>>         "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 
>>> UTC 2018",
>>>         "kernel_version": "4.13.0-26-generic",
>>>         "mem_swap_kb": "124999672",
>>>         "mem_total_kb": "131914008",
>>>         "os": "Linux",
>>>         "osd_data": "/var/lib/ceph/osd/ceph-24",
>>>         "osd_objectstore": "bluestore",
>>>         "rotational": "0"
>>>     }
>> 
>> 
>> So it looks like it guessed(?) bluestore_bdev_type/default_device_class 
>> correctly (though that may have been an inherited value?), and 
>> bluefs_db_type was likewise correctly set to nvme.
>> 
>> So I’m not sure why journal_rotational is still showing 1.
>> Maybe something in the ceph-volume lvm piece isn’t correctly setting that 
>> flag on OSD creation?
>> It also seems like the journal_rotational field should be deprecated for 
>> bluestore, since bluefs_db_rotational covers it; if there were a WAL 
>> partition as well, I assume there would be something like 
>> bluefs_wal_rotational, and journal_rotational would never be used for 
>> bluestore.
>> 
>> Thanks to both of you for helping diagnose this issue. I created a ticket 
>> and have a PR up to fix it: http://tracker.ceph.com/issues/23141, 
>> https://github.com/ceph/ceph/pull/20602
>> 
>> Until that gets backported into another Luminous release you'll need to do 
>> some kind of workaround though. :/
>> -Greg
>>  
>> 
>> Appreciate the help.
>> 
>> Thanks,
>> Reed
>> 
>>> On Feb 26, 2018, at 1:28 PM, Gregory Farnum <[email protected]> wrote:
>>> 
>>> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier <[email protected]> wrote:
>>> The ‘good perf’ that I reported below was the result of beginning 5 new 
>>> bluestore conversions, which produces a leading edge of ‘good’ 
>>> performance before trickling off.
>>> 
>>> This performance lasted about 20 minutes, where it backfilled a small set 
>>> of PGs off of non-bluestore OSDs.
>>> 
>>> Current performance is now hovering around:
>>>> pool objects-ssd id 20
>>>>   recovery io 14285 kB/s, 202 objects/s
>>>> 
>>>> pool fs-metadata-ssd id 16
>>>>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>>>>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>>> 
>>>> What are you referencing when you talk about recovery ops per second?
>>> 
>>> These are recovery ops as reported by ceph -s or via stats exported via 
>>> influx plugin in mgr, and via local collectd collection.
>>> 
>>>> Also, what are the values for osd_recovery_sleep_hdd and 
>>>> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" 
>>>> that your BlueStore SSD OSDs are correctly reporting both themselves and 
>>>> their journals as non-rotational?
>>> 
>>> This yields more interesting results.
>>> Pasting results for 3 sets of OSDs, in this order:
>>>  {0} hdd + nvme block.db
>>> {24} ssd + nvme block.db
>>> {59} ssd + nvme journal
>>> 
>>>> ceph osd metadata | grep 'id\|rotational'
>>>> "id": 0,
>>>>         "bluefs_db_rotational": "0",
>>>>         "bluefs_slow_rotational": "1",
>>>>         "bluestore_bdev_rotational": "1",
>>>>         "journal_rotational": "1",
>>>>         "rotational": "1"
>>>> "id": 24,
>>>>         "bluefs_db_rotational": "0",
>>>>         "bluefs_slow_rotational": "0",
>>>>         "bluestore_bdev_rotational": "0",
>>>>         "journal_rotational": "1",
>>>>         "rotational": "0"
>>>> "id": 59,
>>>>         "journal_rotational": "0",
>>>>         "rotational": "0"
>>> 
>>> I wonder if it matters/is correct to see "journal_rotational": "1" for 
>>> the bluestore OSDs {0,24} with nvme block.db.
>>> 
>>> Hope this may be helpful in determining the root cause.
>>> 
>>> If you have an SSD main store and a hard drive ("rotational") journal, the 
>>> OSD will insert recovery sleeps from the osd_recovery_sleep_hybrid config 
>>> option. By default that is .025 (seconds).
>>> 
>>> I believe you can override the setting (I'm not sure how), but you really 
>>> want to correct that flag at the OS layer. Generally when we see this 
>>> there's a RAID card or something between the solid-state device and the 
>>> host which is lying about the state of the world.
>>> -Greg
>>>  
>>> 
>>> If it helps, all of the OSDs were originally deployed with ceph-deploy, 
>>> but are now being redone with ceph-volume locally on each host.
>>> 
>>> Thanks,
>>> 
>>> Reed
>>> 
>>>> On Feb 26, 2018, at 1:00 PM, Gregory Farnum <[email protected]> wrote:
>>>> 
>>>> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier <[email protected]> wrote:
>>>> After my last round of backfills completed, I started 5 more bluestore 
>>>> conversions, which helped me recognize a very specific pattern of 
>>>> performance.
>>>> 
>>>>> pool objects-ssd id 20
>>>>>   recovery io 757 MB/s, 10845 objects/s
>>>>> 
>>>>> pool fs-metadata-ssd id 16
>>>>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>>>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>>>> 
>>>> The “non-throttled” backfills are only coming from filestore SSD OSDs.
>>>> When backfilling from bluestore SSD OSDs, they appear to be throttled at 
>>>> the aforementioned <20 ops per OSD.
>>>> 
>>>> Wait, is that the current state? What are you referencing when you talk 
>>>> about recovery ops per second?
>>>> 
>>>> Also, what are the values for osd_recovery_sleep_hdd and 
>>>> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" 
>>>> that your BlueStore SSD OSDs are correctly reporting both themselves and 
>>>> their journals as non-rotational?
>>>> -Greg
>>>>  
>>>> 
>>>> This would explain why the first batch of SSDs I migrated to bluestore 
>>>> all backfilled at “full” speed: all of the OSDs they were backfilling 
>>>> from were filestore-based. With more and more bluestore backfill 
>>>> targets, backfill times grow increasingly long as I move from one host 
>>>> to the next.
>>>> 
>>>> Looking at the recovery settings, the recovery_sleep and 
>>>> recovery_sleep_ssd values across bluestore or filestore OSDs are showing 
>>>> as 0 values, which means no sleep/throttle if I am reading everything 
>>>> correctly.
>>>> 
>>>>> sudo ceph daemon osd.73 config show | grep recovery
>>>>>     "osd_allow_recovery_below_min_size": "true",
>>>>>     "osd_debug_skip_full_check_in_recovery": "false",
>>>>>     "osd_force_recovery_pg_log_entries_factor": "1.300000",
>>>>>     "osd_min_recovery_priority": "0",
>>>>>     "osd_recovery_cost": "20971520",
>>>>>     "osd_recovery_delay_start": "0.000000",
>>>>>     "osd_recovery_forget_lost_objects": "false",
>>>>>     "osd_recovery_max_active": "35",
>>>>>     "osd_recovery_max_chunk": "8388608",
>>>>>     "osd_recovery_max_omap_entries_per_chunk": "64000",
>>>>>     "osd_recovery_max_single_start": "1",
>>>>>     "osd_recovery_op_priority": "3",
>>>>>     "osd_recovery_op_warn_multiple": "16",
>>>>>     "osd_recovery_priority": "5",
>>>>>     "osd_recovery_retry_interval": "30.000000",
>>>>>     "osd_recovery_sleep": "0.000000",
>>>>>     "osd_recovery_sleep_hdd": "0.100000",
>>>>>     "osd_recovery_sleep_hybrid": "0.025000",
>>>>>     "osd_recovery_sleep_ssd": "0.000000",
>>>>>     "osd_recovery_thread_suicide_timeout": "300",
>>>>>     "osd_recovery_thread_timeout": "30",
>>>>>     "osd_scrub_during_recovery": "false",
>>>> 
>>>> 
>>>> As far as I know, the device class is configured correctly; it all 
>>>> shows as ssd/hdd correctly in ceph osd tree.
>>>> 
>>>> So hopefully this may be enough of a smoking gun to help narrow down where 
>>>> this may be stemming from.
>>>> 
>>>> Thanks,
>>>> 
>>>> Reed
>>>> 
>>>>> On Feb 23, 2018, at 10:04 AM, David Turner <[email protected]> wrote:
>>>>> 
>>>>> Here is a [1] link to a ML thread tracking some slow backfilling on 
>>>>> bluestore.  It came down to the backfill sleep setting for them.  Maybe 
>>>>> it will help.
>>>>> 
>>>>> [1] https://www.mail-archive.com/[email protected]/msg40256.html
>>>>> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier <[email protected]> wrote:
>>>>> Probably unrelated, but I do keep seeing this odd negative objects 
>>>>> degraded message on the fs-metadata pool:
>>>>> 
>>>>>> pool fs-metadata-ssd id 16
>>>>>>   -34/3 objects degraded (-1133.333%)
>>>>>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>>>>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>>>>> 
>>>>> Don’t mean to clutter the ML/thread, but it did seem odd; maybe it’s a 
>>>>> culprit? Or maybe it’s some weird sampling-interval issue that’s been 
>>>>> solved in 12.2.3?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Reed
>>>>> 
>>>>> 
>>>>>> On Feb 23, 2018, at 8:26 AM, Reed Dier <[email protected]> wrote:
>>>>>> 
>>>>>> Below is ceph -s
>>>>>> 
>>>>>>>   cluster:
>>>>>>>     id:     {id}
>>>>>>>     health: HEALTH_WARN
>>>>>>>             noout flag(s) set
>>>>>>>             260610/1068004947 objects misplaced (0.024%)
>>>>>>>             Degraded data redundancy: 23157232/1068004947 objects 
>>>>>>> degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>>>>>> 
>>>>>>>   services:
>>>>>>>     mon: 3 daemons, quorum mon02,mon01,mon03
>>>>>>>     mgr: mon03(active), standbys: mon02
>>>>>>>     mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>>>>>>>     osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>>>>>>          flags noout
>>>>>>> 
>>>>>>>   data:
>>>>>>>     pools:   5 pools, 5316 pgs
>>>>>>>     objects: 339M objects, 46627 GB
>>>>>>>     usage:   154 TB used, 108 TB / 262 TB avail
>>>>>>>     pgs:     23157232/1068004947 objects degraded (2.168%)
>>>>>>>              260610/1068004947 objects misplaced (0.024%)
>>>>>>>              4984 active+clean
>>>>>>>              183  active+undersized+degraded+remapped+backfilling
>>>>>>>              145  active+undersized+degraded+remapped+backfill_wait
>>>>>>>              3    active+remapped+backfill_wait
>>>>>>>              1    active+remapped+backfilling
>>>>>>> 
>>>>>>>   io:
>>>>>>>     client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>>>>>>     recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>>>>>> 
>>>>>> Also, of the two pools on the SSDs, the objects pool is at 4096 PGs 
>>>>>> and the fs-metadata pool at 32 PGs.
>>>>>> 
>>>>>>> Are you sure the recovery is actually going slower, or are the 
>>>>>>> individual ops larger or more expensive?
>>>>>> 
>>>>>> The objects should not vary wildly in size.
>>>>>> Even if they did, the SSDs are roughly idle in their current state of 
>>>>>> backfilling when examining wait in iotop, atop, or sysstat/iostat.
>>>>>> 
>>>>>> This compares to when I was fully saturating the SATA backplane with 
>>>>>> over 1000MB/s of writes to multiple disks when the backfills were going 
>>>>>> “full speed.”
>>>>>> 
>>>>>> Here is a breakdown of recovery io by pool:
>>>>>> 
>>>>>>> pool objects-ssd id 20
>>>>>>>   recovery io 6779 kB/s, 92 objects/s
>>>>>>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>>>>>>> 
>>>>>>> pool fs-metadata-ssd id 16
>>>>>>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>>>>>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>>>>>>> 
>>>>>>> pool cephfs-hdd id 17
>>>>>>>   recovery io 40542 kB/s, 158 objects/s
>>>>>>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>>>>>> 
>>>>>> So the 24 HDDs are outperforming the 50 SSDs for recovery and client 
>>>>>> traffic at the moment, which seems suspicious to me.
>>>>>> 
>>>>>> Most of the OSDs with recovery ops to the SSDs are reporting 8-12 ops, 
>>>>>> with one OSD occasionally spiking up to 300-500 for a few minutes. 
>>>>>> Stats are being pulled by both local CollectD instances on each node 
>>>>>> and the Influx plugin in MGR as we evaluate that against collectd.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Reed
>>>>>> 
>>>>>> 
>>>>>>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <[email protected]> wrote:
>>>>>>> 
>>>>>>> What's the output of "ceph -s" while this is happening?
>>>>>>> 
>>>>>>> Is there some identifiable difference between these two states, like 
>>>>>>> you get a lot of throughput on the data pools but then metadata 
>>>>>>> recovery is slower?
>>>>>>> 
>>>>>>> Are you sure the recovery is actually going slower, or are the 
>>>>>>> individual ops larger or more expensive?
>>>>>>> 
>>>>>>> My WAG is that recovering the metadata pool, composed mostly of 
>>>>>>> directories stored in omap objects, is going much slower for some 
>>>>>>> reason. You can adjust the cost of those individual ops some by 
>>>>>>> changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but 
>>>>>>> I'm not sure which way you want to go or indeed if this has anything to 
>>>>>>> do with the problem you're seeing. (eg, it could be that reading out 
>>>>>>> the omaps is expensive, so you can get higher recovery op numbers by 
>>>>>>> turning down the number of entries per request, but not actually see 
>>>>>>> faster backfilling because you have to issue more requests.)
>>>>>>> -Greg
>>>>>>> 
>>>>>>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I am running into an odd situation that I cannot easily explain.
>>>>>>> I am currently in the midst of destroy and rebuild of OSDs from 
>>>>>>> filestore to bluestore.
>>>>>>> With my HDDs, I am seeing expected behavior, but with my SSDs I am 
>>>>>>> seeing unexpected behavior. The HDDs and SSDs are set in crush 
>>>>>>> accordingly.
>>>>>>> 
>>>>>>> My path to replacing the OSDs is to set the noout, norecover, and 
>>>>>>> norebalance flags, destroy the OSD, recreate it (iterating n times, 
>>>>>>> all within a single failure domain), unset the flags, and let it go. 
>>>>>>> When it finishes: rinse, repeat.
>>>>>>> 
>>>>>>> The SSD OSDs are SATA SSDs (Samsung SM863a), 10 to a node, with 2 
>>>>>>> NVMe drives (Intel P3700), 5 SATA SSDs per NVMe drive, and 16G 
>>>>>>> partitions for block.db (previously filestore journals).
>>>>>>> 2x10GbE networking between the nodes. The SATA backplane caps out at 
>>>>>>> around 10 Gb/s as it has 2x 6 Gb/s controllers. Luminous 12.2.2.
>>>>>>> 
>>>>>>> When the flags are unset, recovery starts and I see a very large rush 
>>>>>>> of traffic; however, after the first machine completed, performance 
>>>>>>> tapered off rapidly and now trickles. Comparatively, I’m getting 
>>>>>>> 100-200 recovery ops on 3 HDDs backfilling from 21 other HDDs, 
>>>>>>> whereas I’m getting 150-250 recovery ops on 5 SSDs backfilling from 40 
>>>>>>> other SSDs. Every once in a while I will see a spike up to 500, 1000, 
>>>>>>> or even 2000 ops on the SSDs, often a few hundred recovery ops from one 
>>>>>>> OSD, and 8-15 ops from the others that are backfilling.
>>>>>>> 
>>>>>>> This is a far cry from the more than 15-30k recovery ops that it 
>>>>>>> started off recovering with 1-3k recovery ops from a single OSD to the 
>>>>>>> backfilling OSD(s). And an even farther cry from the >15k recovery ops 
>>>>>>> I was sustaining for over an hour or more before. I was able to rebuild 
>>>>>>> a 1.9T SSD (1.1T used) in a little under an hour, and I could do about 
>>>>>>> 5 at a time and still keep it at roughly an hour to backfill all of 
>>>>>>> them, but then I hit a roadblock after the first machine, when I tried 
>>>>>>> to do 10 at a time (single machine). I am now still experiencing the 
>>>>>>> same thing on the third node, while doing 5 OSDs at a time. 
>>>>>>> 
>>>>>>> The pools associated with these SSDs are cephfs-metadata, as well as a 
>>>>>>> pure rados object pool we use for our own internal applications. Both 
>>>>>>> are size=3, min_size=2.
>>>>>>> 
>>>>>>> It appears I am not the first to run into this, but it looks like 
>>>>>>> there was no resolution: 
>>>>>>> https://www.spinics.net/lists/ceph-users/msg41493.html
>>>>>>> 
>>>>>>> Recovery parameters for the OSDs match what was in the previous thread, 
>>>>>>> sans the osd conf block listed. And current osd_max_backfills = 30 and 
>>>>>>> osd_recovery_max_active = 35. Very little activity on the OSDs during 
>>>>>>> this period, so should not be any contention for iops on the SSDs.
>>>>>>> 
>>>>>>> The only oddity that I can attribute to things is that disk load on 
>>>>>>> one of the mons was a few times high enough to briefly knock it out 
>>>>>>> of quorum. But I wouldn’t think backfills would get throttled just 
>>>>>>> because of mons flapping.
>>>>>>> 
>>>>>>> Hopefully someone has some experience or can steer me in a path to 
>>>>>>> improve the performance of the backfills so that I’m not stuck in 
>>>>>>> backfill purgatory longer than I need to be.
>>>>>>> 
>>>>>>> Linking an imgur album with some screen grabs of the recovery ops over 
>>>>>>> time for the first machine, versus the second and third machines to 
>>>>>>> demonstrate the delta between them.
>>>>>>> https://imgur.com/a/OJw4b
>>>>>>> 
>>>>>>> Also including a ceph osd df of the SSDs; highlighted in red are the 
>>>>>>> OSDs currently backfilling. Could this possibly be PG overdose? I 
>>>>>>> don’t ever run into ‘stuck activating’ PGs; it’s just painfully slow 
>>>>>>> backfills, as if they are being throttled by ceph, that are causing 
>>>>>>> me to worry. Drives aren’t worn, <30 P/E cycles on the drives, so 
>>>>>>> plenty of life left in them.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Reed
>>>>>>> 
>>>>>>>> $ ceph osd df
>>>>>>>> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>>>>>>>> 24   ssd 1.76109  1.00000 1803G 1094G  708G 60.69 1.08 260
>>>>>>>> 25   ssd 1.76109  1.00000 1803G 1136G  667G 63.01 1.12 271
>>>>>>>> 26   ssd 1.76109  1.00000 1803G 1018G  785G 56.46 1.01 243
>>>>>>>> 27   ssd 1.76109  1.00000 1803G 1065G  737G 59.10 1.05 253
>>>>>>>> 28   ssd 1.76109  1.00000 1803G 1026G  776G 56.94 1.02 245
>>>>>>>> 29   ssd 1.76109  1.00000 1803G 1132G  671G 62.79 1.12 270
>>>>>>>> 30   ssd 1.76109  1.00000 1803G  944G  859G 52.35 0.93 224
>>>>>>>> 31   ssd 1.76109  1.00000 1803G 1061G  742G 58.85 1.05 252
>>>>>>>> 32   ssd 1.76109  1.00000 1803G 1003G  799G 55.67 0.99 239
>>>>>>>> 33   ssd 1.76109  1.00000 1803G 1049G  753G 58.20 1.04 250
>>>>>>>> 34   ssd 1.76109  1.00000 1803G 1086G  717G 60.23 1.07 257
>>>>>>>> 35   ssd 1.76109  1.00000 1803G  978G  824G 54.26 0.97 232
>>>>>>>> 36   ssd 1.76109  1.00000 1803G 1057G  745G 58.64 1.05 252
>>>>>>>> 37   ssd 1.76109  1.00000 1803G 1025G  777G 56.88 1.01 244
>>>>>>>> 38   ssd 1.76109  1.00000 1803G 1047G  756G 58.06 1.04 250
>>>>>>>> 39   ssd 1.76109  1.00000 1803G 1031G  771G 57.20 1.02 246
>>>>>>>> 40   ssd 1.76109  1.00000 1803G 1029G  774G 57.07 1.02 245
>>>>>>>> 41   ssd 1.76109  1.00000 1803G 1033G  770G 57.28 1.02 245
>>>>>>>> 42   ssd 1.76109  1.00000 1803G  993G  809G 55.10 0.98 236
>>>>>>>> 43   ssd 1.76109  1.00000 1803G 1072G  731G 59.45 1.06 256
>>>>>>>> 44   ssd 1.76109  1.00000 1803G 1039G  763G 57.64 1.03 248
>>>>>>>> 45   ssd 1.76109  1.00000 1803G  992G  810G 55.06 0.98 236
>>>>>>>> 46   ssd 1.76109  1.00000 1803G 1068G  735G 59.23 1.06 254
>>>>>>>> 47   ssd 1.76109  1.00000 1803G 1020G  783G 56.57 1.01 242
>>>>>>>> 48   ssd 1.76109  1.00000 1803G  945G  857G 52.44 0.94 225
>>>>>>>> 49   ssd 1.76109  1.00000 1803G  649G 1154G 36.01 0.64 139
>>>>>>>> 50   ssd 1.76109  1.00000 1803G  426G 1377G 23.64 0.42  83
>>>>>>>> 51   ssd 1.76109  1.00000 1803G  610G 1193G 33.84 0.60 131
>>>>>>>> 52   ssd 1.76109  1.00000 1803G  558G 1244G 30.98 0.55 118
>>>>>>>> 53   ssd 1.76109  1.00000 1803G  731G 1072G 40.54 0.72 161
>>>>>>>> 54   ssd 1.74599  1.00000 1787G  859G  928G 48.06 0.86 229
>>>>>>>> 55   ssd 1.74599  1.00000 1787G  942G  844G 52.74 0.94 252
>>>>>>>> 56   ssd 1.74599  1.00000 1787G  928G  859G 51.94 0.93 246
>>>>>>>> 57   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
>>>>>>>> 58   ssd 1.74599  1.00000 1787G  963G  824G 53.87 0.96 255
>>>>>>>> 59   ssd 1.74599  1.00000 1787G  909G  877G 50.89 0.91 241
>>>>>>>> 60   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
>>>>>>>> 61   ssd 1.74599  1.00000 1787G  892G  895G 49.91 0.89 238
>>>>>>>> 62   ssd 1.74599  1.00000 1787G  927G  859G 51.90 0.93 245
>>>>>>>> 63   ssd 1.74599  1.00000 1787G  864G  922G 48.39 0.86 229
>>>>>>>> 64   ssd 1.74599  1.00000 1787G  968G  819G 54.16 0.97 257
>>>>>>>> 65   ssd 1.74599  1.00000 1787G  892G  894G 49.93 0.89 237
>>>>>>>> 66   ssd 1.74599  1.00000 1787G  951G  836G 53.23 0.95 252
>>>>>>>> 67   ssd 1.74599  1.00000 1787G  878G  908G 49.16 0.88 232
>>>>>>>> 68   ssd 1.74599  1.00000 1787G  899G  888G 50.29 0.90 238
>>>>>>>> 69   ssd 1.74599  1.00000 1787G  948G  839G 53.04 0.95 252
>>>>>>>> 70   ssd 1.74599  1.00000 1787G  914G  873G 51.15 0.91 246
>>>>>>>> 71   ssd 1.74599  1.00000 1787G 1004G  782G 56.21 1.00 266
>>>>>>>> 72   ssd 1.74599  1.00000 1787G  812G  974G 45.47 0.81 216
>>>>>>>> 73   ssd 1.74599  1.00000 1787G  932G  855G 52.15 0.93 247
>>>>>> 
>>>>> 
>>>> 
>> 
> 

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
