Looks like you've got lots of tiny objects. By default, recovery speed
on HDDs is limited to 10 objects per second per thread (40 with the DB
on an SSD).


Decrease osd_recovery_sleep_hdd (default 0.1) to increase
recovery/backfill speed.
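
For example, something along these lines (0.05 is just an illustrative
value; note that on Jewel only the generic osd_recovery_sleep option
exists, the hdd/ssd variants came later):

  # inject into all running OSDs
  ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.05'

  # and persist it in ceph.conf under [osd]
  osd recovery sleep hdd = 0.05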


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Apr 28, 2019 at 6:57 AM Nikhil R <nikh.ravin...@gmail.com> wrote:
>
> Hi,
> I have set noout, noscrub and nodeep-scrub, and the last time we added OSDs
> we added a few at a time.
> The main issue here is IOPS: the existing OSDs are not able to backfill at a
> higher rate - not even 1 thread during peak hours and a max of 2 threads
> during off-peak. We are getting more client I/O, and the documents being
> ingested take up more space than is freed up by backfilling PGs to the newly
> added OSDs.
> Below is our cluster health
>  health HEALTH_WARN
>             5221 pgs backfill_wait
>             31 pgs backfilling
>             1453 pgs degraded
>             4 pgs recovering
>             1054 pgs recovery_wait
>             1453 pgs stuck degraded
>             6310 pgs stuck unclean
>             384 pgs stuck undersized
>             384 pgs undersized
>             recovery 130823732/9142530156 objects degraded (1.431%)
>             recovery 2446840943/9142530156 objects misplaced (26.763%)
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             mon.mon_1 store is getting too big! 26562 MB >= 15360 MB
>             mon.mon_2 store is getting too big! 26828 MB >= 15360 MB
>             mon.mon_3 store is getting too big! 26504 MB >= 15360 MB
>      monmap e1: 3 mons at 
> {mon_1=x.x.x.x:x.yyyy/0,mon_2=x.x.x.x:yyyy/0,mon_3=x.x.x.x:yyyy/0}
>             election epoch 7996, quorum 0,1,2 mon_1,mon_2,mon_3
>      osdmap e194833: 105 osds: 105 up, 105 in; 5931 remapped pgs
>             flags 
> noout,nobackfill,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>       pgmap v48390703: 10536 pgs, 18 pools, 144 TB data, 2906 Mobjects
>             475 TB used, 287 TB / 763 TB avail
>             130823732/9142530156 objects degraded (1.431%)
>             2446840943/9142530156 objects misplaced (26.763%)
>                 4851 active+remapped+wait_backfill
>                 4226 active+clean
>                  659 active+recovery_wait+degraded+remapped
>                  377 active+recovery_wait+degraded
>                  357 active+undersized+degraded+remapped+wait_backfill
>                   18 active+recovery_wait+undersized+degraded+remapped
>                   16 active+degraded+remapped+backfilling
>                   13 active+degraded+remapped+wait_backfill
>                    9 active+undersized+degraded+remapped+backfilling
>                    6 active+remapped+backfilling
>                    2 active+recovering+degraded
>                    2 active+recovering+degraded+remapped
>   client io 11894 kB/s rd, 105 kB/s wr, 981 op/s rd, 72 op/s wr
>
> So, is it a good option to add new OSDs on a new node with SSDs as journals?
> in.linkedin.com/in/nikhilravindra
>
>
>
> On Sun, Apr 28, 2019 at 6:05 AM Erik McCormick <emccorm...@cirrusseven.com> 
> wrote:
>>
>> On Sat, Apr 27, 2019, 3:49 PM Nikhil R <nikh.ravin...@gmail.com> wrote:
>>>
>>> We have bare-metal nodes with 256 GB RAM and 36-core CPUs.
>>> We are on Ceph Jewel 10.2.9 with leveldb.
>>> The OSDs and journals are on the same HDDs.
>>> We have backfill_max_active = 1, recovery_max_active = 1 and
>>> recovery_op_priority = 1.
>>> The OSD crashes and restarts once a PG is backfilled and the next PG tries
>>> to backfill. This is when iostat shows the disk utilised at up to 100%.
>>
>>
>> I would set noout to prevent excess movement in the event of OSD flapping, 
>> and disable scrubbing and deep scrubbing until your backfilling has 
>> completed. I would also bring the new OSDs online a few at a time rather 
>> than all 25 at once if you add more servers.
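>>
>> Roughly, something like this (standard ceph CLI; the flags can be cleared
>> again later with 'ceph osd unset <flag>'):
>>
>>   ceph osd set noout
>>   ceph osd set noscrub
>>   ceph osd set nodeep-scrub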
>>
>>>
>>> Appreciate your help David
>>>
>>> On Sun, 28 Apr 2019 at 00:46, David C <dcsysengin...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On Sat, 27 Apr 2019, 18:50 Nikhil R, <nikh.ravin...@gmail.com> wrote:
>>>>>
>>>>> Guys,
>>>>> We now have a total of 105 OSDs on 5 bare-metal nodes, each hosting 21
>>>>> OSDs on 7 TB HDDs, with the journals on the same HDDs. Each journal is
>>>>> about 5 GB.
>>>>
>>>>
>>>> This would imply you've got a separate HDD partition for the journals. I
>>>> don't think there's any value in that, and it would probably be
>>>> detrimental to performance.
>>>>>
>>>>>
>>>>> We expanded our cluster last week and added 1 more node with 21 HDDs and
>>>>> journals on the same disks.
>>>>> Our client I/O is too heavy and we are not able to backfill even 1 thread
>>>>> during peak hours - if we backfill during peak hours, OSDs crash, causing
>>>>> undersized PGs, and if we have another OSD crash we won't be able to use
>>>>> our cluster due to undersized and recovering PGs. During off-peak hours
>>>>> we can only backfill 8-10 PGs.
>>>>> Due to this our MAX AVAIL is draining very fast.
>>>>
>>>>
>>>> How much RAM have you got in your nodes? In my experience that's a common
>>>> reason for OSDs crashing during recovery ops.
>>>>
>>>> What does your recovery and backfill tuning look like?
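>>>>
>>>> For reference, you can dump the live values from a running OSD, e.g.
>>>> (osd.0 is just an example ID):
>>>>
>>>>   ceph daemon osd.0 config show | grep -E 'backfill|recovery'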
>>>>
>>>>
>>>>>
>>>>> We are thinking of adding 2 more bare-metal nodes with 21 x 7 TB OSDs on
>>>>> HDD each, and adding 50 GB SSD journals for these.
>>>>> We aim to backfill from the existing 105 OSDs a bit faster and expect the
>>>>> backfill writes landing on these new OSDs to be faster.
>>>>
>>>>
>>>> SSD journals would certainly help, just be sure it's a model that performs
>>>> well with Ceph.
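>>>>
>>>> A quick way to sanity-check a journal SSD is a single-threaded sync-write
>>>> test, for example with fio (the device path is just a placeholder, and
>>>> this writes to the device directly, so only run it on an empty/spare
>>>> disk):
>>>>
>>>>   fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
>>>>     --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based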
>>>>>
>>>>>
>>>>> Is this a viable idea?
>>>>> Thoughts please?
>>>>
>>>>
>>>> I'd recommend sharing more detail, e.g. the full spec of the nodes, Ceph
>>>> version, etc.
>>>>>
>>>>>
>>>>> -Nikhil
>>>
>>> --
>>> Sent from my iPhone
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
