Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-28 Thread Nikhil R
Thanks Ashley.
Is there a way we could stop writes to the old OSDs and direct writes only to
the new OSDs?


Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-28 Thread Ashley Merrick
It will mean you have some OSDs that will perform better than others, but
it won't cause any issues within Ceph.

It may help you expand your cluster at the speed you need to fix the MAX
AVAIL issue; however, you're only going to be able to backfill as fast as the
source OSDs can handle. That said, writing is the more intensive part, so I
can imagine the new OSDs will perform better during the backfill than they
would without the SSD journal.

If you have the hardware to do it, it won't hurt. However, I would say you
want to be careful playing too much with a cluster that's in a recovering
state and has OSDs going up and down; if you end up with a new OSD failing
under the high load, it may cause you further issues such as lost objects
that weren't fully replicated.

So adding further OSDs will be at your own risk.


Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-28 Thread Nikhil R
Thanks Paul,
Coming back to my question: is it a good idea to add SSD journals for the HDD
OSDs on a new node, when the existing cluster's OSDs keep their journals on
the same HDDs?

Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-28 Thread Paul Emmerich
Looks like you've got lots of tiny objects. By default the recovery speed
on HDDs is limited to 10 objects per second (40 with the DB on an SSD) per
thread.


Decrease osd_recovery_sleep_hdd (default 0.1) to increase
recovery/backfill speed.
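
At 0.1 s of sleep per op, each recovery thread is capped at roughly
1/0.1 = 10 objects per second, which is why recovery on a cluster with
billions of tiny objects crawls. A sketch of how you might lower it (option
name assumes a release that has the per-device variants; on Jewel-era
clusters the single osd_recovery_sleep knob plays the same role):

  # Inject the lower sleep into all running OSDs; takes effect immediately,
  # but does not survive an OSD restart:
  ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.01'

  # To persist it across restarts, add it to ceph.conf on the OSD hosts:
  [osd]
  osd recovery sleep hdd = 0.01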


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-27 Thread Nikhil R
Hi,
I have set noout, noscrub and nodeep-scrub, and the last time we added OSDs
we added a few at a time.
The main issue here is IOPS: the existing OSDs are not able to backfill at a
higher rate - not even 1 thread during peak hours, and a maximum of 2 threads
during off-peak. We are getting more client I/O, and the documents being
ingested take up more space than is being freed by backfilling PGs to the
newly added OSDs.
Below is our cluster health
 health HEALTH_WARN
5221 pgs backfill_wait
31 pgs backfilling
1453 pgs degraded
4 pgs recovering
1054 pgs recovery_wait
1453 pgs stuck degraded
6310 pgs stuck unclean
384 pgs stuck undersized
384 pgs undersized
recovery 130823732/9142530156 objects degraded (1.431%)
recovery 2446840943/9142530156 objects misplaced (26.763%)
noout,nobackfill,noscrub,nodeep-scrub flag(s) set
mon.mon_1 store is getting too big! 26562 MB >= 15360 MB
mon.mon_2 store is getting too big! 26828 MB >= 15360 MB
mon.mon_3 store is getting too big! 26504 MB >= 15360 MB
 monmap e1: 3 mons at
{mon_1=x.x.x.x:x./0,mon_2=x.x.x.x:/0,mon_3=x.x.x.x:/0}
election epoch 7996, quorum 0,1,2 mon_1,mon_2,mon_3
 osdmap e194833: 105 osds: 105 up, 105 in; 5931 remapped pgs
flags
noout,nobackfill,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v48390703: 10536 pgs, 18 pools, 144 TB data, 2906 Mobjects
475 TB used, 287 TB / 763 TB avail
130823732/9142530156 objects degraded (1.431%)
2446840943/9142530156 objects misplaced (26.763%)
4851 active+remapped+wait_backfill
4226 active+clean
 659 active+recovery_wait+degraded+remapped
 377 active+recovery_wait+degraded
 357 active+undersized+degraded+remapped+wait_backfill
  18 active+recovery_wait+undersized+degraded+remapped
  16 active+degraded+remapped+backfilling
  13 active+degraded+remapped+wait_backfill
   9 active+undersized+degraded+remapped+backfilling
   6 active+remapped+backfilling
   2 active+recovering+degraded
   2 active+recovering+degraded+remapped
  client io 11894 kB/s rd, 105 kB/s wr, 981 op/s rd, 72 op/s wr

So, is it a good option to add the new OSDs on a new node with SSDs as
journals?
in.linkedin.com/in/nikhilravindra




Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-27 Thread Erik McCormick
On Sat, Apr 27, 2019, 3:49 PM Nikhil R  wrote:

> We have baremetal nodes with 256 GB RAM and 36-core CPUs.
> We are on Ceph Jewel 10.2.9 with leveldb.
> The OSDs and journals are on the same HDDs.
> We have 1 backfill_max_active, 1 recovery_max_active and 1
> recovery_op_priority.
> The OSD crashes and restarts once a PG is backfilled and the next PG tries
> to backfill. This is when we see in iostat that the disk is utilised up to 100%.
>

I would set noout to prevent excess movement in the event of OSD flapping,
and disable scrubbing and deep scrubbing until your backfilling has
completed. I would also bring the new OSDs online a few at a time rather
than all 25 at once if you add more servers.
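
For example, a minimal sketch of the corresponding flag commands (set them
before the disruptive work, unset them once backfilling has caught up):

  # Keep OSDs from being marked out and pause scrubbing during recovery:
  ceph osd set noout
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # Revert once the cluster is healthy again:
  ceph osd unset noout
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub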




Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-27 Thread Nikhil R
We have baremetal nodes with 256 GB RAM and 36-core CPUs.
We are on Ceph Jewel 10.2.9 with leveldb.
The OSDs and journals are on the same HDDs.
We have 1 backfill_max_active, 1 recovery_max_active and 1
recovery_op_priority.
The OSD crashes and restarts once a PG is backfilled and the next PG tries to
backfill. This is when we see in iostat that the disk is utilised up to 100%.
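
For reference, a sketch of how those throttles can be inspected and adjusted
at runtime, assuming they refer to the osd_max_backfills,
osd_recovery_max_active and osd_recovery_op_priority options:

  # Show the current values via the admin socket on an OSD host:
  ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'

  # Inject new values into all running OSDs without a restart:
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'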

Appreciate your help David



Re: [ceph-users] IMPORTANT : NEED HELP : Low IOPS on hdd : MAX AVAIL Draining fast

2019-04-27 Thread David C
On Sat, 27 Apr 2019, 18:50 Nikhil R,  wrote:

> Guys,
> We now have a total of 105 OSDs on 5 baremetal nodes, each hosting 21
> OSDs on 7 TB HDDs, with the journals on the HDDs too. Each journal is
> about 5 GB.
>

This would imply you've got a separate HDD partition for the journals; I
don't think there's any value in that, and it would probably be detrimental
to performance.
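
A sketch of how to confirm where each FileStore journal actually lives
(assuming the default /var/lib/ceph layout):

  # The journal symlink in each OSD's data directory points at the journal
  # partition (or at a file on the data disk):
  ls -l /var/lib/ceph/osd/ceph-*/journal

  # ceph-disk also reports which partition each OSD uses as its journal:
  ceph-disk list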

>
> We expanded our cluster last week and added 1 more node with 21 HDD and
> journals on same disk.
> Our client i/o is too heavy and we are not able to backfill even 1 thread
> during peak hours - incase we backfill during peak hours osd's are crashing
> causing undersized pg's and if we have another osd crash we wont be able to
> use our cluster due to undersized and recovery pg's. During non-peak we can
> just backfill 8-10 pgs.
> Due to this our MAX AVAIL is draining out very fast.
>

How much RAM have you got in your nodes? In my experience that's a common
reason for crashing OSDs during recovery ops.

What does your recovery and backfill tuning look like?



> We are thinking of adding 2 more baremetal nodes, each with 21 * 7 TB OSDs
> on HDD, and adding 50 GB SSD journals for these.
> We aim to backfill from the 105 OSDs a bit faster and expect the backfill
> writes landing on these OSDs to be faster.
>

SSD journals would certainly help; just be sure it's a model that performs
well with Ceph.
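
A common sanity check is to measure small synchronous writes with fio, since
that is roughly the pattern the FileStore journal produces. A sketch (replace
/dev/sdX with the candidate journal device; this writes to the device, so
only run it against an unused disk or partition):

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

Drives that sustain high IOPS under --sync=1 tend to make good journals;
many consumer SSDs slow to a crawl on this workload.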

>
> Is this a good viable idea?
> Thoughts please?
>

I'd recommend sharing more detail, e.g. the full spec of the nodes, the Ceph
version, etc.

>
> -Nikhil