Thanks Ashley. Is there a way we could stop writes to the old OSDs and write only to the new OSDs?
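Strictly speaking, writes to a PG go to every OSD in its acting set, so the old OSDs can't be excluded while they still hold data. One approach that is sometimes used, shown here only as a rough sketch assuming a standard CRUSH map, is to lower the CRUSH weight of the old OSDs so that CRUSH places most new data on the new ones. This is risky on a cluster that is still recovering, since reweighting triggers yet more data movement, and osd.12 and the weight values below are placeholders:

    # shrink the share of data CRUSH assigns to an old OSD (placeholder id and weight)
    ceph osd crush reweight osd.12 3.0
    # optionally move the primary role (and most client-facing I/O) off it;
    # replica writes will still land on this OSD
    ceph osd primary-affinity osd.12 0.0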
On Sun, 28 Apr 2019 at 19:21, Ashley Merrick <singap...@amerrick.co.uk> wrote:
> It will mean you have some OSDs that will perform better than others, but it won't cause any issues within Ceph.
>
> It may help you expand your cluster at the speed you need to fix the MAX Avail issue; however, you're only going to be able to backfill as fast as the source OSDs can handle. But writing is more intensive, so I can imagine the new OSDs will perform better during the backfill than they would without the SSD journal.
>
> If you have the hardware to do it, it won't hurt. However, I would say you want to be careful playing too much with a cluster that's in a recovering state and having OSDs go up and down; if you end up with a new OSD failing due to high load, it may cause you further issues with lost objects etc. that weren't fully replicated.
>
> So adding further OSDs will be at your own risk.
>
> On Sun, Apr 28, 2019 at 9:40 PM Nikhil R <nikh.ravin...@gmail.com> wrote:
>> Thanks Paul,
>> Coming back to my question: is it a good idea to add SSD journals for HDD OSDs on a new node, in a cluster whose existing OSDs keep their journals on HDD?
>>
>> On Sun, Apr 28, 2019 at 2:49 PM Paul Emmerich <paul.emmer...@croit.io> wrote:
>>> Looks like you've got lots of tiny objects. By default the recovery speed on HDDs is limited to 10 objects per second (40 with the DB on an SSD) per thread.
>>>
>>> Decrease osd_recovery_sleep_hdd (default 0.1) to increase recovery/backfill speed.
>>>
>>> Paul
>>>
>>> --
>>> Paul Emmerich
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH
>>> Freseniusstr. 31h
>>> 81247 München
>>> www.croit.io
>>> Tel: +49 89 1896585 90
>>>
>>> On Sun, Apr 28, 2019 at 6:57 AM Nikhil R <nikh.ravin...@gmail.com> wrote:
>>> > Hi,
>>> > I have set noout, noscrub and nodeep-scrub, and the last time we added OSDs we added a few at a time.
>>> > The main issue here is IOPS: the existing OSDs are not able to backfill at a higher rate - not even 1 thread during peak hours, and a maximum of 2 threads off-peak. We are getting more client I/O, and the documents being ingested amount to more than the space freed up by backfilling PGs to the newly added OSDs.
>>> > Below is our cluster health:
>>> >   health HEALTH_WARN
>>> >          5221 pgs backfill_wait
>>> >          31 pgs backfilling
>>> >          1453 pgs degraded
>>> >          4 pgs recovering
>>> >          1054 pgs recovery_wait
>>> >          1453 pgs stuck degraded
>>> >          6310 pgs stuck unclean
>>> >          384 pgs stuck undersized
>>> >          384 pgs undersized
>>> >          recovery 130823732/9142530156 objects degraded (1.431%)
>>> >          recovery 2446840943/9142530156 objects misplaced (26.763%)
>>> >          noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>>> >          mon.mon_1 store is getting too big! 26562 MB >= 15360 MB
>>> >          mon.mon_2 store is getting too big! 26828 MB >= 15360 MB
>>> >          mon.mon_3 store is getting too big! 26504 MB >= 15360 MB
>>> >   monmap e1: 3 mons at {mon_1=x.x.x.x:x.yyyy/0,mon_2=x.x.x.x:yyyy/0,mon_3=x.x.x.x:yyyy/0}
>>> >          election epoch 7996, quorum 0,1,2 mon_1,mon_2,mon_3
>>> >   osdmap e194833: 105 osds: 105 up, 105 in; 5931 remapped pgs
>>> >          flags noout,nobackfill,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>>> >   pgmap v48390703: 10536 pgs, 18 pools, 144 TB data, 2906 Mobjects
>>> >          475 TB used, 287 TB / 763 TB avail
>>> >          130823732/9142530156 objects degraded (1.431%)
>>> >          2446840943/9142530156 objects misplaced (26.763%)
>>> >          4851 active+remapped+wait_backfill
>>> >          4226 active+clean
>>> >          659 active+recovery_wait+degraded+remapped
>>> >          377 active+recovery_wait+degraded
>>> >          357 active+undersized+degraded+remapped+wait_backfill
>>> >          18 active+recovery_wait+undersized+degraded+remapped
>>> >          16 active+degraded+remapped+backfilling
>>> >          13 active+degraded+remapped+wait_backfill
>>> >          9 active+undersized+degraded+remapped+backfilling
>>> >          6 active+remapped+backfilling
>>> >          2 active+recovering+degraded
>>> >          2 active+recovering+degraded+remapped
>>> >   client io 11894 kB/s rd, 105 kB/s wr, 981 op/s rd, 72 op/s wr
>>> >
>>> > So, is it a good option to add new OSDs on a new node with SSDs as journals?
>>> > in.linkedin.com/in/nikhilravindra
>>> >
>>> > On Sun, Apr 28, 2019 at 6:05 AM Erik McCormick <emccorm...@cirrusseven.com> wrote:
>>> >> On Sat, Apr 27, 2019, 3:49 PM Nikhil R <nikh.ravin...@gmail.com> wrote:
>>> >>> We have baremetal nodes with 256GB RAM and 36-core CPUs.
>>> >>> We are on Ceph Jewel 10.2.9 with leveldb.
>>> >>> The OSDs and journals are on the same HDD.
>>> >>> We have backfill_max_active, recovery_max_active and recovery_op_priority all set to 1.
>>> >>> An OSD crashes and restarts once a PG is backfilled and the next PG tries to backfill. This is when we look at iostat and see the disk utilised up to 100%.
>>> >>
>>> >> I would set noout to prevent excess movement in the event of OSD flapping, and disable scrubbing and deep scrubbing until your backfilling has completed. I would also bring the new OSDs online a few at a time rather than all 25 at once if you add more servers.
>>> >>
>>> >>> Appreciate your help David
>>> >>>
>>> >>> On Sun, 28 Apr 2019 at 00:46, David C <dcsysengin...@gmail.com> wrote:
>>> >>>> On Sat, 27 Apr 2019, 18:50 Nikhil R, <nikh.ravin...@gmail.com> wrote:
>>> >>>>> Guys,
>>> >>>>> We now have a total of 105 OSDs on 5 baremetal nodes, each hosting 21 OSDs on 7TB HDDs with journals on HDD too. Each journal is about 5GB.
>>> >>>>
>>> >>>> This would imply you've got a separate HDD partition for journals. I don't think there's any value in that, and it would probably be detrimental to performance.
>>> >>>>
>>> >>>>> We expanded our cluster last week and added 1 more node with 21 HDDs and journals on the same disks.
>>> >>>>> Our client I/O is too heavy and we are not able to backfill even 1 thread during peak hours; if we backfill during peak hours, OSDs crash, causing undersized PGs, and if we have another OSD crash we won't be able to use our cluster due to undersized and recovering PGs. During non-peak hours we can backfill just 8-10 PGs.
>>> >>>>> Due to this our MAX AVAIL is draining out very fast.
>>> >>>>
>>> >>>> How much RAM have you got in your nodes? In my experience that's a common reason for crashing OSDs during recovery ops.
>>> >>>>
>>> >>>> What does your recovery and backfill tuning look like?
>>> >>>>
>>> >>>>> We are thinking of adding 2 more baremetal nodes with 21 x 7TB OSDs on HDD and adding 50GB SSD journals for these.
>>> >>>>> We aim to backfill from the 105 OSDs a bit faster and expect the backfill writes landing on these OSDs to be faster.
>>> >>>>
>>> >>>> SSD journals would certainly help, just be sure it's a model that performs well with Ceph.
>>> >>>>
>>> >>>>> Is this a good, viable idea?
>>> >>>>> Thoughts please?
>>> >>>>
>>> >>>> I'd recommend sharing more detail, e.g. full spec of the nodes, Ceph version etc.
>>> >>>>
>>> >>>>> -Nikhil
>>> >>>
>>> >>> --
>>> >>> Sent from my iPhone
>
--
Sent from my iPhone
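To make the recovery-tuning suggestions in the quoted thread above concrete, here is a minimal sketch of the commands involved. The values are illustrative only, runtime injection does not persist across restarts unless the settings also go into ceph.conf, and the osd_recovery_sleep_hdd option Paul mentions exists on Luminous and later; on Jewel the equivalent knob is osd_recovery_sleep. The option names behind the settings Nikhil lists are osd_max_backfills, osd_recovery_max_active and osd_recovery_op_priority.

    # flags Erik recommends while backfilling (already set on this cluster per the status above)
    ceph osd set noout
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Paul's suggestion: shorten the per-object recovery sleep to speed up backfill
    # (osd_recovery_sleep_hdd on Luminous+; plain osd_recovery_sleep on Jewel)
    ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.05'

    # the backfill/recovery throttles mentioned in the thread, set here to the values already in use
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'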
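And on the SSD-journal question itself: on a Jewel/filestore cluster, a new OSD with its journal on a separate SSD is typically prepared roughly as below. The device names are placeholders, the journal partition size comes from osd_journal_size (in MB) in ceph.conf, and on most Jewel setups udev activates the OSD automatically after prepare.

    # ceph.conf on the new node, roughly the 50 GB journal proposed above
    # [osd]
    # osd_journal_size = 51200

    # data on the HDD, journal partition created on the SSD (placeholder devices)
    ceph-disk prepare /dev/sdc /dev/sdb
    ceph-disk activate /dev/sdc1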
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com