Re: [ceph-users] Ceph cluster stability

M Ranga Swami Reddy Mon, 25 Feb 2019 01:34:51 -0800

We have followed all the HW recommendations, but I forgot to mention that
our ceph mons are VMs with a good configuration (4 cores, 64G RAM + 500G disk).
Could this ceph-mon configuration cause issues?


On Sat, Feb 23, 2019 at 6:31 AM Anthony D'Atri <[email protected]> wrote:
>
>
> Did we start recommending that production mons run on a VM?  I'd be very
> hesitant to do that, though probably some folks do.
>
> I can say for sure that in the past (Firefly) I experienced outages related 
> to mons running on HDDs.  That was a cluster of 450 HDD OSDs with colo 
> journals and hundreds of RBD clients.  Something obscure about running out of 
> "global IDs" and not being able to create new ones fast enough.  We had to 
> work around with a combo of lease settings on the mons and clients, though 
> with Hammer and later I would not expect that exact situation to arise.  
> Still it left me paranoid about mon DBs and HDDs.
>
> -- aad
>
>
> >
> > But the Ceph recommendation is to use a VM (a HW node is not even
> > recommended). We will try moving the mons to SSDs and a HW node.
> >
> > On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius <[email protected]> 
> > wrote:
> >>
> >> If you're using HDDs for the monitor servers, check their load. That
> >> might be the issue.
> >>
> >> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
> >> <[email protected]> wrote:
> >>>
> >>> The ceph-mon disk is a 500G HDD (no journals/SSDs).  Yes, the mons use
> >>> a folder on a filesystem on a disk.
> >>>
> >>> On Fri, Feb 22, 2019 at 5:13 PM David Turner <[email protected]> 
> >>> wrote:
> >>>>
> >>>> Mon disks don't have journals, they're just a folder on a filesystem on 
> >>>> a disk.
> >>>>
> >>>> On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy <[email protected]> 
> >>>> wrote:
> >>>>>
> >>>>> The ceph mons look fine during recovery.  We are using HDDs with SSD
> >>>>> journals, with the recommended CPU and RAM numbers.
> >>>>>
> >>>>> On Fri, Feb 22, 2019 at 4:40 PM David Turner <[email protected]> 
> >>>>> wrote:
> >>>>>>
> >>>>>> What about the system stats on your mons during recovery? If they are 
> >>>>>> having a hard time keeping up with requests during a recovery, I could 
> >>>>>> see that impacting client io. What disks are they running on? CPU? Etc.
> >>>>>>
> >>>>>> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> >>>>>> <[email protected]> wrote:
> >>>>>>>
> >>>>>>> We are using the default debug settings, like 1/5 and 0/5 for almost
> >>>>>>> all subsystems. Shall I try 0 for all debug settings?
> >>>>>>>
> >>>>>>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> >>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Check your CPU usage when you are doing those kinds of operations. We
> >>>>>>>> had a similar issue where our CPU monitoring reported healthy usage
> >>>>>>>> (< 40%), but the load on the nodes was high, around 60-80. If
> >>>>>>>> possible, try disabling HT (hyper-threading) to see the actual CPU usage.
> >>>>>>>> If you are hitting CPU limits, you can try disabling CRC on messages:
> >>>>>>>> ms_nocrc
> >>>>>>>> ms_crc_data
> >>>>>>>> ms_crc_header
> >>>>>>>>
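The comparison described above (low CPU utilisation but high load) can be checked with a minimal sketch like the one below. It assumes a Unix host and Python 3; the helper names and the threshold of 1.0 task per logical core are illustrative assumptions, not anything from Ceph. On Linux the load average also counts tasks blocked in uninterruptible I/O wait, so a high value alongside low CPU usage may point at slow disks rather than at the CPU.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of logical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_saturated(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the load per logical core exceeds the threshold.

    The default threshold of 1.0 is an assumed rule of thumb.
    """
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    one_min, five_min, fifteen_min = os.getloadavg()
    # os.cpu_count() reports logical CPUs, so hyper-threads are included.
    cores = os.cpu_count() or 1
    print(f"1-min load: {one_min:.2f} on {cores} logical cores "
          f"({load_per_core(one_min, cores):.2f} per core)")
    if is_saturated(one_min, cores):
        print("Load exceeds the number of logical cores; check CPU and disk wait.")
```

Because the core count includes hyper-threads, a node with HT enabled can look less loaded per core than it really is, which matches the suggestion to test with HT disabled.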
> >>>>>>>> Also set all your debug messages to 0.
> >>>>>>>> If you haven't done so already, you can also lower your recovery
> >>>>>>>> settings a little:
> >>>>>>>> osd recovery max active
> >>>>>>>> osd max backfills
> >>>>>>>>
> >>>>>>>> You can also lower your filestore threads:
> >>>>>>>> filestore op threads
> >>>>>>>>
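As a rough illustration of the settings listed above, the sketch below renders a ceph.conf-style [osd] fragment from a dictionary. The helper name render_ceph_conf and every value shown are assumptions chosen for illustration, not tuned recommendations; check the option names and defaults against the documentation for your Ceph release before applying anything.

```python
def render_ceph_conf(section: str, options: dict) -> str:
    """Render one INI-style section in the form used by ceph.conf."""
    lines = [f"[{section}]"]
    for name, value in options.items():
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only (assumptions): throttle recovery and silence
# debug logging for two subsystems.
ILLUSTRATIVE_OSD_OPTIONS = {
    "osd max backfills": 1,
    "osd recovery max active": 1,
    "debug osd": "0/0",
    "debug ms": "0/0",
}

if __name__ == "__main__":
    print(render_ceph_conf("osd", ILLUSTRATIVE_OSD_OPTIONS), end="")
```

Debug levels are written as log level/in-memory level, so 0/0 turns off both for that subsystem.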
> >>>>>>>>
> >>>>>>>> If you can, also switch from filestore to bluestore. This should also
> >>>>>>>> lower your CPU usage. I'm not sure that it is bluestore itself that does
> >>>>>>>> it, but I'm seeing lower CPU usage after moving to bluestore + rocksdb
> >>>>>>>> compared to filestore + leveldb.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> That's expected from Ceph by design. But in our case, we follow all
> >>>>>>>>> the recommendations, like a rack failure domain, a replication n/w,
> >>>>>>>>> etc., and still face client IO performance issues when one OSD is down.
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> With a RACK failure domain, you should be able to have an entire 
> >>>>>>>>>> rack powered down without noticing any major impact on the 
> >>>>>>>>>> clients.  I regularly take down OSDs and nodes for maintenance and 
> >>>>>>>>>> upgrades without seeing any problems with client IO.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hello - I have a couple of questions on ceph cluster stability,
> >>>>>>>>>>> even though we follow all the recommendations, as below:
> >>>>>>>>>>> - Having separate replication n/w and data n/w
> >>>>>>>>>>> - RACK is the failure domain
> >>>>>>>>>>> - Using SSDs for journals (1:4 ratio)
> >>>>>>>>>>>
> >>>>>>>>>>> Q1 - If one OSD goes down, cluster IO drops drastically and customer
> >>>>>>>>>>> apps are impacted.
> >>>>>>>>>>> Q2 - What is the stability ratio? That is, with the above setup, does
> >>>>>>>>>>> the ceph cluster stay in a workable condition if one OSD or one node
> >>>>>>>>>>> goes down, etc.?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Swami
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>> [email protected]
> >>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
