Re: [ceph-users] Ceph cluster stability

M Ranga Swami Reddy Fri, 22 Feb 2019 04:14:38 -0800

But the Ceph recommendation is to use a VM (a HW node is not even
recommended). I will try changing the mon disk to SSD and moving to a HW node.


On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius <[email protected]> wrote:
>
> If you're using HDDs for your monitor servers, check their load. That might
> be the issue.
>
> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
> <[email protected]> wrote:
> >
> > The ceph-mon disk is a 500G HDD (no journals/SSDs). Yes, the mons use a
> > folder on a FS on a disk.
> >
> > On Fri, Feb 22, 2019 at 5:13 PM David Turner <[email protected]> wrote:
> > >
> > > Mon disks don't have journals, they're just a folder on a filesystem on a 
> > > disk.
> > >
> > > On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy <[email protected]> 
> > > wrote:
> > >>
> > >> The ceph mons look fine during the recovery. We are using HDDs with SSD
> > >> journals, with the recommended CPU and RAM numbers.
> > >>
> > >> On Fri, Feb 22, 2019 at 4:40 PM David Turner <[email protected]> 
> > >> wrote:
> > >> >
> > >> > What about the system stats on your mons during recovery? If they are 
> > >> > having a hard time keeping up with requests during a recovery, I could 
> > >> > see that impacting client io. What disks are they running on? CPU? Etc.
> > >> >
> > >> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> > >> > <[email protected]> wrote:
> > >> >>
> > >> >> We are using the default debug settings, like 1/5 and 0/5 for almost
> > >> >> all of them. Shall I try 0 for all debug settings?
> > >> >>
> > >> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> > >> >> <[email protected]> wrote:
> > >> >> >
> > >> >> > Hello,
> > >> >> >
> > >> >> >
> > >> >> > Check your CPU usage when you are doing those kinds of operations. We
> > >> >> > had a similar issue where our CPU monitoring was reporting fine (< 40%
> > >> >> > usage), but the load on the nodes was high, in the mid 60-80 range.
> > >> >> > If possible, try disabling HT and see the actual CPU usage.
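One way to see the gap between reported CPU usage and load is to divide the load average by the number of logical cores. The following is a minimal sketch only: the helper names and the 1.0 threshold are illustrative assumptions, not anything from this thread, and with HT enabled the logical core count overstates the real capacity.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of logical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def looks_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load {one_minute:.2f} on {cores} logical cores "
          f"-> {load_per_core(one_minute, cores):.2f} per core")
```

For example, a load of 60 on a hypothetical node with 40 logical cores gives 1.5 per core, which suggests the node is queueing work even if a CPU graph shows under 40% usage.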
> > >> >> > If you are hitting CPU limits you can try disabling crc on messages.
> > >> >> > ms_nocrc
> > >> >> > ms_crc_data
> > >> >> > ms_crc_header
> > >> >> >
> > >> >> > And setting all your debug messages to 0.
> > >> >> > If you haven't done so already, you can also lower your recovery
> > >> >> > settings a little.
> > >> >> > osd recovery max active
> > >> >> > osd max backfills
> > >> >> >
> > >> >> > You can also lower your file store threads.
> > >> >> > filestore op threads
> > >> >> >
> > >> >> >
> > >> >> > If you can, also switch from filestore to bluestore. This will also
> > >> >> > lower your CPU usage. I'm not sure that it is bluestore itself that
> > >> >> > does it, but I'm seeing lower CPU usage after moving to bluestore +
> > >> >> > rocksdb compared to filestore + leveldb.
> > >> >> >
> > >> >> >
> > >> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> > >> >> > <[email protected]> wrote:
> > >> >> > >
> > >> >> > > That's expected from Ceph by design. But in our case, we follow all
> > >> >> > > the recommendations (rack failure domain, replication n/w, etc.) and
> > >> >> > > still face client IO performance issues when one OSD is down.
> > >> >> > >
> > >> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> > >> >> > > <[email protected]> wrote:
> > >> >> > > >
> > >> >> > > > With a RACK failure domain, you should be able to have an 
> > >> >> > > > entire rack powered down without noticing any major impact on 
> > >> >> > > > the clients.  I regularly take down OSDs and nodes for 
> > >> >> > > > maintenance and upgrades without seeing any problems with 
> > >> >> > > > client IO.
> > >> >> > > >
> > >> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > >> >> > > > <[email protected]> wrote:
> > >> >> > > >>
> > >> >> > > >> Hello - I have a couple of questions on ceph cluster stability,
> > >> >> > > >> even though we follow all the recommendations below:
> > >> >> > > >> - Having separate replication n/w and data n/w
> > >> >> > > >> - RACK is the failure domain
> > >> >> > > >> - Using SSDs for journals (1:4 ratio)
> > >> >> > > >>
> > >> >> > > >> Q1 - If one OSD goes down, cluster IO drops drastically and
> > >> >> > > >> customer apps are impacted.
> > >> >> > > >> Q2 - What is the stability ratio? That is, with the setup above,
> > >> >> > > >> is the ceph cluster in a workable condition if one OSD or one
> > >> >> > > >> node is down, etc.?
> > >> >> > > >>
> > >> >> > > >> Thanks
> > >> >> > > >> Swami
> > >> >> > > >> _______________________________________________
> > >> >> > > >> ceph-users mailing list
> > >> >> > > >> [email protected]
> > >> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
