Hi John,

I am trying to squeeze extra performance from my test cluster too:
Dell R620 with PERC H710, RAID0, 10Gb network.

Would you be willing to share your controller and kernel configuration?

For example, I am using the BIOS profile 'Performance' with the following
kernel parameters added to /etc/default/kernel:

intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
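
In case it helps to compare, this is roughly how I end up applying them on a
GRUB2-based install (just a sketch; the exact file and mkconfig path depend on
the distro):

# /etc/default/grub -- append to the existing kernel command line
GRUB_CMDLINE_LINUX="... intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll"

# regenerate the grub config, reboot, then confirm via /proc/cmdline
grub2-mkconfig -o /boot/grub2/grub.cfg
cat /proc/cmdline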

I also use the tuned profile throughput-performance.
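
That is applied and checked with tuned-adm:

tuned-adm profile throughput-performance
tuned-adm active    # verify the active profile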

All disks are configured with nr_requests=1024 and read_ahead_kb=4096.
SSDs use the noop scheduler while HDDs use deadline.
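
I set those per device through sysfs, roughly like this (sdX is just a
placeholder for the actual device):

echo 1024 > /sys/block/sdX/queue/nr_requests
echo 4096 > /sys/block/sdX/queue/read_ahead_kb
echo noop > /sys/block/sdX/queue/scheduler       # SSDs
echo deadline > /sys/block/sdX/queue/scheduler   # HDDs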

Cache policy for SSD:

megacli -LDSetProp -WT -Immediate -L0 -a0
megacli -LDSetProp -NORA -Immediate -L0 -a0
megacli -LDSetProp -Direct -Immediate -L0 -a0
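
To confirm the policies stuck I read them back with LDGetProp (same LD and
adapter numbering as above):

megacli -LDGetProp -Cache -L0 -a0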

HDD cache policy has all caches enabled: WB and ADRA.
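
That is roughly the inverse of the SSD commands above (the -L1 here is only a
placeholder for whichever logical drive the HDDs are on):

megacli -LDSetProp -WB -Immediate -L1 -a0
megacli -LDSetProp -ADRA -Immediate -L1 -a0
megacli -LDSetProp -Cached -Immediate -L1 -a0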

Many thanks

Steven



On 16 February 2018 at 19:06, John Petrini <jpetr...@coredial.com> wrote:

> I thought I'd follow up on this just in case anyone else experiences
> similar issues. We ended up increasing the tcmalloc thread cache size and
> saw a huge improvement in latency. This got us out of the woods because we
> were finally in a state where performance was good enough that it was no
> longer impacting services.
>
> The tcmalloc issues are pretty well documented on this mailing list and I
> don't believe they impact newer versions of Ceph but I thought I'd at least
> give a data point. After making this change our average apply latency
> dropped to 3.46ms during peak business hours. To give you an idea of how
> significant that is here's a graph of the apply latency prior to the
> change: https://imgur.com/KYUETvD
>
> This however did not resolve all of our issues. We were still seeing high
> iowait (repeated spikes up to 400ms) on three of our OSD nodes on all
> disks. We tried replacing the RAID controller (PERC H730) on these nodes
> and while this resolved the issue on one server the two others remained
> problematic. These two nodes were configured differently than the rest.
> They'd been configured in non-raid mode while the others were configured as
> individual raid-0. This turned out to be the problem. We ended up removing
> the two nodes one at a time and rebuilding them with their disks configured
> in independent raid-0 instead of non-raid. After this change iowait rarely
> spikes above 15ms and averages <1ms.
>
> I was really surprised at the performance impact when using non-raid mode.
> While I realize non-raid bypasses the controller cache I still would have
> never expected such high latency. Dell has a whitepaper that recommends
> using individual raid-0 but their own tests show only a small performance
> advantage over non-raid. Note that we are running SAS disks; they actually
> recommend non-raid mode for SATA but I have not tested this. You can view
> the whitepaper here: http://en.community.dell.com/
> techcenter/cloud/m/dell_cloud_resources/20442913/download
>
> I hope this helps someone.
>
> John Petrini
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
