Hi
* What is your CPU utilization like? Are any cores close to
saturation?
* If you use fio to test a raw FC LUN (i.e. prior to adding it as an OSD)
from your host using random 4k blocks and a high queue depth (32 or more),
do you get high IOPS? What is the disk utilization? CPU utilization?
* If you repeat the above test but, instead of testing 1 LUN, run
concurrent fio tests on all 5 LUNs on the host, does the aggregate IOPS
performance scale x5? Any resource issues?
* Does increasing /sys/block/sdX/queue/nr_requests help?
* Can you use active/active multipath?
* If the above gives good performance/resource utilization, would you
get better performance with more than 20 OSDs/LUNs in total, for
example 40 or 60? That should not cost you anything.
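
The raw-LUN tests above could be sketched roughly as follows. The device
names (/dev/sdb through /dev/sdf) are hypothetical placeholders for your FC
LUNs; only run raw-device tests before a LUN is added as an OSD, and note
that any write workload against a raw device destroys its data:

```shell
# Random 4k reads on a single raw LUN, queue depth 32
# (device names are hypothetical -- substitute your actual FC LUNs):
fio --name=single-lun --filename=/dev/sdb --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 \
    --time_based --group_reporting

# Same test across all 5 LUNs concurrently (colon-separated device list);
# aggregate IOPS should scale roughly x5 if nothing else saturates:
fio --name=all-luns \
    --filename=/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based --group_reporting

# Watch disk and CPU utilization while fio runs:
iostat -xz 1

# Raise the block-layer queue depth for one LUN:
echo 1024 > /sys/block/sdb/queue/nr_requests

# Check whether multipath is active/active (a single "multibus" path
# group) rather than active/passive:
multipath -ll
```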
* I still think you can use a replica count of 1 in Ceph since your SAN
already has redundancy; it may be overkill to use both. I am not
trying to save space on the SAN but rather to reduce write latency on the
Ceph side.
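
If you did go that route, the change is just pool settings on the Ceph
side; a sketch, where the pool name "rbd" is a hypothetical placeholder:

```shell
# With size 1, Ceph keeps a single copy of each object and relies
# entirely on the SAN's own redundancy -- fewer replica writes, so
# lower write latency. min_size must be lowered before size.
ceph osd pool set rbd min_size 1
ceph osd pool set rbd size 1
```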
Maged
On 2018-07-06 20:19, Matthew Stroud wrote:
> Good to note about the replica set; we will stick with 3. We really aren't
> concerned about the capacity overhead, but about the additional IO that
> occurs during writes when there is an additional copy.
>
> To be clear, we aren't using Ceph in place of FC, nor the other way around.
> We have discovered that SAN storage is cheaper (this one was surprising to
> me) and performs better than direct-attached storage (DAS) at the small
> scale we are building at (20T to about 100T). I'm sure that would
> switch if we were much larger, but for now SAN is better. In summary, we
> are using the SAN pretty much as DAS, and Ceph uses those SAN disks for OSDs.
>
> The biggest issue we see is slow requests during rebuilds or node/OSD
> failures, while the disks and network just aren't being used to their
> fullest. That would lead me to believe that there are some host and/or OSD
> process bottlenecks going on. Other than that, just increasing the
> performance of our Ceph cluster would be a plus, and that is what I'm
> exploring.
>
> As for test numbers, I can't run those right now because the systems we
> have are in prod and I don't want to impact that for IO testing. However,
> we do have a new cluster coming online shortly; I could do some
> benchmarking there and get that back to you.
>
> However, if memory serves, we were only getting about 90-100k IOPS
> and about 15-50 ms latency with 10 servers running fio with a 50/50 mix of
> random and sequential workloads. With a single VM, we were getting about
> 14k IOPS with about 10-30 ms of latency.
>
> Thanks,
> Matthew Stroud
>
> On 7/6/18, 11:12 AM, "Vasu Kulkarni" <vakul...@redhat.com> wrote:
>
> On Fri, Jul 6, 2018 at 8:38 AM, Matthew Stroud <mattstr...@overstock.com>
> wrote:
>>
>> Thanks for the reply.
>>
>>
>>
>> Actually we are using Fibre Channel (it's so much more performant than
>> iSCSI in our tests) as the primary storage, and this is serving up traffic
>> for RBD for OpenStack, so this isn't for backups.
>>
>>
>>
>> Our biggest bottleneck is trying to utilize the host and/or OSD process
>> correctly. The disks are running at sub-millisecond latency, with about
>> 90% of the IO being served from the array's cache (i.e. not even hitting
>> the disks). According to the host, we never get north of 20% disk
>> utilization unless there is a deep scrub going on.
>>
>>
>>
>> We have debated dropping the replica size from 3 to 2. However, this
>> isn't much of a win for the Pure Storage array, which dedupes on the
>> backend, so extra copies of data are relatively free on that unit. A
>> replica of 1 wouldn't work because this is hosting a production workload.
>
> It is a mistake to use a replica count of 2 for production: when one of
> the copies is corrupted, it is hard to fix things. If you are concerned
> about storage overhead, there is an option to use EC pools in Luminous. To
> get back to your original question: if you are comparing network/disk
> utilization against FC numbers, that is the wrong comparison. They are two
> different storage systems with different purposes. Ceph is a scale-out
> object storage system, unlike FC systems; you can use commodity hardware
> and grow as you need, and you generally don't need HBA/FC-enclosed disks,
> though nothing stops you from using your existing system. You also
> generally don't need any RAID mirroring configuration in the backend,
> since Ceph will handle the redundancy for you. Scale-out systems have more
> work to do than traditional FC systems. There are minimal configuration
> options for bluestore. What kind of disk/network utilization slowdown are
> you seeing? Can you publish your numbers and test data?
>>
>> Thanks,
>>
>> Matthew Stroud
>>
>>
>> From: Maged Mokhtar <mmokh...@petasan.org>
>> Date: Friday, July 6, 2018 at 7:01 AM
>> To: Matthew Stroud <mattstr...@overstock.com>
>> Cc: ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Performance tuning for SAN SSD config
>>
>>
>> On 2018-06-29 18:30, Matthew Stroud wrote:
>>
>> We back some of our Ceph clusters with SAN SSD disks, particularly VSP
>> G/F and Pure Storage. I'm curious what settings we should look into
>> modifying to take advantage of our SAN arrays. We had to manually set the
>> device class for the LUNs to SSD, which was a big improvement. However,
>> we still see situations where we get slow requests while the underlying
>> disks and network are underutilized.
>>
>>
>> More info about our setup: we are running CentOS 7 with Luminous as our
>> Ceph release. We have 4 OSD nodes with 5x2TB disks each, set up as
>> bluestore. Our ceph.conf is attached with some information removed for
>> security reasons.
>>
>>
>> Thanks ahead of time.
>>
>> Thanks,
>> Matthew Stroud
>>
>> ________________________________
>>
>>
>> CONFIDENTIALITY NOTICE: This message is intended only for the use and review
>> of the individual or entity to which it is addressed and may contain
>> information that is privileged and confidential. If the reader of this
>> message is not the intended recipient, or the employee or agent responsible
>> for delivering the message solely to the intended recipient, you are hereby
>> notified that any dissemination, distribution or copying of this
>> communication is strictly prohibited. If you have received this
>> communication in error, please notify sender immediately by telephone or
>> return email. Thank you.
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> If I understand correctly, you are using LUNs (via iSCSI) from your
>> external SAN as OSDs, created a separate pool with these OSDs with device
>> class SSD, and are using this pool for backup.
>>
>> Some comments:
>>
>> * Using external disks as OSDs is probably not that common. It may be
>> better to keep the SAN and Ceph cluster separate and have your backup
>> tool access both; it will also be safer in case of a disaster to the
>> cluster, since your backup will be on a separate system.
>> * What backup tool/script are you using? It is better if this tool uses
>> a high queue depth, large block sizes, and memory/page cache to increase
>> performance during copies.
>> * To pin down where your current bottleneck is, I would run benchmarks
>> (e.g. fio) using the block sizes used by your backup tool, on the raw
>> LUNs before they are added as OSDs (as pure iSCSI disks) as well as on
>> both the main and backup pools. Have a resource tool (e.g.
>> atop/sysstat/collectl) running during these tests to check for resource
>> saturation: disks %busy, cores %busy, io_wait.
>> * You can probably use a replica count of 1 for the SAN OSDs since they
>> include their own RAID redundancy.
>>
>> Maged
>>
>>
>>
>