A couple of suggestions:
1) The number of PGs per OSD should be 100-200.
2) With SSD or flash, performance hinges on how you partition the devices
and how you tune Linux:
a) If using partitions, did you align them on a 4k boundary? I start
at sector 2048 using either fdisk or sfdisk.
b) Quite a few Linux settings benefit SSD/flash: the deadline I/O
scheduler (but only when also using the deadline-associated settings),
raising QDepth to 512 or 1024, setting rq_affinity=2 if the OS allows it,
setting read-ahead if the workload is mostly reads, and others.
3) Mount options: noatime, delaylog, inode64, noquota, etc.
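
For anyone who wants to check points 2a and 2b on a running box, here is a
minimal Python sketch. It assumes the usual Linux sysfs layout
(/sys/block/<dev>/queue/... and /sys/block/<dev>/<part>/start, where the
start is reported in 512-byte sectors) and that nr_requests is the queue
depth knob meant above. The function names are made up for illustration
and do not come from any existing tool.

```python
from pathlib import Path

SECTOR_BYTES = 512  # sysfs reports partition starts in 512-byte units


def is_aligned(start_sector, boundary_bytes=4096):
    """True if a partition starting at start_sector sits on the boundary."""
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


def partition_start(dev, part, sysfs_root="/sys/block"):
    """Start sector of a partition, e.g. dev='sda', part='sda1'."""
    return int((Path(sysfs_root) / dev / part / "start").read_text())


def read_queue_settings(dev, sysfs_root="/sys/block"):
    """Return the queue settings discussed above for one block device.

    Files that do not exist on this kernel are reported as None.
    """
    queue = Path(sysfs_root) / dev / "queue"
    settings = {}
    for name in ("scheduler", "nr_requests", "rq_affinity", "read_ahead_kb"):
        path = queue / name
        settings[name] = path.read_text().strip() if path.is_file() else None
    return settings
```

A start at sector 2048 (1 MiB) passes is_aligned(); the old DOS default of
sector 63 does not. The active scheduler is the name shown in square
brackets in the "scheduler" value.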
I have written some papers/blogs on this subject if you are interested in
seeing them.
Rick
> On Mar 3, 2016, at 2:41 AM, Adrian Saul <[email protected]> wrote:
>
> Hi Ceph-users,
>
> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD
> journals has higher than desired write latencies for RBD devices. Any ideas?
>
>
> I am developing a storage system based on Ceph and an SCST+pacemaker
> cluster. Our initial testing showed promising results even with mixed
> available hardware and we proceeded to order a purpose-designed platform for
> developing into production. The hardware is:
>
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using
> RBD - they present iSCSI to other systems).
> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo
> SSDs each
> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
>
> As part of the research and planning we opted to put a pair of Intel DC P3700
> 400G NVMe cards in each OSD server. These are configured mirrored and set up
> as the journals for the OSD disks, the aim being to improve write latencies.
> All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated
> 10G NICs back to a common pair of switches. All machines are running CentOS
> 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD
> kernel module.
>
> On the Ceph side each disk in the OSD servers is set up as an individual OSD,
> with a 12G journal created on the flash mirror. I set up the SSD servers
> into one root, and the SATA servers into another and created pools using
> hosts as fault boundaries, with the pools set for 2 copies. I created the
> pools with the pg_num and pgp_num set to 32x the number of OSDs in the pool.
> On the frontends we create RBD devices and present them as iSCSI LUNs using
> SCST to clients - in this test case a Solaris host.
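
To put numbers on the pg_num choice described above, here is a rough sketch
of the PGs-per-OSD arithmetic. It assumes the SSD pool spans all 12 SSD OSDs
(3 servers x 4 SSDs), the SATA pool all 18 SATA OSDs (3 x 6), and one pool
per root. The helper name is hypothetical.

```python
def pg_copies_per_osd(pg_num, replicas, num_osds):
    """Average number of PG copies each OSD holds for one pool."""
    return pg_num * replicas / num_osds


for label, num_osds in (("ssd", 3 * 4), ("sata", 3 * 6)):
    pg_num = 32 * num_osds  # pg_num set to 32x the number of OSDs
    per_osd = pg_copies_per_osd(pg_num, replicas=2, num_osds=num_osds)
    print(f"{label}: pg_num={pg_num}, PG copies per OSD={per_osd:.0f}")
```

Under those assumptions both pools come out at about 64 PG copies per OSD.
If the 100-200 target counts replicas, as the usual Ceph guidance does, that
is below the suggested range.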
>
> The problem I have is that even with a lightly loaded system the service
> times for writes to the LUNs are just not getting down to where we want them,
> and they are not very stable - with 5 LUNs doing around 200 32K IOPS
> consistently the service times sit at around 3-4ms, but regularly (every
> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5
> minutes. I fully expected we would have some latencies due to the
> distributed and networked nature of Ceph, but in this instance I just cannot
> find where these latencies are coming from, especially with the SSD based
> pool and having flash based journaling.
>
> - The RBD devices show relatively low service times, but high queue times.
> These are in line with what Solaris sees so I don't think SCST/iSCSI is
> adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine with
> any bursts
> - The SSDs do show similar latency variations with writes - bursting up to
> 12ms or more whenever there are high write workloads.
> - I have tried applying what tuning I can to the SSD block devices (noop
> scheduler etc) - no difference
> - I have removed any sort of smarts around IO grouping in SCST - no major
> impact
> - I have tried tuning up filestore queue and wbthrottle values but could
> not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait and
> I can do benchmarks up over 1GB/s in some tests. Write throughput can also
> be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI client
> block sizes (i.e. 32K, 128K instead of 4M) but it seemed to make things worse.
> I would have thought better alignment would reduce latency but is that
> offset by the extra overhead in object work?
>
> What I am looking for is what other areas I should look at, or what
> diagnostics I need, to work this out. We would really like to use Ceph across
> a mixed workload that includes some DB systems that are fairly latency
> sensitive, but as it stands it's hard to be confident in the performance when a fairly quiet
> unloaded system seems to struggle, even with all this hardware behind it. I
> get the impression that the SSD write latencies might be coming into play as
> they are similar to the numbers I see, but really for writes I would expect
> them to be "hidden" behind the journaling.
>
> I also would have thought that being not under load and with the flash
> journals the only latency would be coming from mapping calculations on the
> client or otherwise some contention within the RBD module itself. Any ideas
> how I can break out what the times are for what the RBD module is doing?
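
One way to start breaking the times down, at least on the OSD side, is the
admin socket: `ceph daemon osd.<id> perf dump` returns JSON in which latency
counters appear as an avgcount/sum pair. The sketch below walks such a dump
and turns every pair into an average in milliseconds. It assumes sum is in
seconds, and the counter names in the sample (op_w_latency, journal_latency)
are only illustrative of what a filestore OSD reports, so check the output of
your own version. It does not cover time spent inside the RBD kernel module.

```python
import json


def average_latencies_ms(perf_dump):
    """Map 'section.counter' -> average latency in ms for avgcount/sum pairs."""
    averages = {}
    for section, counters in perf_dump.items():
        if not isinstance(counters, dict):
            continue
        for name, value in counters.items():
            is_pair = isinstance(value, dict) and {"avgcount", "sum"} <= value.keys()
            if is_pair and value["avgcount"] > 0:
                key = f"{section}.{name}"
                averages[key] = 1000.0 * value["sum"] / value["avgcount"]
    return averages


# Hand-written sample in the shape of a perf dump, not real measurements.
sample = json.loads("""
{"osd": {"op_w_latency": {"avgcount": 200, "sum": 0.8}},
 "filestore": {"journal_latency": {"avgcount": 200, "sum": 0.004}}}
""")
for key, ms in sorted(average_latencies_ms(sample).items()):
    print(f"{key}: {ms:.2f} ms")
```

Comparing the per-OSD write latency against the journal latency on each OSD
should show whether the time is going to the journal, the backing SSD, or
somewhere upstream of the OSD.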
>
> Any help appreciated.
>
> As an aside - I think Ceph as a concept is exactly what a storage system
> should be about, hence why we are using it this way. It's been awesome to get
> stuck into it and learn how it works and what it can do.
>
> Adrian Saul | Infrastructure Projects Team Lead
> TPG Telecom (ASX: TPM)
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com