Re: [ceph-users] Ceph Performance Questions with rbd images access by qemu-kvm

Wang, Warren Tue, 01 Sep 2015 10:06:34 -0700

Be selective with the SSDs you choose. I have personally tried the Micron M500DC, 
Intel S3500, and some PCIe cards, all of which would suffice. There are MANY that 
do not work well at all. A shockingly large list, in fact.

Intel 3500/3700 are the gold standards.

Warren

From: ceph-users [mailto:[email protected]] On Behalf Of 
Kenneth Van Alstyne
Sent: Tuesday, September 01, 2015 12:50 PM
To: Robert LeBlanc <[email protected]>
Cc: [email protected]
Subject: Re: [ceph-users] Ceph Performance Questions with rbd images access by 
qemu-kvm

Got it, I’ll keep that in mind. That may be just what I need to “get by” for 
now. Ultimately, we’re looking to buy at least three server nodes that can 
hold 40+ OSDs backed by 2TB+ SATA disks.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com<http://www.knightpoint.com>
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 20000 / ISO 27001

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

On Sep 1, 2015, at 11:26 AM, Robert LeBlanc 
<[email protected]<mailto:[email protected]>> wrote:

-----BEGIN PGP SIGNED MESSAGE-----

Hash: SHA256

Just swapping out spindles for SSDs will not give you the orders-of-magnitude 
performance gains that it does in regular cases. This is because Ceph has a lot 
of overhead for each I/O, which limits the performance of the SSDs. In my 
testing, two Intel S3500 SSDs with an 8-core Atom (Intel(R) Atom(TM) CPU C2750 
@ 2.40GHz), size=1, and fio with 8 jobs and QD=8 doing sync, direct 4K 
reads/writes produced 2,600 IOPS. Don't get me wrong, it will help, but don't 
expect spectacular results.
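
For anyone wanting to run a comparable test, the fio parameters described above (8 jobs, QD=8, sync, direct, 4K) can be sketched as a command line. This is only an approximation: the I/O engine (libaio), the random mixed read/write pattern, the job name, the runtime, and the target device are assumptions, since the message does not state them.

```python
import shlex


def fio_4k_sync_argv(target, runtime_s=60):
    """Build a fio command line for a small-block, sync, direct I/O test."""
    return [
        "fio",
        "--name=rbd-4k-sync",      # hypothetical job name
        f"--filename={target}",
        "--ioengine=libaio",       # assumption: async engine so iodepth applies
        "--rw=randrw",             # assumption: mixed random reads and writes
        "--bs=4k",                 # 4K blocks, as in the message
        "--numjobs=8",             # 8 jobs
        "--iodepth=8",             # QD=8
        "--direct=1",              # bypass the page cache
        "--sync=1",                # open the target for synchronous I/O
        f"--runtime={runtime_s}",  # assumption: the run length was not stated
        "--time_based",
        "--group_reporting",
    ]


if __name__ == "__main__":
    # /dev/rbd0 is a hypothetical mapped RBD image; fio will overwrite it.
    print(shlex.join(fio_4k_sync_argv("/dev/rbd0")))
```

Only point this at a scratch device or image, because the write phase destroys whatever is stored on the target.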

- ----------------

Robert LeBlanc

PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 8:01 AM, Kenneth Van Alstyne  wrote:

Thanks for the awesome advice folks.  Until I can go larger scale (50+ SATA 
disks), I’m thinking my best option here is to just swap out these 1TB SATA 
disks with 1TB SSDs.  Am I oversimplifying the short term solution?

Thanks,


On Aug 31, 2015, at 7:29 PM, Christian Balzer  wrote:

Hello,

On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:

In addition to the spot on comments by Warren and Quentin, verify this by

watching your nodes with atop, iostat, etc.

The culprit (HDDs) should be plainly visible.

More inline:

Christian, et al:

Sorry for the lack of information.  I wasn’t sure which of our hardware
specifications or Ceph configuration details would be useful at this
point.  Thanks for the feedback. Any feedback is appreciated at this
point, as I’ve been beating my head against a wall trying to figure out
what’s going on.  (If anything.  Maybe the spindle count is indeed our
upper limit or our SSDs really suck? :-) )

Your SSDs aren't the problem.

To directly address your questions, see answers below:

  - CBT is the Ceph Benchmarking Tool.  Since my question was more

generic rather than with CBT itself, it was probably more useful to post

in the ceph-users list rather than cbt.

  - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @

2.40GHz

Not your problem either.

  - The SSDs are indeed Intel S3500s.  I agree they are not ideal, but they
are supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
and longevity are quite low for an SSD, though, rated at about 400MB/s reads
and 100MB/s writes.  When we added these as journals in front of the
SATA spindles, both VM performance and rados benchmark numbers were
relatively unchanged.

The only thing relevant with regard to journal SSDs is the sequential write
speed (SYNC); they don't seek and normally don't get read either.
This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710,
which is faster in every other aspect but sequential writes. ^o^
Latency should have gone down with the SSD journals in place, but that's
their main function/benefit.
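
A rough way to see how a device copes with this pattern is to time small sequential writes on a file opened with O_DSYNC. The probe below is a minimal sketch, assuming Linux and Python 3; the function name, the 4 KB block size, the block count, and the temp-file target are illustrative choices and not taken from this thread or from Ceph. It gives a ballpark figure only.

```python
import os
import tempfile
import time


def dsync_write_mb_per_s(path, block_size=4096, count=256):
    """Write `count` blocks sequentially with O_DSYNC and return MB/s."""
    # Fall back to O_SYNC where O_DSYNC is not defined (assumes a Unix-like OS).
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag
    block = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            # Each write returns only after the data is on stable storage.
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return (block_size * count) / elapsed / 1e6


if __name__ == "__main__":
    # Assumption: the temp directory lives on the device you want to probe.
    with tempfile.TemporaryDirectory() as tmp:
        rate = dsync_write_mb_per_s(os.path.join(tmp, "journal-probe.bin"))
        print(f"O_DSYNC sequential 4K writes: {rate:.1f} MB/s")
```

To test a specific SSD, pass a path on a filesystem that resides on that SSD instead of the temp directory.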

  - Regarding throughput vs iops, indeed — the throughput that I’m

seeing is nearly worst case scenario, with all I/O being 4KB block

size.  With RBD cache enabled and the writeback option set in the VM

configuration, I was hoping more coalescing would occur, increasing the

I/O block size.

That can only help with non-SYNC writes, so your MySQL VMs and certain

file system ops will have to bypass that and that hurts.
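
The difference can be illustrated with a toy model. This is not librbd's actual cache logic, just a sketch under a simple assumption: contiguous buffered writes are merged into one larger I/O, while a sync write forces a flush and is issued on its own. The function name and the tuple layout are hypothetical.

```python
def coalesce(writes):
    """Toy writeback cache.

    `writes` is a list of (offset, length, is_sync) tuples. Returns the list
    of (offset, length) I/Os that would reach the backend.
    """
    issued = []
    pending = None  # (offset, length) currently buffered in the cache
    for offset, length, is_sync in writes:
        if is_sync:
            # A sync write cannot wait: flush the buffer, then issue it alone.
            if pending is not None:
                issued.append(pending)
                pending = None
            issued.append((offset, length))
        elif pending is not None and pending[0] + pending[1] == offset:
            # Contiguous with the buffered write: grow it instead of issuing.
            pending = (pending[0], pending[1] + length)
        else:
            if pending is not None:
                issued.append(pending)
            pending = (offset, length)
    if pending is not None:
        issued.append(pending)
    return issued


if __name__ == "__main__":
    buffered = [(i * 4096, 4096, False) for i in range(8)]
    synced = [(i * 4096, 4096, True) for i in range(8)]
    print(len(coalesce(buffered)), "backend I/O for 8 buffered 4K writes")
    print(len(coalesce(synced)), "backend I/Os for 8 sync 4K writes")
```

In this model eight buffered 4K writes collapse into a single 32K I/O, while eight sync writes stay as eight separate 4K I/Os.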

As an aside, the orchestration layer on top of KVM is OpenNebula if

that’s of any interest.

It is actually, as I've been eying OpenNebula (alas no Debian Jessie

packages). However not relevant to your problem indeed.

VM information:

  - Number = 15

  - Workload = Mixed (I know, I know, that’s as vague an answer
as they come).  A handful of VMs are running some MySQL databases and
some web applications in Apache Tomcat.  One is running a syslog
server.  Everything else is mostly static web page serving for a low
number of users.

As others have mentioned, would you expect this load to work well with

just 2 HDDs and via NFS to introduce network latency?

I can duplicate the blocked request issue pretty consistently, just by
running something simple like a “yum -y update” in one VM.  While that
is running, ceph -w and ceph -s show the following:

root@dashboard:~# ceph -s
    cluster f79d8c2a-3c14-49be-942d-83fc5f193a25
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
            election epoch 136, quorum 0,1,2 storage-1,storage-2,storage-3
     osdmap e75590: 6 osds: 6 up, 6 in
      pgmap v3495103: 224 pgs, 1 pools, 826 GB data, 225 kobjects
            2700 GB used, 2870 GB / 5571 GB avail
                 224 active+clean
  client io 3292 B/s rd, 2623 kB/s wr, 81 op/s

[snip]

466 kB/s rd, 1863 kB/s wr, 148 op/s

This is a good sample. Unless your reads can be satisfied from page cache
on your storage nodes or inside your VMs (more memory for the VMs may help
here), they are competing (seeks) with your write requests. So yeah, this
is probably as good as it gets.

I never seem to get anywhere near 300 op/s.  If spindle count is indeed

the problem, is there anything else I can do to improve caching or I/O

coalescing to deal with my crippling IOP limit due to the low number of

spindles?

Other than replacing spindles with SSDs, not really.

Your client workload is too mixed for anything else but that or massively

more spindles.

On the other hand, I have a cluster with very few OSDs (4!), hundreds of
VMs and typical activity like this: 11750 kB/s wr, 1426 op/s.
Note the lack of reads; all these VMs run the same OS/application and are
basically write only.

Adding to that the OSDs are actually RAIDs behind a 4GB controller cache

and thus the "disks" aren't busy at all.

However reads, like rebooting VMs, impact this cluster quite a bit.

Christian

Thanks,


On Aug 31, 2015, at 11:01 AM, Christian Balzer  wrote:

Hello,

On Mon, 31 Aug 2015 08:31:57 -0500 Kenneth Van Alstyne wrote:

Sorry about the repost from the cbt list, but it was suggested I post

here as well:

I wasn't even aware a CBT (what the heck does that acronym stand for?)

existed...

I am attempting to track down some performance issues in a Ceph

cluster recently deployed.  Our configuration is as follows: 3

storage nodes,

3 nodes is, of course, bare minimum.

each with:

         - 8 Cores

Of what, apples? Detailed information makes for better replies.

         - 64GB of RAM

Ample.

         - 2x 1TB 7200 RPM Spindle

Even if your cores were to be rotten apple ones, that's very few
spindles, so your CPU is unlikely to be the bottleneck.

         - 1x 120GB Intel SSD

Details, again. From your P.S. I conclude that these are S3500's,

definitely not my choice for journals when it comes to speed and

endurance.

         - 2x 10GBit NICs (In LACP Port-channel)

Massively overspec'ed considering your storage sinks/wells aka HDDs.

The OSD pool min_size is set to “1” and “size” is set to “3”.  When

creating a new pool and running RADOS benchmarks, performance isn’t

bad — about what I would expect from this hardware configuration:

Rados bench uses by default 4MB "blocks", which is the optimum size for

(default) RBD pools.

Bandwidth does not equal IOPS (which are commonly measured in 4KB

blocks).

WRITES:
Total writes made:      207
Write size:             4194304
Bandwidth (MB/sec):     80.017
Stddev Bandwidth:       34.9212
Max bandwidth (MB/sec): 120
Min bandwidth (MB/sec): 0
Average Latency:        0.797667
Stddev Latency:         0.313188
Max latency:            1.72237
Min latency:            0.253286

RAND READS:
Total time run:        10.127990
Total reads made:     1263
Read size:            4194304
Bandwidth (MB/sec):    498.816
Average Latency:       0.127821
Max latency:           0.464181
Min latency:           0.0220425
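
To make the bandwidth-versus-IOPS point concrete, the conversion is simple arithmetic. The sketch below assumes the MB in the rados bench output means MiB (2^20 bytes) and uses hypothetical helper names; the figures plugged in are the 80.017 MB/s write result above and the 300 op/s ceiling discussed in this thread.

```python
MIB = 1024 * 1024


def ops_per_second(bandwidth_mib_s, io_size_bytes):
    """Operations per second needed to sustain a bandwidth at a given I/O size."""
    return bandwidth_mib_s * MIB / io_size_bytes


def bandwidth_mib_s(iops, io_size_bytes):
    """Bandwidth in MiB/s produced by a given IOPS rate and I/O size."""
    return iops * io_size_bytes / MIB


if __name__ == "__main__":
    # 80.017 MB/s of 4 MiB writes is only about 20 operations per second.
    print(f"{ops_per_second(80.017, 4194304):.1f} ops/s at 4 MiB")
    # 300 IOPS of 4 KiB writes is only about 1.2 MiB/s.
    print(f"{bandwidth_mib_s(300, 4096):.2f} MiB/s at 4 KiB")
```

So a cluster that looks healthy at 80 MB/s with 4 MB objects is doing roughly 20 writes per second, which says little about how it will handle thousands of small 4 KB requests.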

This all looks fine, until we try to use the cluster for its purpose,
which is to house images for qemu-kvm, which are accessed using librbd.

Not that it probably matters, but knowing if this is OpenStack, Ganeti or
something else might be of interest.

I/O inside VMs has excessive I/O wait times (in the hundreds of ms at
times, making some operating systems, like Windows, unusable) and
throughput struggles to exceed 10MB/s.  Looking at ceph
health, we see very low op/s numbers and throughput, and the
number of blocked requests seems very high.  Any ideas as to what to look
at here?

Again, details.

How many VMs?

What are they doing?

Keep in mind that the BEST sustained result you could hope for here

(ignoring Ceph overhead and network latency) is the IOPS of 2 HDDs, so

about 300 IOPS at best. TOTAL.
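
The 300 figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes roughly 150 IOPS per 7200 RPM disk (the value implied by "300 for 2 HDDs") and that every client write has to land on `size` OSDs; the function name is hypothetical, and journal, network, and Ceph overhead are ignored.

```python
def client_write_iops(num_osds, iops_per_disk, replica_size):
    """Best-case client write IOPS when each write is stored on replica_size OSDs."""
    return num_osds * iops_per_disk / replica_size


if __name__ == "__main__":
    # 3 nodes x 2 HDDs = 6 OSDs, pool size 3, about 150 IOPS per spindle (assumed).
    print(client_write_iops(num_osds=6, iops_per_disk=150, replica_size=3))
```

With size=3, six spindles behave like two for writes, which is where the "IOPS of 2 HDDs" estimate comes from.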

   health HEALTH_WARN
          8 requests are blocked > 32 sec
   monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
          election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
   osdmap e69615: 6 osds: 6 up, 6 in
    pgmap v3148541: 224 pgs, 1 pools, 819 GB

256 or 512 PGs would have been the "correct" number here, but that's of

little importance.

data, 227 kobjects
          2726 GB used, 2844 GB / 5571 GB avail
               224 active+clean
client io 3957 B/s rd, 3494 kB/s wr, 30 op/s
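
The "256 or 512" remark about PGs matches the commonly cited rule of thumb: OSDs times a target number of PGs per OSD, divided by the pool size, rounded up to a power of two. The sketch below is only that rule of thumb; the function name is hypothetical, and the targets of 100 and 200 PGs per OSD are assumptions chosen to show how both values can arise.

```python
def suggested_pg_count(num_osds, replica_size, target_pgs_per_osd=100):
    """Round (OSDs * target PGs per OSD / replica size) up to a power of two."""
    raw = num_osds * target_pgs_per_osd / replica_size
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    # 6 OSDs, size=3: 200 raw at a target of 100, 400 raw at a target of 200.
    print(suggested_pg_count(6, 3))
    print(suggested_pg_count(6, 3, 200))
```

For this cluster the raw values of 200 and 400 round up to 256 and 512 respectively, so the configured 224 sits just below the lower suggestion.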

That's a lot of data being written for a tiny cluster like yours.

Looking at your nodes with atop or similar tools will likely reveal

that your HDDs are quite the busy beavers and can't keep up.

Also prolonged values from "ceph -w" might be educational.

Regards,

Christian

Of note, on the other list, I was asked to provide the following:

  - ceph version 0.94.1

(e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

  - The SSD is split into 8GB partitions. These 8GB partitions
are used as journal devices, specified in /etc/ceph/ceph.conf.  For
example:

[osd.0]
host = storage-1
osd journal = /dev/mapper/INTEL_SSDSC2BB120G4_CVWL4363006R120LGNp1

  - rbd_cache is enabled and qemu cache is set to “writeback”

  - rbd_concurrent_management_ops is unset, so it appears the
default is “10”

Thanks,


- --

Christian Balzer        Network/Systems Engineer

[email protected]<mailto:[email protected]>    Global OnLine Japan/Fusion Communications

http://www.gol.com/


_______________________________________________

ceph-users mailing list

[email protected]<mailto:[email protected]>

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-----BEGIN PGP SIGNATURE-----

Version: Mailvelope v1.0.2

Comment: https://www.mailvelope.com<https://www.mailvelope.com/>

wsFcBAEBCAAQBQJV5dGZCRDmVDuy+mK58QAAATYP/1rxceettpX0L591eZTq

q2zCQIgrQG11+aF4ibJhpOBmR+07+Bp+ohxERCuj5LYUxFfhzVq5sX9515Vi

GFt73l14TKkVSSQNOioETgHQxHKzl2lZAmkWLtwUHZf0xMk3r+W59EOMgUIn

1kgD+E0logMqK0N/+N24D7g4b+2ZjjPb8SIKbo40SQFQyrUxiuK1LvjlVf7m

dK9Jitl/b3wB82DxCUvwed0Fd4piLZpeNqMt6bAjAVsn015ThTYH6z9RfnIk

6oPbEsJbURj5ee6ljtmXGcTkWerIh8/FhEB7/bHyJ3VC6gK4ZPReoy4mR0KL

DMdeLO17WVUJdaayvX8+Pxqzb+PiQBsJ1L0CBg9IfOSSPDTIWzRmFsUFz8RD

ZTPs3eQxScJXIewNPchjHdrFfyUY1fbZYLKhKMSv9jcyz88TPzqnQt4pYJFJ

ocASuuF8dqq+30GKjYq4WV7dv2fHLlxrWQzlrAcI71I5HTfP8vU1Tsx/FmBu

GGItyEflBgQmvalR+tP+IuS3H8RatMvlljxwWsSjCipWaDFJZjrXvcJfKZh/

k+eZ6vBTjDAljk+95lMETw7x3AskEz1SLUuhOhIFC0E5Z+jgnBdSqeXZuAgJ

MZ1909J6V9vEVZYONFbtwDc35ShVH99Kh5tr+kmEQEEis7wlx1Ipfd4mNY0y

pmvZ

=52Nr

-----END PGP SIGNATURE-----
