G'day Alexandre

I'm not sure if this is causing your issues, but it could be contributing 
to them. 

I noticed you have 4 mons; this could be contributing to your problems. 
Because of the Paxos algorithm Ceph uses to achieve quorum among the mons, 
it's recommended to run an odd number of mons: 1, 3, 5, 7, etc.
It's also worth mentioning that running 4 mons still only tolerates the 
failure of 1 mon without an outage, the same as 3. 
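As a quick sketch of the arithmetic (quorum is a strict majority of the mons):

```shell
# Quorum needs a strict majority: floor(n/2) + 1 monitors alive.
# So the number of mon failures tolerated is n - (floor(n/2) + 1).
mon_failures_tolerated() {
  local n=$1
  echo $(( n - (n / 2 + 1) ))
}

for n in 3 4 5; do
  echo "$n mons: quorum needs $(( n / 2 + 1 )), tolerates $(mon_failures_tolerated "$n") failure(s)"
done
```

So a 4th mon adds a failure point without adding any failure tolerance over 3.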

Spec-wise the machines look pretty good; the only things I can see are the 
lack of SSD journals and the use of btrfs at this stage. 

You could try some iperf testing between the machines to make sure the 
networking is working as expected.
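For example, something along these lines (hostnames are placeholders, and the flags assume classic iperf rather than iperf3):

```shell
# On one OSD host, start an iperf server:
iperf -s

# From another host, test throughput to it:
# -t 30  run for 30 seconds
# -P 4   use 4 parallel streams
iperf -c osd-host1 -t 30 -P 4
```

On a dedicated 10GbE link you'd want to see something close to line rate; run it in both directions and between each pair of hosts.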

If you run rados benches for an extended time, what kind of stats do you see?

For example,

Write) 
ceph osd pool create benchmark1 XXXX XXXX
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32

* I suggest you create a 2nd benchmark pool and write to it for another 180 
seconds or so to ensure nothing is cached, then do a read test.

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks (the block size only applies to the 
write; the seq test reads back whatever object size was written):

rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
rados bench -p benchmark1 180 seq --concurrent-ios=32

As you may know, increasing the concurrent IOs will increase CPU/disk load.

XXXX = total PGs = OSDs * 100 / replicas
E.g. a 50-OSD system with 3 replicas would be around 1666 (often rounded to a 
power of two).
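As a worked example of that rule of thumb, including your own cluster's numbers (16 OSDs, assuming size 3):

```shell
# Rule of thumb: total PGs ~= (OSDs * 100) / replicas (integer division).
pg_count() {
  echo $(( $1 * 100 / $2 ))
}

pg_count 16 3   # your cluster: 16 OSDs, 3 replicas
pg_count 50 3   # the 50-OSD example above
```

So for your 16-OSD cluster the ballpark is ~533 PGs, commonly rounded to 512.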

Hope this helps a little,

Cheers,
Quenten Grasso


-----Original Message-----
From: ceph-users [mailto:[email protected]] On Behalf Of 
Bécholey Alexandre
Sent: Thursday, 25 September 2014 1:27 AM
To: [email protected]
Cc: Aviolat Romain
Subject: [ceph-users] IO wait spike in VM

Dear Ceph guru,

We have a Ceph cluster (version 0.80.5 
38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MON and 16 OSDs (4 per host) 
used as a backend storage for libvirt.

Hosts:
Ubuntu 14.04
CPU: 2 Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition 
for the journal, the rest for the OSD)
FS: btrfs (I know it's not recommended in the doc and I hope it's not the 
culprit)
Network: dedicated 10GbE

As we added some VMs to the cluster, we saw sporadic, huge IO wait in the 
VMs. The hosts running the OSDs seem fine.
I followed a similar discussion here: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html

Here is an example of a transaction that took some time:

        { "description": "osd_op(client.5275.0:262936 
rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248 ack+ondisk+write e3158)",
          "received_at": "2014-09-23 15:23:30.820958",
          "age": "108.329989",
          "duration": "5.814286",
          "type_data": [
                "commit sent; apply or cleanup",
                { "client": "client.5275",
                  "tid": 262936},
                [
                    { "time": "2014-09-23 15:23:30.821097",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-09-23 15:23:30.821282",
                      "event": "reached_pg"},
                    { "time": "2014-09-23 15:23:30.821384",
                      "event": "started"},
                    { "time": "2014-09-23 15:23:30.821401",
                      "event": "started"},
                    { "time": "2014-09-23 15:23:30.821459",
                      "event": "waiting for subops from 14"},
                    { "time": "2014-09-23 15:23:30.821561",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-09-23 15:23:30.821666",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-09-23 15:23:30.822591",
                      "event": "op_applied"},
                    { "time": "2014-09-23 15:23:30.824707",
                      "event": "sub_op_applied_rec"},
                    { "time": "2014-09-23 15:23:31.225157",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-09-23 15:23:31.225297",
                      "event": "op_commit"},
                    { "time": "2014-09-23 15:23:36.635085",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-23 15:23:36.635132",
                      "event": "commit_sent"},
                    { "time": "2014-09-23 15:23:36.635244",
                      "event": "done"}]]}

sub_op_commit_rec took about 5 seconds to complete, which makes me think that 
the replication is the bottleneck.

I didn't find any timeouts in Ceph's logs. The cluster is healthy. When some 
VMs have high IO wait, the cluster is doing between 2 and 40 op/s.
I would gladly provide more information if needed.

How can I dig deeper? For example is there a way to know to which OSD the 
replication is done for that specific transaction?
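The only lead I have so far is `ceph osd map`, which I guess should show the placement for the object in the trace above (pool name is a guess on my part):

```shell
# Map an RBD object to its PG and the OSDs serving it.
# The object name is taken from the op description in the trace above.
ceph osd map <poolname> rbd_data.22e42ae8944a.0000000000000807
# Should print the PG (3.c9699248 per the trace) and the up/acting OSD sets.
```

But I'm not sure whether the acting set it prints today necessarily matches the replica that was slow at the time of the trace.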

Cheers,
Alexandre Bécholey


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
