I recently found out about the "ceph --admin-daemon
/var/run/ceph/ceph-osd.<id>.asok dump_historic_ops" command, and noticed
something unexpected in the output on my cluster, after checking
numerous output samples...
It looks to me like "normal" write ops on my cluster spend roughly:
<1ms between "received_at" and "waiting_for_osdmap"
<1ms between "waiting_for_osdmap" and "reached_pg"
<15ms between "reached_pg" and "commit_sent"
<15ms between "commit_sent" and "done"
For reference, this is a small (3-host) all-SSD cluster, with monitors
co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the
journal, and 1 SSD for the OSD + monitor data (I initially had the
monitor data on the same drive as the OS, but encountered performance
problems - which have since been allieviated by moving the monitor data
to the same drives as the OSDs. Networking is infiniband (8 Gbps
dedicated point-to-point link between each pair of hosts). I'm running
v0.80.5. And the OSDs use XFS.
Anyway, as this command intentionally shows the worst few recent IOs, I
only rarely see examples that match the above "norm". Rather, the
typical outliers that it highlights are usually write IOs with
~100-300ms latency, where the extra latency exists purely between the
"received_at" and "reached_pg" timestamps, and mostly in the
"waiting_for_osdmap" step. Also it looks like these slow IOs come in
batches. Every write IO arriving within the same ~1 second period will
suffer from these strangely slow initial two steps, with the additional
latency being almost identical for each one within the same batch. After
which things return to normal again in that those steps take <1ms. So
compared to the above "norm", these look more like:
~50ms between "received_at" and "waiting_for_osdmap"
~150ms between "waiting_for_osdmap" and "reached_pg"
<15ms between "reached_pg" and "commit_sent"
<15ms between "commit_sent" and "done"
This seems unexpected to me. I don't see why those initial steps in the
IO should ever take such a long time to complete. Where should I be
looking next to track down the cause? I'm guessing that
"waiting_for_osdmap" involves OSD<->Mon communication, and so perhaps
indicates poor performance of the Mons. But for there to be any
non-negligible delay between "received_at" and "waiting_for_osdmap"
makes no sense to me at all.
Here are a couple of examples...
This is a rare example (from the perspective of the dump_historic_ops
output) of what I believe to be a "more normal" write IO:
{ "description": "osd_op(client.10697.0:560068
rbd_data.1a1a2ae8944a.0000000000000722 [] 2.1ecc1a2a ack+ondisk+write
e595)",
"received_at": "2014-09-07 23:55:58.771015",
"age": "9.101564",
"duration": "0.010586",
"type_data": [
"commit sent; apply or cleanup",
{ "client": "client.10697",
"tid": 560068},
[
{ "time": "2014-09-07 23:55:58.771391",
"event": "waiting_for_osdmap"},
{ "time": "2014-09-07 23:55:58.771895",
"event": "reached_pg"},
{ "time": "2014-09-07 23:55:58.771978",
"event": "started"},
{ "time": "2014-09-07 23:55:58.771994",
"event": "started"},
{ "time": "2014-09-07 23:55:58.772041",
"event": "waiting for subops from 1,2"},
{ "time": "2014-09-07 23:55:58.772210",
"event": "commit_queued_for_journal_write"},
{ "time": "2014-09-07 23:55:58.775767",
"event": "write_thread_in_journal_buffer"},
{ "time": "2014-09-07 23:55:58.777814",
"event": "journaled_completion_queued"},
{ "time": "2014-09-07 23:55:58.777837",
"event": "op_commit"},
{ "time": "2014-09-07 23:55:58.778504",
"event": "op_applied"},
{ "time": "2014-09-07 23:55:58.779218",
"event": "sub_op_commit_rec"},
{ "time": "2014-09-07 23:55:58.781480",
"event": "sub_op_commit_rec"},
{ "time": "2014-09-07 23:55:58.781573",
"event": "commit_sent"},
{ "time": "2014-09-07 23:55:58.781601",
"event": "done"}]]},
Whereas this is more typical of what I normally see in the output from
"dump_historic_ops":
{ "description": "osd_op(client.10737.0:877375
rbd_data.25212ae8944a.00000000000001c0 [] 2.5493ce40 ack+ondisk+write
e595)",
"received_at": "2014-09-07 23:57:16.444904",
"age": "84.769560",
"duration": "0.269918",
"type_data": [
"commit sent; apply or cleanup",
{ "client": "client.10737",
"tid": 877375},
[
{ "time": "2014-09-07 23:57:16.520486",
"event": "waiting_for_osdmap"},
{ "time": "2014-09-07 23:57:16.681982",
"event": "reached_pg"},
{ "time": "2014-09-07 23:57:16.682030",
"event": "started"},
{ "time": "2014-09-07 23:57:16.682049",
"event": "started"},
{ "time": "2014-09-07 23:57:16.682119",
"event": "waiting for subops from 1,2"},
{ "time": "2014-09-07 23:57:16.682292",
"event": "commit_queued_for_journal_write"},
{ "time": "2014-09-07 23:57:16.682370",
"event": "write_thread_in_journal_buffer"},
{ "time": "2014-09-07 23:57:16.690089",
"event": "journaled_completion_queued"},
{ "time": "2014-09-07 23:57:16.691201",
"event": "op_commit"},
{ "time": "2014-09-07 23:57:16.692730",
"event": "sub_op_commit_rec"},
{ "time": "2014-09-07 23:57:16.700063",
"event": "sub_op_commit_rec"},
{ "time": "2014-09-07 23:57:16.700105",
"event": "commit_sent"},
{ "time": "2014-09-07 23:57:16.714792",
"event": "op_applied"},
{ "time": "2014-09-07 23:57:16.714822",
"event": "done"}]]}]}
Alex
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com