I recently found out about the "ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops" command, and noticed something unexpected in the output on my cluster, after checking numerous output samples...

It looks to me like "normal" write ops on my cluster spend roughly:

<1ms between "received_at" and "waiting_for_osdmap"
<1ms between "waiting_for_osdmap" and "reached_pg"
<15ms between "reached_pg" and "commit_sent"
<15ms between "commit_sent" and "done"

For reference, this is a small (3-host) all-SSD cluster, with monitors co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the journal, and 1 SSD for the OSD + monitor data (I initially had the monitor data on the same drive as the OS, but encountered performance problems - which have since been allieviated by moving the monitor data to the same drives as the OSDs. Networking is infiniband (8 Gbps dedicated point-to-point link between each pair of hosts). I'm running v0.80.5. And the OSDs use XFS.

Anyway, as this command intentionally shows the worst few recent IOs, I only rarely see examples that match the above "norm". Rather, the typical outliers that it highlights are usually write IOs with ~100-300ms latency, where the extra latency exists purely between the "received_at" and "reached_pg" timestamps, and mostly in the "waiting_for_osdmap" step. Also it looks like these slow IOs come in batches. Every write IO arriving within the same ~1 second period will suffer from these strangely slow initial two steps, with the additional latency being almost identical for each one within the same batch. After which things return to normal again in that those steps take <1ms. So compared to the above "norm", these look more like:

~50ms between "received_at" and "waiting_for_osdmap"
~150ms between "waiting_for_osdmap" and "reached_pg"
<15ms between "reached_pg" and "commit_sent"
<15ms between "commit_sent" and "done"

This seems unexpected to me. I don't see why those initial steps in the IO should ever take such a long time to complete. Where should I be looking next to track down the cause? I'm guessing that "waiting_for_osdmap" involves OSD<->Mon communication, and so perhaps indicates poor performance of the Mons. But for there to be any non-negligible delay between "received_at" and "waiting_for_osdmap" makes no sense to me at all.

Here are a couple of examples...

This is a rare example (from the perspective of the dump_historic_ops output) of what I believe to be a "more normal" write IO:

{ "description": "osd_op(client.10697.0:560068 rbd_data.1a1a2ae8944a.0000000000000722 [] 2.1ecc1a2a ack+ondisk+write e595)",
          "received_at": "2014-09-07 23:55:58.771015",
          "age": "9.101564",
          "duration": "0.010586",
          "type_data": [
                "commit sent; apply or cleanup",
                { "client": "client.10697",
                  "tid": 560068},
                [
                    { "time": "2014-09-07 23:55:58.771391",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-09-07 23:55:58.771895",
                      "event": "reached_pg"},
                    { "time": "2014-09-07 23:55:58.771978",
                      "event": "started"},
                    { "time": "2014-09-07 23:55:58.771994",
                      "event": "started"},
                    { "time": "2014-09-07 23:55:58.772041",
                      "event": "waiting for subops from 1,2"},
                    { "time": "2014-09-07 23:55:58.772210",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-09-07 23:55:58.775767",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-09-07 23:55:58.777814",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-09-07 23:55:58.777837",
                      "event": "op_commit"},
                    { "time": "2014-09-07 23:55:58.778504",
                      "event": "op_applied"},
                    { "time": "2014-09-07 23:55:58.779218",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-07 23:55:58.781480",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-07 23:55:58.781573",
                      "event": "commit_sent"},
                    { "time": "2014-09-07 23:55:58.781601",
                      "event": "done"}]]},

Whereas this is more typical of what I normally see in the output from "dump_historic_ops":

{ "description": "osd_op(client.10737.0:877375 rbd_data.25212ae8944a.00000000000001c0 [] 2.5493ce40 ack+ondisk+write e595)",
          "received_at": "2014-09-07 23:57:16.444904",
          "age": "84.769560",
          "duration": "0.269918",
          "type_data": [
                "commit sent; apply or cleanup",
                { "client": "client.10737",
                  "tid": 877375},
                [
                    { "time": "2014-09-07 23:57:16.520486",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-09-07 23:57:16.681982",
                      "event": "reached_pg"},
                    { "time": "2014-09-07 23:57:16.682030",
                      "event": "started"},
                    { "time": "2014-09-07 23:57:16.682049",
                      "event": "started"},
                    { "time": "2014-09-07 23:57:16.682119",
                      "event": "waiting for subops from 1,2"},
                    { "time": "2014-09-07 23:57:16.682292",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-09-07 23:57:16.682370",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-09-07 23:57:16.690089",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-09-07 23:57:16.691201",
                      "event": "op_commit"},
                    { "time": "2014-09-07 23:57:16.692730",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-07 23:57:16.700063",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-07 23:57:16.700105",
                      "event": "commit_sent"},
                    { "time": "2014-09-07 23:57:16.714792",
                      "event": "op_applied"},
                    { "time": "2014-09-07 23:57:16.714822",
                      "event": "done"}]]}]}

Alex
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to