On Sun, Sep 7, 2014 at 4:28 PM, Alex Moore <a...@lspeed.org> wrote:
> I recently found out about the "ceph --admin-daemon
> /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops" command, and noticed
> something unexpected in the output on my cluster, after checking numerous
> output samples...
>
> It looks to me like "normal" write ops on my cluster spend roughly:
>
> <1ms between "received_at" and "waiting_for_osdmap"
> <1ms between "waiting_for_osdmap" and "reached_pg"
> <15ms between "reached_pg" and "commit_sent"
> <15ms between "commit_sent" and "done"
>
> For reference, this is a small (3-host) all-SSD cluster, with monitors
> co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the
> journal, and 1 SSD for the OSD + monitor data (I initially had the monitor
> data on the same drive as the OS, but encountered performance problems,
> which have since been alleviated by moving the monitor data to the same
> drives as the OSDs). Networking is InfiniBand (an 8 Gbps dedicated
> point-to-point link between each pair of hosts). I'm running v0.80.5, and
> the OSDs use XFS.
>
> Anyway, as this command intentionally shows the worst few recent IOs, I only
> rarely see examples that match the above "norm". Rather, the typical
> outliers that it highlights are usually write IOs with ~100-300ms latency,
> where the extra latency exists purely between the "received_at" and
> "reached_pg" timestamps, and mostly in the "waiting_for_osdmap" step. Also
> it looks like these slow IOs come in batches. Every write IO arriving within
> the same ~1 second period will suffer from these strangely slow initial two
> steps, with the additional latency being almost identical for each one
> within the same batch. After that, things return to normal and those
> steps take <1ms again. So compared to the above "norm", these look more
> like:
>
> ~50ms between "received_at" and "waiting_for_osdmap"
> ~150ms between "waiting_for_osdmap" and "reached_pg"
> <15ms between "reached_pg" and "commit_sent"
> <15ms between "commit_sent" and "done"
>
> This seems unexpected to me. I don't see why those initial steps in the IO
> should ever take such a long time to complete. Where should I be looking
> next to track down the cause? I'm guessing that "waiting_for_osdmap"
> involves OSD<->Mon communication, and so perhaps indicates poor performance
> of the Mons. But for there to be any non-negligible delay between
> "received_at" and "waiting_for_osdmap" makes no sense to me at all.

First thing here is to explain what each of these events actually means.
"received_at" is the point at which we *started* reading the message
off the wire. We have to finish reading it off and dispatch it to the
OSD before the next one.
"waiting_for_osdmap" is slightly misnamed; it's the point at which the
op was submitted to the OSD. It's called that because receiving a
message with a newer OSDMap epoch than we have is the most common
long-term delay in this phase, but we also have to do some other
preprocessing and queue the Op up.
"reached_pg" is the point at which the Op is dequeued by a worker
thread and has the necessary mutexes to get processed. After this
point we're going to try and actually do the operations described
(reads or writes).
"commit_sent" indicates that we've actually sent back the commit to
the client or primary OSD.
"done" indicates that the op has been completed (commit_sent doesn't
wait for the op to have been applied to the backing filesystem; this
does).
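
If it helps, here's a rough sketch of how you could pull the per-event
latencies out of dump_historic_ops and see exactly where each op spends its
time. The JSON field names are assumptions based on firefly-era output
("Ops" vs "ops", the event list living inside "type_data", the timestamp
format), so check them against what your OSDs actually emit:

    #!/usr/bin/env python
    # Sketch: print per-event deltas for each op in dump_historic_ops.
    # Field names and timestamp format are assumptions -- verify against
    # the output of your own Ceph version before trusting the numbers.
    import json
    import subprocess
    from datetime import datetime

    SOCK = "/var/run/ceph/ceph-osd.0.asok"  # adjust to your OSD id

    dump = json.loads(subprocess.check_output(
        ["ceph", "--admin-daemon", SOCK, "dump_historic_ops"]))

    def parse_ts(s):
        # timestamps look like "2014-09-07 16:28:01.123456"
        return datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

    for op in dump.get("Ops", dump.get("ops", [])):
        td = op["type_data"]
        # firefly puts the event list last in the "type_data" array;
        # later releases use a dict with an "events" key instead
        events = td[-1] if isinstance(td, list) else td["events"]
        prev = parse_ts(op.get("received_at") or op.get("initiated_at"))
        print(op["description"][:70])
        for ev in events:
            t = parse_ts(ev["time"])
            print("  %-25s +%7.1f ms" % (ev["event"],
                                         (t - prev).total_seconds() * 1000))
            prev = t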

There are probably several causes for the behavior you're seeing, but
the most likely is that you occasionally get a whole bunch of
operations going to a single object/placement group, and they take
some time to process because they have to be serialized. That would
prevent the PG from handling newer ops while the old ones are still
being processed, and that could back up through the pipeline and slow
down the reads off the wire as well.
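
One quick way to check that theory is to see whether the slow ops in a batch
all name the same object (and hence the same PG). A rough sketch, assuming
the firefly-style description format ("osd_op(client.X.0:tid <object> [ops]
<pgid> ...)") where the second whitespace-separated token is the object
name:

    #!/usr/bin/env python
    # Sketch: count how many of the recent slow ops hit each object.
    # The description format is an assumption -- eyeball a few descriptions
    # from your own dump_historic_ops output and adjust the parsing if needed.
    import collections
    import json
    import subprocess

    SOCK = "/var/run/ceph/ceph-osd.0.asok"  # adjust to your OSD id

    dump = json.loads(subprocess.check_output(
        ["ceph", "--admin-daemon", SOCK, "dump_historic_ops"]))

    by_object = collections.Counter()
    for op in dump.get("Ops", dump.get("ops", [])):
        desc = op["description"]
        if desc.startswith("osd_op("):
            # assumed format: osd_op(client.X.0:tid <object> [ops] <pgid> ...)
            by_object[desc.split()[1]] += 1

    for obj, count in by_object.most_common(10):
        print("%3d ops  %s" % (count, obj))

If one object (or a handful of objects) dominates, "ceph osd map <pool>
<objectname>" will tell you which PG and OSDs it maps to.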
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
