Hi,
please find attached the patch for ceph-0.48argonaut which adds the
timestamp code. The only binary which utilizes these timestamps
is the ceph-osd daemon.
Timestamps are kept in memory.
Configuration and extraction of timestamps are
achieved through the "ceph osd tell" interface.
Timestamps are only recorded for objects which match a
filtering string.
ceph tell osd.0 code_profile help
profiling data
code_profile operations:
  info: get summary info
  setfilter <regex>: set filter for object IDs
  completed-oids: object IDs for which complete profiling data is available
  pending-oids: object IDs for which incomplete profiling data is available
  get-completed [prefix]: get profiling data for completed objects
  get-primary-writes [prefix]: get condensed profiling data for primary write operations
  get-secondary-writes [prefix]: get condensed profiling data for replica write operations
  get-pending [prefix]: get profiling data for pending operations
  clear: free storage of profiled data and clear the object ID filter
In order to collect data for an object in a rados block device,
the object ID filter can be set like:
ceph tell osd.0 code_profile setfilter rb.0.0.*
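After the test load has run, the recorded data can be pulled off the OSD with
the other code_profile operations listed above. A typical sequence might look
like this (the prefix argument is just an example here):

    ceph tell osd.0 code_profile setfilter rb.0.0.*
    (run the write test against the rbd image)
    ceph tell osd.0 code_profile info
    ceph tell osd.0 code_profile get-completed rb.0.0
    ceph tell osd.0 code_profile clear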
The timestamp code adds its own logging subsystem and hence
is configurable via ceph.conf, like:
debug ITX = 2
Where a timestamp is recorded is specified in src/common/ITX.h, in the
enum itx_checkpoint.
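Conceptually, each checkpoint just records a wall-clock timestamp per object
ID and keeps it in memory until it is extracted or cleared. The following is
only a simplified sketch of that idea, not the actual patch code; the names
itx_sample, itx_data and itx_record() are placeholders:

    #include <map>
    #include <string>
    #include <vector>
    #include <time.h>

    enum itx_checkpoint {            // the real list lives in src/common/ITX.h
      ITX_OSD_DISPATCH4,             // just before the call to _dispatch()
      ITX_OSD_DISPATCH5,             // just after the call to _dispatch()
      // ... further checkpoints along the write path ...
    };

    struct itx_sample {
      itx_checkpoint cp;             // which checkpoint was hit
      struct timespec ts;            // when it was hit
    };

    // in-memory store, keyed by object ID; extracted via "ceph osd tell",
    // freed again via the "clear" operation
    static std::map<std::string, std::vector<itx_sample> > itx_data;

    static void itx_record(const std::string &oid, itx_checkpoint cp)
    {
      itx_sample s;
      s.cp = cp;
      clock_gettime(CLOCK_REALTIME, &s.ts);
      itx_data[oid].push_back(s);
    }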
Note that the patch reflects just a snapshot of the current
process of analysis. Example: checkpoints ITX_OSD_DISPATCH4 and ITX_OSD_DISPATCH5
have been used in
OSD::ms_dispatch(Message *m)
just around the call to _dispatch() and show that
most of the computing time required for dispatching is spent in this
routine.
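With the placeholder itx_record() from the sketch above, the placement of
these two checkpoints corresponds roughly to the following (again only an
illustration; the real ms_dispatch() does more, and oid_of() just stands for
however the patch extracts the object ID from the message):

    bool OSD::ms_dispatch(Message *m)
    {
      itx_record(oid_of(m), ITX_OSD_DISPATCH4);   // timestamp before dispatching
      _dispatch(m);
      itx_record(oid_of(m), ITX_OSD_DISPATCH5);   // timestamp after dispatching
      return true;
    }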
I therefore don't believe that throttling is causing the delays.
Best Regards
Andreas
On Thu, 16 Aug 2012 09:44:23 -0700
Yehuda Sadeh <[email protected]> wrote:
> On Thu, Aug 16, 2012 at 9:08 AM, Andreas Bluemle
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
> > to a high speed cluster network and encounter scalability problems:
> > the overall performance of the ceph cluster does not scale well
> > with an increase in the underlying networking speed.
> >
> > In short:
> >
> > I believe that the dispatching from SimpleMessenger to
> > OSD worker queues causes that scalability issue.
> >
> > Question: is it possible that this dispatching is causing
> > performance problems?
> >
> >
> > In detail:
> >
> > In order to find out more about this problem, I have added
> > profiling to the ceph code in various places; for write operations
> > to the primary or the secondary, timestamps are recorded for the OSD
> > object, offset and length of such a write request.
> >
> > Timestamps record:
> > - receipt time at SimpleMessenger
> > - processing time at osd
> > - for primary write operations: wait time until replication
> > operation is acknowledged.
>
> Did you make any code changes? We'd love to see those.
>
> >
> > What I believe is happening: dispatching requests from
> > SimpleMessenger to OSD worker threads seems to consume a fair
> > amount of time. This results in a widening gap between the receipt
> > of subsequent requests and the start of their processing by the OSD.
> >
> > A primary write suffers twice from this problem: first because
> > the delay happens on the primary OSD and second because the
> > replicating OSD also suffers from the same problem - and hence
> > causes additional delays
> > at the primary OSD when it waits for the commit from the
> > replicating OSD.
> >
> > In the attached graphics, the x-axis shows the time (in seconds).
> > The y-axis shows the offset where a request to write happened.
> >
> > The red bar represents the SimpleMessenger receive, i.e. from
> > reading the message header until enqueuing the completely decoded
> > message into the SimpleMessenger dispatch queue.
>
> Could it be that messages were throttled here?
> There's a configurable that can be set (ms dispatch throttle bytes),
> might affect that.
>
> >
> > The green bar represents the time required for local processing,
> > i.e. dispatching to the OSD worker, writing to the filesystem and
> > journal, and sending out the replication operation to the replicating OSD.
> > The right end of the green bar is the time when everything has
> > finished locally and a commit could happen.
> >
> > The blue bar represents the time until the replicating OSD has sent
> > a commit
> > back to the primary OSD and the original write request can be
> > committed to the client.
> >
> > The green bar is interrupted by a black bar: the left end represents
> > the time when the request has been enqueued on the OSD worker
> > queue. The right end gives the time when the request is taken off
> > the OSD worker queue and actual OSD processing starts.
> >
> > The test was a simple sequential write to a rados block device.
> >
> > Reception of the write requests at the OSD is also sequential in
> > the graphics: a bar closer to the bottom of the graphic corresponds
> > to an earlier write request.
> >
> > Note that the dispatching of a later request in all cases relates
> > to the enqueue time at the OSD worker queue of the previous write
> > request: the left
> > end of a black bar relates nicely to the beginning of a green bar
> > above it.
> >
> >
>
> Thanks,
> Yehuda
--
Andreas Bluemle mailto:[email protected]
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm