On Mon, Dec 22, 2014 at 10:30 AM, Wido den Hollander <[email protected]> wrote:
> For example, two ops:
>
> #1:
>
> { "description": "osd_sub_op(client.2433432.0:61603164 20.424
> 19038c24\/rbd_data.d7c912ae8944a.00000000000008b6\/head\/\/20 [] v
> 63283'8301089 snapset=0=[]:[] snapc=0=[])",
> "received_at": "2014-12-22 19:26:37.458680",
> "age": "2.719850",
> "duration": "2.520937",
> "type_data": [
> "commit sent; apply or cleanup",
> [
> { "time": "2014-12-22 19:26:37.458914",
> "event": "waiting_for_osdmap"},
> { "time": "2014-12-22 19:26:39.310569",
> "event": "reached_pg"},
> { "time": "2014-12-22 19:26:39.310728",
> "event": "started"},
> { "time": "2014-12-22 19:26:39.310951",
> "event": "started"},
> { "time": "2014-12-22 19:26:39.979292",
> "event": "commit_queued_for_journal_write"},
> { "time": "2014-12-22 19:26:39.979348",
> "event": "write_thread_in_journal_buffer"},
> { "time": "2014-12-22 19:26:39.979594",
> "event": "journaled_completion_queued"},
> { "time": "2014-12-22 19:26:39.979617",
> "event": "commit_sent"}]]},
>
> #2:
>
> { "description": "osd_sub_op(client.2188703.0:10420738 20.641
> 6673ee41\/rbd_data.9497e32794ff7.0000000000000454\/head\/\/20 [] v
> 63283'5215076 snapset=0=[]:[] snapc=0=[])",
> "received_at": "2014-12-22 19:26:38.040551",
> "age": "2.137979",
> "duration": "1.537128",
> "type_data": [
> "started",
> [
> { "time": "2014-12-22 19:26:38.040717",
> "event": "waiting_for_osdmap"},
> { "time": "2014-12-22 19:26:39.577609",
> "event": "reached_pg"},
> { "time": "2014-12-22 19:26:39.577624",
> "event": "started"},
> { "time": "2014-12-22 19:26:39.577679",
> "event": "started"}]]},
Oh, yep: in Firefly an op also stays in the waiting_for_osdmap state
while it's sitting in the PG work queue. Whoops...
So this is probably just general slowness filling up the work queue.
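If you want to see exactly where the time goes, something like the
following rough, untested sketch will print the delta between
consecutive events for each op. It assumes you've saved the
admin-socket output to a file first (the osd number and filename are
just placeholders), and it tries both "ops" and "Ops" for the op list
key since that varies between releases:

#!/usr/bin/env python
# Rough sketch: show how long each op spends between events, e.g. the
# ~1.85s and ~1.5s gaps before "reached_pg" in the two ops above.
# Assumes something like:
#   ceph daemon osd.12 dump_historic_ops > ops.json
import json
import sys
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def ts(s):
    return datetime.strptime(s, FMT)

with open(sys.argv[1]) as f:
    dump = json.load(f)

for op in dump.get("ops") or dump.get("Ops") or []:
    state, events = op["type_data"][0], op["type_data"][1]
    print("%s (%s, duration %s)" % (op["description"][:50], state, op["duration"]))
    prev = ts(op["received_at"])
    for ev in events:
        now = ts(ev["time"])
        # A big jump right before "reached_pg" means the op sat in the
        # PG work queue rather than in the journal/filestore path.
        print("  %-40s +%.3fs" % (ev["event"], (now - prev).total_seconds()))
        prev = now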
> Can this be something which has to do with the amount of RBD snapshots?
> Since I see snapc involved in both ops?
It could conceivably have something to do with snapshots, but if it
does, the presence of "snapc" isn't an indicator; that field is always
printed, and here it's just showing the default (empty) value. :)
If you're seeing disks at 100% utilization, I think things are just
getting a little backed up. You could also check the distribution of
incoming operations across PGs; if, for example, a flood of ops is
hitting a single object, that could also cause issues.
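Something like this quick, untested sketch would give you a rough
count per PG and per object from the in-flight dump. The osd number
and filename are placeholders again, and the regex only understands
the osd_sub_op descriptions shown above, so extend the pattern if you
want other op types counted too:

#!/usr/bin/env python
# Rough sketch: tally in-flight ops per PG and per object to spot a
# hot spot. Assumes something like:
#   ceph daemon osd.12 dump_ops_in_flight > inflight.json
import json
import re
import sys
from collections import Counter

# Matches descriptions like:
#   osd_sub_op(client.2433432.0:61603164 20.424 19038c24/rbd_data.../head//20 ...)
desc_re = re.compile(r"osd_sub_op\(\S+ (?P<pg>\S+) (?P<oid>\S+)")

with open(sys.argv[1]) as f:
    dump = json.load(f)

per_pg = Counter()
per_obj = Counter()
for op in dump.get("ops") or dump.get("Ops") or []:
    m = desc_re.match(op["description"])
    if not m:
        continue
    per_pg[m.group("pg")] += 1
    per_obj[m.group("oid")] += 1

print("busiest PGs:    ", per_pg.most_common(5))
print("busiest objects:", per_obj.most_common(5))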
-Greg