Which model of hard drives do you have?
2014-03-14 21:59 GMT+04:00 Greg Poirier <[email protected]>:

> We are stressing these boxes pretty spectacularly at the moment.
>
> On every box I have one OSD that is pegged for IO almost constantly.
>
> ceph-1:
> Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdv        0.00    0.00  104.00  160.00  748.00  1000.00    13.24     1.15   4.36    9.46    1.05   3.70  97.60
>
> ceph-2:
> Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdq        0.00   25.00  109.00  218.00  844.00  1773.50    16.01     1.37   4.20    9.03    1.78   3.01  98.40
>
> ceph-3:
> Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdm        0.00    0.00  126.00   56.00  996.00   540.00    16.88     1.01   5.58    8.06    0.00   5.43  98.80
>
> These are all disks in my block storage pool.
>
> osdmap e26698: 102 osds: 102 up, 102 in
> pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects
>       28517 GB used, 65393 GB / 93911 GB avail
>       4624 active+clean
> client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s
>
> I don't see any SMART errors, but I'm slowly working my way through all of
> the disks on these machines with smartctl to see if anything stands out.
>
>
> On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum <[email protected]> wrote:
>
>> On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier <[email protected]> wrote:
>> > So, on the cluster that I _expect_ to be slow, it appears that we are
>> > waiting on journal commits. I want to make sure that I am reading this
>> > correctly:
>> >
>> >   "received_at": "2014-03-14 12:14:22.659170",
>> >
>> >   { "time": "2014-03-14 12:14:22.660191",
>> >     "event": "write_thread_in_journal_buffer"},
>> >
>> > At this point we have received the write and are attempting to write the
>> > transaction to the OSD's journal, yes?
>> >
>> > Then:
>> >
>> >   { "time": "2014-03-14 12:14:22.900779",
>> >     "event": "journaled_completion_queued"},
>> >
>> > 240ms later we have successfully written to the journal?
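[For reference, the ~240 ms figure can be checked directly from the two quoted event timestamps. A minimal sketch; the timestamp format is the one shown in the dump_historic_ops output above:]

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def interval_ms(start, end):
    """Millisecond interval between two dump_historic_ops timestamps."""
    t0 = datetime.strptime(start, FMT)
    t1 = datetime.strptime(end, FMT)
    return (t1 - t0).total_seconds() * 1000.0

# The two events quoted above:
buffered  = "2014-03-14 12:14:22.660191"  # write_thread_in_journal_buffer
journaled = "2014-03-14 12:14:22.900779"  # journaled_completion_queued

print("%.1f ms" % interval_ms(buffered, journaled))  # prints "240.6 ms"
```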
>>
>> Correct. That seems an awfully long time for a 16K write, although I
>> don't know how much data I have on co-located journals. (At least, I'm
>> assuming it's in the 16K range based on the others, although I'm just
>> now realizing that subops aren't providing that information... I've
>> created a ticket to include that diagnostic info in future.)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> > I expect this particular slowness is due to the colocation of journal and
>> > data on the same disk (and it's a spinning disk, not an SSD). I expect
>> > some of this could be alleviated by migrating journals to SSDs, but I am
>> > looking to rebuild in the near future, so I am willing to hobble along in
>> > the meantime.
>> >
>> > I am surprised that our all-SSD cluster is also underperforming. I am
>> > trying colocating the journal on the same disk with all SSDs at the
>> > moment and will see if the performance degradation is of the same nature.
>> >
>> >
>> > On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum <[email protected]> wrote:
>> >>
>> >> Right. So which is the interval that's taking all the time? Probably
>> >> it's waiting for the journal commit, but maybe there's something else
>> >> blocking progress. If it is the journal commit, check out how busy the
>> >> disk is (is it just saturated?) and what its normal performance
>> >> characteristics are (is it breaking?).
>> >> -Greg
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >>
>> >>
>> >> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier <[email protected]> wrote:
>> >> > Many of the sub ops look like this, with significant lag between
>> >> > received_at and commit_sent:
>> >> >
>> >> > { "description": "osd_op(client.6869831.0:1192491
>> >> >     rbd_data.67b14a2ae8944a.0000000000009105 [write 507904~3686400]
>> >> >     6.556a4db0 e660)",
>> >> >   "received_at": "2014-03-13 20:42:05.811936",
>> >> >   "age": "46.088198",
>> >> >   "duration": "0.038328",
>> >> >   <snip>
>> >> >   { "time": "2014-03-13 20:42:05.850215",
>> >> >     "event": "commit_sent"},
>> >> >   { "time": "2014-03-13 20:42:05.850264",
>> >> >     "event": "done"}]]},
>> >> >
>> >> > In this case almost 39ms between received_at and commit_sent.
>> >> >
>> >> > A particularly egregious example of 80+ms lag between received_at and
>> >> > commit_sent:
>> >> >
>> >> > { "description": "osd_op(client.6869831.0:1190526
>> >> >     rbd_data.67b14a2ae8944a.0000000000008fac [write 3325952~868352]
>> >> >     6.5255f5fd e660)",
>> >> >   "received_at": "2014-03-13 20:41:40.227813",
>> >> >   "age": "320.017087",
>> >> >   "duration": "0.086852",
>> >> >   <snip>
>> >> >   { "time": "2014-03-13 20:41:40.314633",
>> >> >     "event": "commit_sent"},
>> >> >   { "time": "2014-03-13 20:41:40.314665",
>> >> >     "event": "done"}]]},
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum <[email protected]> wrote:
>> >> >>
>> >> >> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier <[email protected]> wrote:
>> >> >> > We've been seeing this issue on all of our dumpling clusters, and I'm
>> >> >> > wondering what might be the cause of it.
>> >> >> >
>> >> >> > In dump_historic_ops, the time between op_applied and sub_op_commit_rec
>> >> >> > or the time between commit_sent and sub_op_applied is extremely high.
>> >> >> > Some of the osd_sub_ops are as long as 100 ms. A sample
>> >> >> > dump_historic_ops is included at the bottom.
>> >> >>
>> >> >> It's important to understand what each of those timestamps is
>> >> >> reporting.
>> >> >>
>> >> >> op_applied: the point at which an OSD has applied an operation to its
>> >> >> readable backing filesystem in-memory (which for xfs or ext4 will be
>> >> >> after it's committed to the journal)
>> >> >> sub_op_commit_rec: the point at which an OSD has gotten commits from
>> >> >> the replica OSDs
>> >> >> commit_sent: the point at which a replica OSD has sent a commit back
>> >> >> to its primary
>> >> >> sub_op_applied: the point at which a replica OSD has applied a
>> >> >> particular operation to its backing filesystem in-memory (again, after
>> >> >> the journal if using xfs)
>> >> >>
>> >> >> Reads are never served from replicas, so a long time between
>> >> >> commit_sent and sub_op_applied should not in itself be an issue. A lag
>> >> >> time between op_applied and sub_op_commit_rec means that the OSD is
>> >> >> waiting on its replicas. A long time there indicates either that the
>> >> >> replica is processing slowly, or that there's some issue in the
>> >> >> communications stack (all the way from the raw ethernet up to the
>> >> >> message handling in the OSD itself).
>> >> >> So the first thing to look for is sub ops which have a lag time
>> >> >> between the received_at and commit_sent timestamps. If none of those
>> >> >> ever turn up, but unusually long waits for sub_op_commit_rec are still
>> >> >> present, then it'll take more effort to correlate particular subops on
>> >> >> replicas with the op on the primary they correspond to, and see where
>> >> >> the time lag is coming into it.
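[The "first thing to look for" above is easy to script against dump_historic_ops output. A sketch under assumptions: the op structure below is a simplified, hypothetical view of the JSON (a flat "events" list per op, descriptions truncated), and the 20 ms threshold is arbitrary; the timestamps are the two subops quoted earlier in this thread:]

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def commit_lag_ms(op):
    """received_at -> commit_sent lag for one op, in milliseconds.

    `op` is assumed (simplified view) to be a dict with a "received_at"
    timestamp and a flattened "events" list of {"time", "event"} entries."""
    received = datetime.strptime(op["received_at"], FMT)
    for ev in op["events"]:
        if ev["event"] == "commit_sent":
            sent = datetime.strptime(ev["time"], FMT)
            return (sent - received).total_seconds() * 1000.0
    return None  # no commit_sent event recorded for this op

def slow_subops(ops, threshold_ms=20.0):
    """Ops whose received_at -> commit_sent lag exceeds threshold_ms, worst first."""
    lagged = [(commit_lag_ms(op), op["description"]) for op in ops]
    return sorted(((ms, d) for ms, d in lagged if ms is not None and ms > threshold_ms),
                  reverse=True)

# The two subops quoted earlier in the thread (descriptions truncated):
ops = [
    {"description": "osd_op(client.6869831.0:1192491 ...)",
     "received_at": "2014-03-13 20:42:05.811936",
     "events": [{"time": "2014-03-13 20:42:05.850215", "event": "commit_sent"}]},
    {"description": "osd_op(client.6869831.0:1190526 ...)",
     "received_at": "2014-03-13 20:41:40.227813",
     "events": [{"time": "2014-03-13 20:41:40.314633", "event": "commit_sent"}]},
]

for ms, desc in slow_subops(ops):
    print("%6.1f ms  %s" % (ms, desc))
```

[This reproduces the lags discussed above: about 38.3 ms for the first op ("almost 39ms") and 86.8 ms for the second ("80+ms"), sorted worst first.]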
>> >> >> -Greg
>> >> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Best regards,
Irek Fasikhov
Mobile: +79229045757
