Which model of hard drives do you have?
2014-03-14 21:59 GMT+04:00 Greg Poirier <[email protected]>:

> We are stressing these boxes pretty spectacularly at the moment.
>
> On every box I have one OSD that is pegged for IO almost constantly.
>
> ceph-1:
> Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdv        0.00    0.00  104.00  160.00  748.00  1000.00    13.24     1.15   4.36    9.46    1.05   3.70  97.60
>
> ceph-2:
> Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdq        0.00   25.00  109.00  218.00  844.00  1773.50    16.01     1.37   4.20    9.03    1.78   3.01  98.40
>
> ceph-3:
> Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdm        0.00    0.00  126.00   56.00  996.00   540.00    16.88     1.01   5.58    8.06    0.00   5.43  98.80
>
> These are all disks in my block storage pool.
>
> osdmap e26698: 102 osds: 102 up, 102 in
> pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects
>       28517 GB used, 65393 GB / 93911 GB avail
>       4624 active+clean
> client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s
>
> I don't see any SMART errors, but I'm slowly working my way through all of
> the disks on these machines with smartctl to see if anything stands out.
>
>
> On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum <[email protected]> wrote:
>
>> On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier <[email protected]> wrote:
>> > So, on the cluster that I _expect_ to be slow, it appears that we are
>> > waiting on journal commits. I want to make sure that I am reading this
>> > correctly:
>> >
>> >   "received_at": "2014-03-14 12:14:22.659170",
>> >
>> >   { "time": "2014-03-14 12:14:22.660191",
>> >     "event": "write_thread_in_journal_buffer"},
>> >
>> > At this point we have received the write and are attempting to write the
>> > transaction to the OSD's journal, yes?
>> >
>> > Then:
>> >
>> >   { "time": "2014-03-14 12:14:22.900779",
>> >     "event": "journaled_completion_queued"},
>> >
>> > 240ms later we have successfully written to the journal?
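[For reference, the ~240 ms figure can be checked directly from the two quoted event timestamps. A minimal sketch; the timestamp format is the one shown in the dump_historic_ops output above:]

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def interval_ms(start, end):
    """Millisecond interval between two dump_historic_ops timestamps."""
    t0 = datetime.strptime(start, FMT)
    t1 = datetime.strptime(end, FMT)
    return (t1 - t0).total_seconds() * 1000.0

# The two events quoted above:
buffered  = "2014-03-14 12:14:22.660191"  # write_thread_in_journal_buffer
journaled = "2014-03-14 12:14:22.900779"  # journaled_completion_queued

print("%.1f ms" % interval_ms(buffered, journaled))  # prints "240.6 ms"
```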
>>
>> Correct. That seems an awfully long time for a 16K write, although I
>> don't know how much data I have on co-located journals. (At least, I'm
>> assuming it's in the 16K range based on the others, although I'm just
>> now realizing that subops aren't providing that information... I've
>> created a ticket to include that diagnostic info in future.)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> > I expect this particular slowness is due to the colocation of journal and
>> > data on the same disk (and it's a spinning disk, not an SSD). I expect
>> > some of this could be alleviated by migrating journals to SSDs, but I am
>> > looking to rebuild in the near future, so I am willing to hobble along in
>> > the meantime.
>> >
>> > I am surprised that our all-SSD cluster is also underperforming. I am
>> > trying colocating the journal on the same disk with all SSDs at the
>> > moment and will see if the performance degradation is of the same nature.
>> >
>> >
>> > On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum <[email protected]> wrote:
>> >>
>> >> Right. So which is the interval that's taking all the time? Probably
>> >> it's waiting for the journal commit, but maybe there's something else
>> >> blocking progress. If it is the journal commit, check out how busy the
>> >> disk is (is it just saturated?) and what its normal performance
>> >> characteristics are (is it breaking?).
>> >> -Greg
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >>
>> >>
>> >> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier <[email protected]> wrote:
>> >> > Many of the sub ops look like this, with significant lag between
>> >> > received_at and commit_sent:
>> >> >
>> >> > { "description": "osd_op(client.6869831.0:1192491
>> >> >     rbd_data.67b14a2ae8944a.0000000000009105 [write 507904~3686400]
>> >> >     6.556a4db0 e660)",
>> >> >   "received_at": "2014-03-13 20:42:05.811936",
>> >> >   "age": "46.088198",
>> >> >   "duration": "0.038328",
>> >> >   <snip>
>> >> >   { "time": "2014-03-13 20:42:05.850215",
>> >> >     "event": "commit_sent"},
>> >> >   { "time": "2014-03-13 20:42:05.850264",
>> >> >     "event": "done"}]]},
>> >> >
>> >> > In this case almost 39ms between received_at and commit_sent.
>> >> >
>> >> > A particularly egregious example of 80+ms lag between received_at and
>> >> > commit_sent:
>> >> >
>> >> > { "description": "osd_op(client.6869831.0:1190526
>> >> >     rbd_data.67b14a2ae8944a.0000000000008fac [write 3325952~868352]
>> >> >     6.5255f5fd e660)",
>> >> >   "received_at": "2014-03-13 20:41:40.227813",
>> >> >   "age": "320.017087",
>> >> >   "duration": "0.086852",
>> >> >   <snip>
>> >> >   { "time": "2014-03-13 20:41:40.314633",
>> >> >     "event": "commit_sent"},
>> >> >   { "time": "2014-03-13 20:41:40.314665",
>> >> >     "event": "done"}]]},
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum <[email protected]> wrote:
>> >> >>
>> >> >> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier <[email protected]> wrote:
>> >> >> > We've been seeing this issue on all of our dumpling clusters, and I'm
>> >> >> > wondering what might be the cause of it.
>> >> >> >
>> >> >> > In dump_historic_ops, the time between op_applied and sub_op_commit_rec
>> >> >> > or the time between commit_sent and sub_op_applied is extremely high.
>> >> >> > Some of the osd_sub_ops are as long as 100 ms. A sample
>> >> >> > dump_historic_ops is included at the bottom.
>> >> >>
>> >> >> It's important to understand what each of those timestamps is
>> >> >> reporting.
>> >> >>
>> >> >> op_applied: the point at which an OSD has applied an operation to its
>> >> >> readable backing filesystem in-memory (which for xfs or ext4 will be
>> >> >> after it's committed to the journal)
>> >> >> sub_op_commit_rec: the point at which an OSD has gotten commits from
>> >> >> the replica OSDs
>> >> >> commit_sent: the point at which a replica OSD has sent a commit back
>> >> >> to its primary
>> >> >> sub_op_applied: the point at which a replica OSD has applied a
>> >> >> particular operation to its backing filesystem in-memory (again, after
>> >> >> the journal if using xfs)
>> >> >>
>> >> >> Reads are never served from replicas, so a long time between
>> >> >> commit_sent and sub_op_applied should not in itself be an issue. A lag
>> >> >> time between op_applied and sub_op_commit_rec means that the OSD is
>> >> >> waiting on its replicas. A long time there indicates either that the
>> >> >> replica is processing slowly, or that there's some issue in the
>> >> >> communications stack (all the way from the raw ethernet up to the
>> >> >> message handling in the OSD itself).
>> >> >> So the first thing to look for is sub ops which have a lag time
>> >> >> between the received_at and commit_sent timestamps. If none of those
>> >> >> ever turn up, but unusually long waits for sub_op_commit_rec are still
>> >> >> present, then it'll take more effort to correlate particular subops on
>> >> >> replicas with the op on the primary they correspond to, and see where
>> >> >> the time lag is coming into it.
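[The "first thing to look for" above is easy to script against dump_historic_ops output. A sketch under assumptions: the op structure below is a simplified, hypothetical view of the JSON (a flat "events" list per op, descriptions truncated), and the 20 ms threshold is arbitrary; the timestamps are the two subops quoted earlier in this thread:]

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def commit_lag_ms(op):
    """received_at -> commit_sent lag for one op, in milliseconds.

    `op` is assumed (simplified view) to be a dict with a "received_at"
    timestamp and a flattened "events" list of {"time", "event"} entries."""
    received = datetime.strptime(op["received_at"], FMT)
    for ev in op["events"]:
        if ev["event"] == "commit_sent":
            sent = datetime.strptime(ev["time"], FMT)
            return (sent - received).total_seconds() * 1000.0
    return None  # no commit_sent event recorded for this op

def slow_subops(ops, threshold_ms=20.0):
    """Ops whose received_at -> commit_sent lag exceeds threshold_ms, worst first."""
    lagged = [(commit_lag_ms(op), op["description"]) for op in ops]
    return sorted(((ms, d) for ms, d in lagged if ms is not None and ms > threshold_ms),
                  reverse=True)

# The two subops quoted earlier in the thread (descriptions truncated):
ops = [
    {"description": "osd_op(client.6869831.0:1192491 ...)",
     "received_at": "2014-03-13 20:42:05.811936",
     "events": [{"time": "2014-03-13 20:42:05.850215", "event": "commit_sent"}]},
    {"description": "osd_op(client.6869831.0:1190526 ...)",
     "received_at": "2014-03-13 20:41:40.227813",
     "events": [{"time": "2014-03-13 20:41:40.314633", "event": "commit_sent"}]},
]

for ms, desc in slow_subops(ops):
    print("%6.1f ms  %s" % (ms, desc))
```

[This reproduces the lags discussed above: about 38.3 ms for the first op ("almost 39ms") and 86.8 ms for the second ("80+ms"), sorted worst first.]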
>> >> >> -Greg
>> >> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Best regards,
Irek Fasikhov
Mobile: +79229045757
