A few things. First, there is no need to deep scrub your PGs every 2 days. Schedule it out so it's closer to a month or so. If you have a really bad power hiccup, you can temporarily tighten the schedule to check for consistency.
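For reference, stretching the deep-scrub cycle out to roughly a month uses the same options that appear later in this thread (a sketch; 2592000 seconds = 30 days, adjust to taste):

```ini
[osd]
# deep scrub each PG roughly once every 30 days instead of every 2
osd deep scrub interval = 2592000
# let the regular scrub interval stretch to match
osd scrub max interval = 2592000
```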
Second, you said "Intel SSD DC S3700 1GB divided into three partitions used for Bluestore blocks.db for each OSD". How large are the partitions for each OSD? What percentage of the available space of the SSD is in use? This model of SSD over-provisions nicely, but you can always help it out by not provisioning all of it. However, DB partitions should be given a good amount of space; I suspect your "1GB" is either a typo for a 1TB disk, or your DB partitions really are only 1GB each.

Third, when talking about a distributed storage system, you can never assume it isn't the network. You should really consider disabling your bond and testing with a single NIC between all of your hosts; this would not be the first time I've seen a bonded network cause issues at least this bad on a cluster. Do you have cluster_network and public_network set? What does your network topology look like?

On Fri, Oct 13, 2017, 11:02 PM J David <[email protected]> wrote:

> Thanks all for input on this.
>
> It’s taken a couple of weeks, but based on the feedback from the list, we’ve got our version of a scrub-one-at-a-time cron script running and confirmed that it’s working properly.
>
> Unfortunately, this hasn’t really solved the real problem. Even with just one scrub and one client running, client I/O requests routinely take 30-60 seconds to complete (read or write), which is so poor that the cluster is unusable for any sort of interactive activity. Nobody is going to sit around and wait 30-60 seconds for a file to save or load, or for a web server to respond, or a SQL query to finish.
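On the single-NIC suggestion above: a quick way to sanity-check raw TCP throughput between hosts, independent of Ceph, is a simple socket push. This sketch runs over loopback purely for illustration; on real hardware you would run the receiving half on one host and the sending half on another (or just use iperf3). All names here are illustrative, not from the thread:

```python
import socket
import threading
import time

def run_server(sock, counter):
    """Accept one connection and count every byte received."""
    conn, _ = sock.accept()
    with conn:
        while True:
            chunk = conn.recv(1 << 16)
            if not chunk:
                break
            counter[0] += len(chunk)

def throughput_test(host="127.0.0.1", total_bytes=64 * 1024 * 1024):
    """Push total_bytes over a TCP socket and return observed MB/s."""
    server = socket.socket()
    server.bind((host, 0))          # ephemeral port
    server.listen(1)
    port = server.getsockname()[1]

    received = [0]
    t = threading.Thread(target=run_server, args=(server, received))
    t.start()

    payload = b"\x00" * (1 << 16)
    start = time.monotonic()
    with socket.create_connection((host, port)) as c:
        sent = 0
        while sent < total_bytes:
            c.sendall(payload)
            sent += len(payload)
    t.join()
    server.close()
    elapsed = time.monotonic() - start
    assert received[0] == sent, "byte count mismatch"
    return sent / elapsed / 1e6     # MB/s

if __name__ == "__main__":
    print(f"loopback throughput: {throughput_test():.0f} MB/s")
```

If host-to-host numbers are wildly below what the bond should deliver, or vary a lot between host pairs, that points at the network rather than the OSDs.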
> > Running “ceph -w” blames this on slow requests blocked for > 32 seconds: > > 2017-10-13 21:21:34.445798 mon.ceph1 [INF] overall HEALTH_OK > 2017-10-13 21:21:51.305661 mon.ceph1 [WRN] Health check failed: 42 > slow requests are blocked > 32 sec (REQUEST_SLOW) > 2017-10-13 21:21:57.311892 mon.ceph1 [WRN] Health check update: 140 > slow requests are blocked > 32 sec (REQUEST_SLOW) > 2017-10-13 21:22:03.343443 mon.ceph1 [WRN] Health check update: 111 > slow requests are blocked > 32 sec (REQUEST_SLOW) > 2017-10-13 21:22:01.833605 osd.5 [WRN] 1 slow requests, 1 included > below; oldest blocked for > 30.526819 secs > 2017-10-13 21:22:01.833614 osd.5 [WRN] slow request 30.526819 seconds > old, received at 2017-10-13 21:21:31.306718: > osd_op(client.6104975.0:7330926 0.a2 > 0:456218c9:::rbd_data.1a24832ae8944a.000000000009d21d:head > [set-alloc-hint object_size 4194304 write_size 4194304,write > 2364416~88064] snapc 0=[] ondisk+write+known_if_redirected e18866) > currently sub_op_commit_rec from 9 > 2017-10-13 21:22:11.238561 mon.ceph1 [WRN] Health check update: 24 > slow requests are blocked > 32 sec (REQUEST_SLOW) > 2017-10-13 21:22:04.834075 osd.5 [WRN] 1 slow requests, 1 included > below; oldest blocked for > 30.291869 secs > 2017-10-13 21:22:04.834082 osd.5 [WRN] slow request 30.291869 seconds > old, received at 2017-10-13 21:21:34.542137: > osd_op(client.6104975.0:7331703 0.a2 > 0:4571f0f6:::rbd_data.1a24832ae8944a.000000000009c8ef:head > [set-alloc-hint object_size 4194304 write_size 4194304,write > 2934272~46592] snapc 0=[] ondisk+write+known_if_redirected e18866) > currently op_applied > 2017-10-13 21:22:07.834445 osd.5 [WRN] 1 slow requests, 1 included > below; oldest blocked for > 30.421122 secs > 2017-10-13 21:22:07.834452 osd.5 [WRN] slow request 30.421122 seconds > old, received at 2017-10-13 21:21:37.413260: > osd_op(client.6104975.0:7332411 0.a2 > 0:456218c9:::rbd_data.1a24832ae8944a.000000000009d21d:head > [set-alloc-hint object_size 4194304 write_size 
4194304,write 4068352~16384] snapc 0=[] ondisk+write+known_if_redirected e18866) currently op_applied
> 2017-10-13 21:22:16.238929 mon.ceph1 [WRN] Health check update: 8 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:21.239234 mon.ceph1 [WRN] Health check update: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 21:22:21.329402 mon.ceph1 [INF] Health check cleared: REQUEST_SLOW (was: 4 slow requests are blocked > 32 sec)
> 2017-10-13 21:22:21.329490 mon.ceph1 [INF] Cluster is now healthy
>
> So far, the following steps have been taken to attempt to resolve this:
>
> 1) Updated to Ubuntu 16.04.3 LTS and Ceph 12.2.1.
>
> 2) Changes to ceph.conf:
> osd max scrubs = 1
> osd scrub during recovery = false
> osd deep scrub interval = 2592000
> osd scrub max interval = 2592000
> osd deep scrub randomize ratio = 0.0
> osd disk thread ioprio priority = 7
> osd disk thread ioprio class = idle
> osd scrub sleep = 0.1
>
> 3) Kernel I/O scheduler set to cfq.
>
> 4) Deep-scrub moved to cron, with a limit of one running at a time.
>
> With these changes, scrubs now take 40-45 minutes to complete, up from 20-25, so the amount of time where there are client I/O issues has actually gotten substantially worse.
>
> To summarize the ceph cluster, it has five nodes. Each node has:
> - Intel Xeon E5-1620 v3 3.5GHz quad-core CPU
> - 64GiB DDR4 1866
> - Intel SSD DC S3700 1GB divided into three partitions used for Bluestore blocks.db for each OSD
> - Separate 64GB SSD for ceph monitor data & system image
> - 3 x 7200rpm drives (Seagate Constellation ES.3 4TB or Seagate Enterprise Capacity 8TB)
> - Dual Intel 10Gigabit NIC w/LACP
>
> The SATA drives all check out healthy via smartctl, and several are either new and were tested right before insertion into this cluster, or have been pulled for testing.
When tested on random operations, they are by and large capable of 120-150 IOPS and about 30MB/sec throughput at 100% utilization, with response times of 5-7ms.
>
> The CPUs are 75-90% idle. The RAM is largely unused (~55GiB free). The network is nearly idle (<50Mbps TX & RX, often <10Mbps). The blocks.db SSDs report 0% to 0.2% utilization. The system/monitor SSD reports 0-0.5% utilization. The SATA drives report between 0 and 100% utilization.
>
> If I turn off the client and just let one deep scrub run, most of the drives sit at negligible utilization, but three run at 60-100% utilization (all reads, about 30MB/sec — so, pretty much flat out), and the three busy drives match the three OSDs that service the PG being deep-scrubbed.
>
> With the client only (scrub script disabled, no scrub running), the SATA disks run 10-30% utilization, doing 5-40 IOPS. “ceph status” every ten seconds over two minutes reports:
> client: 2196 kB/s rd, 10291 kB/s wr, 164 op/s rd, 202 op/s wr
> client: 1989 kB/s rd, 14932 kB/s wr, 136 op/s rd, 194 op/s wr
> client: 1569 kB/s rd, 9648 kB/s wr, 130 op/s rd, 158 op/s wr
> client: 1949 kB/s rd, 11310 kB/s wr, 176 op/s rd, 173 op/s wr
> client: 2515 kB/s rd, 6987 kB/s wr, 120 op/s rd, 134 op/s wr
> client: 1730 kB/s rd, 7516 kB/s wr, 114 op/s rd, 114 op/s wr
> client: 1745 kB/s rd, 7575 kB/s wr, 120 op/s rd, 123 op/s wr
> client: 1407 kB/s rd, 4975 kB/s wr, 129 op/s rd, 121 op/s wr
> client: 1161 kB/s rd, 5053 kB/s wr, 60 op/s rd, 152 op/s wr
> client: 2551 kB/s rd, 7975 kB/s wr, 92 op/s rd, 140 op/s wr
> client: 1310 kB/s rd, 10059 kB/s wr, 47 op/s rd, 171 op/s wr
> client: 1257 kB/s rd, 8717 kB/s wr, 60 op/s rd, 164 op/s wr
> AVERAGE: 1782 kB/s rd, 8753 kB/s wr, 112 op/s rd, 154 op/s wr
>
> With the client and one deep-scrub running, the I/O utilizations of 3 disks are 70-100%, the rest at 0-30%. “ceph status” every ten seconds over two minutes reports:
> client: 380 kB/s rd, 770 kB/s wr, 9 op/s rd, 64 op/s wr
> client: 1158 kB/s rd, 739 kB/s wr, 55 op/s rd, 44 op/s wr
> client: 447 kB/s rd, 4486 kB/s wr, 20 op/s rd, 221 op/s wr
> client: 467 kB/s rd, 17906 B/s wr, 35 op/s rd, 3 op/s wr
> client: 507 kB/s rd, 462 kB/s wr, 19 op/s rd, 11 op/s wr
> client: 795 kB/s rd, 1553 kB/s wr, 14 op/s rd, 38 op/s wr
> client: 349 kB/s rd, 994 kB/s wr, 32 op/s rd, 84 op/s wr
> client: 1055 kB/s rd, 83983 B/s wr, 28 op/s rd, 9 op/s wr
> client: 493 kB/s rd, 363 kB/s wr, 20 op/s rd, 30 op/s wr
> client: 616 kB/s rd, 2113 kB/s wr, 28 op/s rd, 137 op/s wr
> client: 393 kB/s rd, 1253 kB/s wr, 42 op/s rd, 35 op/s wr
> client: 495 kB/s rd, 2653 kB/s wr, 53 op/s rd, 206 op/s wr
> AVERAGE: 596 kB/s rd, 1290 kB/s wr, 30 op/s rd, 74 op/s wr
>
> The client workload was the same in both cases.
>
> So the performance impact of a single deep scrub on a single client is 67% less read throughput, 85% less write throughput, 73% fewer read IOPS, and 52% fewer write IOPS.
>
> (This is still a dramatic improvement from where we started; without the ceph.conf tuning and with 2-3 scrubs running, it was fairly routine for ceph -w to show no client read requests completing at all for over a minute.)
>
> But it seems like even with these settings, the deep scrub is still capable of starving the client.
>
> For example, with this PG being deep-scrubbed:
>
> $ sudo ceph pg dump | fgrep scrub
> dumped all
> 0.4c 10331 0 0 0 0 43330306048 1566 1566 active+clean+scrubbing+deep 2017-10-13 22:57:48.482152 18866'9181844 18866:6689009 [6,10,5] 6 [6,10,5] 6 18765'9167507 2017-10-12 17:38:50.990794 18666'8968567 2017-09-29 01:28:42.558810
>
> The affected OSDs are 6, 10, and 5.
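As an aside, pulling the acting OSD set of a scrubbing PG out of a `ceph pg dump` row can be scripted rather than eyeballed. A small sketch, based on the Luminous-era column layout shown in the quoted output (illustrative parsing, not a stable API — `ceph pg dump` has a JSON mode that is more robust):

```python
import re

# Example row from `ceph pg dump | fgrep scrub` (whitespace collapsed),
# taken from the quoted message above.
ROW = ("0.4c 10331 0 0 0 0 43330306048 1566 1566 "
       "active+clean+scrubbing+deep 2017-10-13 22:57:48.482152 "
       "18866'9181844 18866:6689009 [6,10,5] 6 [6,10,5] 6 "
       "18765'9167507 2017-10-12 17:38:50.990794 "
       "18666'8968567 2017-09-29 01:28:42.558810")

def parse_scrub_row(row):
    """Extract (pgid, state, acting OSD set) from one pg dump row."""
    fields = row.split()
    pgid = fields[0]
    state = next(f for f in fields if "scrubbing" in f)
    # the bracketed lists are the up and acting OSD sets; take the acting one
    osd_sets = re.findall(r"\[([\d,]+)\]", row)
    acting = [int(x) for x in osd_sets[-1].split(",")]
    return pgid, state, acting

pgid, state, acting = parse_scrub_row(ROW)
print(pgid, state, acting)   # 0.4c active+clean+scrubbing+deep [6, 10, 5]
```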
>
> When things go amiss:
>
> 2017-10-13 23:04:44.217383 mon.ceph1 [INF] Cluster is now healthy
> 2017-10-13 23:05:02.963973 mon.ceph1 [WRN] Health check failed: 3 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 23:05:08.445170 mon.ceph1 [WRN] Health check update: 19 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 23:05:14.705356 mon.ceph1 [WRN] Health check update: 34 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2017-10-13 23:05:20.884466 mon.ceph1 [INF] Health check cleared: REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
> 2017-10-13 23:05:20.884522 mon.ceph1 [INF] Cluster is now healthy
>
> Running “ceph health detail” repeatedly confirms that it is (all of) the OSDs being scrubbed that are stalling:
>
> HEALTH_OK
>
> HEALTH_WARN 3 slow requests are blocked > 32 sec
> REQUEST_SLOW 3 slow requests are blocked > 32 sec
> 3 ops are blocked > 32.768 sec
> osd.5 has blocked requests > 32.768 sec
>
> HEALTH_WARN 3 slow requests are blocked > 32 sec
> REQUEST_SLOW 3 slow requests are blocked > 32 sec
> 3 ops are blocked > 32.768 sec
> osd.5 has blocked requests > 32.768 sec
>
> HEALTH_WARN 38 slow requests are blocked > 32 sec
> REQUEST_SLOW 38 slow requests are blocked > 32 sec
> 38 ops are blocked > 32.768 sec
> osds 5,6 have blocked requests > 32.768 sec
>
> HEALTH_WARN 38 slow requests are blocked > 32 sec
> REQUEST_SLOW 38 slow requests are blocked > 32 sec
> 38 ops are blocked > 32.768 sec
> osds 5,6 have blocked requests > 32.768 sec
>
> HEALTH_WARN 19 slow requests are blocked > 32 sec
> REQUEST_SLOW 19 slow requests are blocked > 32 sec
> 19 ops are blocked > 32.768 sec
> osds 5,6 have blocked requests > 32.768 sec
>
> HEALTH_WARN 19 slow requests are blocked > 32 sec
> REQUEST_SLOW 19 slow requests are blocked > 32 sec
> 19 ops are blocked > 32.768 sec
> osds 5,6 have blocked requests > 32.768 sec
>
> HEALTH_WARN 35 slow requests are blocked > 32 sec
> REQUEST_SLOW 35 slow requests are blocked > 32 sec
> 35 ops are blocked > 32.768 sec
> osds 5,6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 35 slow requests are blocked > 32 sec
> REQUEST_SLOW 35 slow requests are blocked > 32 sec
> 35 ops are blocked > 32.768 sec
> osds 5,6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 29 slow requests are blocked > 32 sec
> REQUEST_SLOW 29 slow requests are blocked > 32 sec
> 29 ops are blocked > 32.768 sec
> osds 6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 34 slow requests are blocked > 32 sec
> REQUEST_SLOW 34 slow requests are blocked > 32 sec
> 34 ops are blocked > 32.768 sec
> osds 6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 34 slow requests are blocked > 32 sec
> REQUEST_SLOW 34 slow requests are blocked > 32 sec
> 34 ops are blocked > 32.768 sec
> osds 6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 34 slow requests are blocked > 32 sec
> REQUEST_SLOW 34 slow requests are blocked > 32 sec
> 34 ops are blocked > 32.768 sec
> osds 6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 34 slow requests are blocked > 32 sec
> REQUEST_SLOW 34 slow requests are blocked > 32 sec
> 34 ops are blocked > 32.768 sec
> osds 6,10 have blocked requests > 32.768 sec
>
> HEALTH_WARN 12 slow requests are blocked > 32 sec
> REQUEST_SLOW 12 slow requests are blocked > 32 sec
> 12 ops are blocked > 32.768 sec
> osd.10 has blocked requests > 32.768 sec
>
> HEALTH_OK
>
> A different sample taken later, where ceph -w happened to spit out the details of slow requests, shows that the slowness follows the OSDs, not the PG being deep-scrubbed:
>
> $ sudo ceph pg dump | fgrep scrub
> dumped all
> 0.1a3 10479 0 0 0 0 43951919104 1564 1564 active+clean+scrubbing+deep 2017-10-14 02:13:06.046930 18866'9070989 18866:6495502 [14,9,15] 14 [14,9,15] 14 18866'9060892 2017-10-13 11:22:33.563810 18666'8855616 2017-09-29 08:14:06.935978
>
> Although all of the affected requests shown are on an OSD participating in a deep-scrub, none of the ones shown
are for the PG > being scrubbed (0.1a3, in this case): > > 2017-10-14 02:30:58.876111 osd.15 [WRN] 54 slow requests, 5 included > below; oldest blocked for > 40.738962 secs > 2017-10-14 02:30:58.876117 osd.15 [WRN] slow request 30.042803 seconds > old, received at 2017-10-14 02:30:28.833218: > osd_op(client.6210820.0:962249 0.167 0.6a309767 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:58.876121 osd.15 [WRN] slow request 30.350780 seconds > old, received at 2017-10-14 02:30:28.525241: > osd_op(client.6198346.0:931036 0.4f 0.66fafa4f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:58.876125 osd.15 [WRN] slow request 34.967224 seconds > old, received at 2017-10-14 02:30:23.908796: > osd_op(client.6210820.0:961833 0.167 0.f744d67 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:58.876129 osd.15 [WRN] slow request 33.858779 seconds > old, received at 2017-10-14 02:30:25.017242: > osd_op(client.6210820.0:961949 0.4f 0.fb8a24f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:58.876152 osd.15 [WRN] slow request 32.730421 seconds > old, received at 2017-10-14 02:30:26.145599: > osd_op(client.6210820.0:962067 0.4f 0.2ea1fa4f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:59.876364 osd.15 [WRN] 47 slow requests, 5 included > below; oldest blocked for > 39.876936 secs > 2017-10-14 02:30:59.876371 osd.15 [WRN] slow request 30.347226 seconds > old, received at 2017-10-14 02:30:29.529043: > osd_op(client.6198346.0:931188 0.12b 0.c13f692b (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:59.876375 osd.15 [WRN] slow request 30.346784 seconds > old, received at 2017-10-14 02:30:29.529485: > osd_op(client.6198346.0:931189 0.12b 0.c13f692b (undecoded) > ondisk+write+known_if_redirected 
e18866) currently queued_for_pg > 2017-10-14 02:30:59.876379 osd.15 [WRN] slow request 30.899518 seconds > old, received at 2017-10-14 02:30:28.976750: > osd_op(client.6198346.0:931114 0.4f 0.b02f084f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:59.876383 osd.15 [WRN] slow request 30.899370 seconds > old, received at 2017-10-14 02:30:28.976898: > osd_op(client.6198346.0:931115 0.4f 0.b02f084f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:30:59.876386 osd.15 [WRN] slow request 30.851203 seconds > old, received at 2017-10-14 02:30:29.025066: > osd_op(client.6198346.0:931116 0.4f 0.b02f084f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:00.876599 osd.15 [WRN] 52 slow requests, 5 included > below; oldest blocked for > 40.877172 secs > 2017-10-14 02:31:00.876605 osd.15 [WRN] slow request 30.883711 seconds > old, received at 2017-10-14 02:30:29.992794: > osd_op(client.6198346.0:931211 0.36 0.b9aba836 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:00.876610 osd.15 [WRN] slow request 30.513765 seconds > old, received at 2017-10-14 02:30:30.362740: > osd_op(client.6198346.0:931274 0.a9 0.2d8d5aa9 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:00.876613 osd.15 [WRN] slow request 30.017060 seconds > old, received at 2017-10-14 02:30:30.859445: > osd_op(client.6210820.0:962360 0.a9 0.7db19aa9 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:00.876617 osd.15 [WRN] slow request 31.501782 seconds > old, received at 2017-10-14 02:30:29.374722: > osd_op(client.6198346.0:931183 0.167 0.c0db5367 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:00.876620 osd.15 [WRN] slow request 30.288990 seconds > old, received at 2017-10-14 
02:30:30.587515: > osd_op(client.6210820.0:962354 0.4f 0.d04a024f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:01.876820 osd.15 [WRN] 54 slow requests, 5 included > below; oldest blocked for > 41.877395 secs > 2017-10-14 02:31:01.876826 osd.15 [WRN] slow request 30.051551 seconds > old, received at 2017-10-14 02:30:31.825177: > osd_op(client.6198346.0:931372 0.a9 0.ae2afca9 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:01.876830 osd.15 [WRN] slow request 31.180720 seconds > old, received at 2017-10-14 02:30:30.696008: > osd_op(client.6198346.0:931303 0.4f 0.d960a04f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:01.876833 osd.15 [WRN] slow request 30.097627 seconds > old, received at 2017-10-14 02:30:31.779100: > osd_op(client.6198346.0:931371 0.a9 0.ae2afca9 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:01.876837 osd.15 [WRN] slow request 36.859324 seconds > old, received at 2017-10-14 02:30:25.017404: > osd_op(client.6210820.0:961950 0.4f 0.fb8a24f (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:01.876840 osd.15 [WRN] slow request 36.106802 seconds > old, received at 2017-10-14 02:30:25.769926: > osd_op(client.6198346.0:930836 0.81 0.cdb24281 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:02.877030 osd.15 [WRN] 45 slow requests, 5 included > below; oldest blocked for > 42.630584 secs > 2017-10-14 02:31:02.877037 osd.15 [WRN] slow request 30.497084 seconds > old, received at 2017-10-14 02:30:32.379855: > osd_op(client.6198346.0:931402 0.db 0.1b5228db (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:02.877041 osd.15 [WRN] slow request 30.045499 seconds > old, received at 2017-10-14 02:30:32.831441: > 
osd_op(client.6198346.0:931439 0.19e 0.bf928d9e (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:02.877045 osd.15 [WRN] slow request 30.045355 seconds > old, received at 2017-10-14 02:30:32.831584: > osd_op(client.6198346.0:931440 0.19e 0.bf928d9e (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:02.877048 osd.15 [WRN] slow request 33.937728 seconds > old, received at 2017-10-14 02:30:28.939211: > osd_op(client.6198346.0:931109 0.16c 0.ecf9016c (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:02.877051 osd.15 [WRN] slow request 39.956090 seconds > old, received at 2017-10-14 02:30:22.920850: > osd_op(client.6210820.0:961696 0.167 0.74cc8167 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:03.877265 osd.15 [WRN] 48 slow requests, 4 included > below; oldest blocked for > 43.630826 secs > 2017-10-14 02:31:03.877271 osd.15 [WRN] slow request 30.265866 seconds > old, received at 2017-10-14 02:30:33.611316: > osd_op(client.6198346.0:931511 0.12b 0.f9f6932b (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:03.877275 osd.15 [WRN] slow request 30.736755 seconds > old, received at 2017-10-14 02:30:33.140427: > osd_op(client.6198346.0:931461 0.81 0.67d8881 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:03.877279 osd.15 [WRN] slow request 30.068784 seconds > old, received at 2017-10-14 02:30:33.808398: > osd_op(client.6198346.0:931521 0.167 0.c39f1b67 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:03.877283 osd.15 [WRN] slow request 34.937801 seconds > old, received at 2017-10-14 02:30:28.939380: > osd_op(client.6198346.0:931110 0.16c 0.ecf9016c (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 
02:31:04.877468 osd.15 [WRN] 50 slow requests, 2 included > below; oldest blocked for > 44.631044 secs > 2017-10-14 02:31:04.877474 osd.15 [WRN] slow request 30.100364 seconds > old, received at 2017-10-14 02:30:34.777036: > osd_op(client.6198346.0:931601 0.12b 0.824c6d2b (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:04.877478 osd.15 [WRN] slow request 30.598084 seconds > old, received at 2017-10-14 02:30:34.279316: > osd_op(client.6210820.0:962473 0.1c1 0.d0030dc1 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:05.877693 osd.15 [WRN] 54 slow requests, 4 included > below; oldest blocked for > 45.631254 secs > 2017-10-14 02:31:05.877699 osd.15 [WRN] slow request 30.613464 seconds > old, received at 2017-10-14 02:30:35.264146: > osd_op(client.6210820.0:962514 0.12b 0.d9b2c92b (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:05.877703 osd.15 [WRN] slow request 30.564546 seconds > old, received at 2017-10-14 02:30:35.313064: > osd_op(client.6198346.0:931635 0.167 0.b6530d67 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:05.877707 osd.15 [WRN] slow request 30.561430 seconds > old, received at 2017-10-14 02:30:35.316180: > osd_op(client.6210820.0:962520 0.12b 0.2bc1b12b (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:05.877711 osd.15 [WRN] slow request 30.231697 seconds > old, received at 2017-10-14 02:30:35.645913: > osd_op(client.6198346.0:931670 0.19e 0.1063099e (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:06.877917 osd.15 [WRN] 45 slow requests, 4 included > below; oldest blocked for > 45.044786 secs > 2017-10-14 02:31:06.877925 osd.15 [WRN] slow request 30.860692 seconds > old, received at 2017-10-14 02:30:36.017137: > osd_op(client.6198346.0:931696 0.167 0.a0e34b67 
(undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:06.877929 osd.15 [WRN] slow request 30.860517 seconds > old, received at 2017-10-14 02:30:36.017312: > osd_op(client.6198346.0:931697 0.167 0.a0e34b67 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:06.877933 osd.15 [WRN] slow request 30.860432 seconds > old, received at 2017-10-14 02:30:36.017397: > osd_op(client.6198346.0:931698 0.167 0.a0e34b67 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:06.877936 osd.15 [WRN] slow request 30.695287 seconds > old, received at 2017-10-14 02:30:36.182542: > osd_op(client.6210820.0:962573 0.14e 0.23c5354e (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:10.449873 mon.ceph1 [WRN] Health check update: 72 > slow requests are blocked > 32 sec (REQUEST_SLOW) > 2017-10-14 02:31:15.450202 mon.ceph1 [WRN] Health check update: 59 > slow requests are blocked > 32 sec (REQUEST_SLOW) > 2017-10-14 02:31:08.878201 osd.15 [WRN] 48 slow requests, 3 included > below; oldest blocked for > 47.045085 secs > 2017-10-14 02:31:08.878207 osd.15 [WRN] slow request 30.788227 seconds > old, received at 2017-10-14 02:30:38.089900: > osd_op(client.6210820.0:962712 0.144 0.9a2d1144 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:08.878212 osd.15 [WRN] slow request 30.788130 seconds > old, received at 2017-10-14 02:30:38.089997: > osd_op(client.6210820.0:962713 0.144 0.9a2d1144 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:08.878215 osd.15 [WRN] slow request 30.786906 seconds > old, received at 2017-10-14 02:30:38.091221: > osd_op(client.6210820.0:962747 0.1c1 0.60f887c1 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:09.878410 osd.15 [WRN] 42 slow requests, 1 
included > below; oldest blocked for > 46.828631 secs > 2017-10-14 02:31:09.878417 osd.15 [WRN] slow request 30.367803 seconds > old, received at 2017-10-14 02:30:39.510546: > osd_op(client.6210820.0:962753 0.16c 0.209b656c (undecoded) > ondisk+read+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:10.878600 osd.15 [WRN] 44 slow requests, 2 included > below; oldest blocked for > 47.828813 secs > 2017-10-14 02:31:10.878625 osd.15 [WRN] slow request 30.023391 seconds > old, received at 2017-10-14 02:30:40.855140: > osd_op(client.6198346.0:931975 0.12b 0.96d4852b (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:10.878630 osd.15 [WRN] slow request 30.693944 seconds > old, received at 2017-10-14 02:30:40.184587: > osd_op(client.6198346.0:931947 0.15d 0.360d7d5d (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:11.878798 osd.15 [WRN] 45 slow requests, 1 included > below; oldest blocked for > 48.829020 secs > 2017-10-14 02:31:11.878805 osd.15 [WRN] slow request 30.753634 seconds > old, received at 2017-10-14 02:30:41.125105: > osd_op(client.6198346.0:932010 0.1c1 0.80369dc1 (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:12.878984 osd.15 [WRN] 47 slow requests, 2 included > below; oldest blocked for > 49.829201 secs > 2017-10-14 02:31:12.878993 osd.15 [WRN] slow request 30.042094 seconds > old, received at 2017-10-14 02:30:42.836824: > osd_op(client.6210820.0:962766 0.14e 0.25cb5d4e (undecoded) > ondisk+read+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:12.879001 osd.15 [WRN] slow request 30.463578 seconds > old, received at 2017-10-14 02:30:42.415340: > osd_op(client.6198346.0:932067 0.db 0.895084db (undecoded) > ondisk+write+known_if_redirected e18866) currently queued_for_pg > 2017-10-14 02:31:13.879166 osd.15 [WRN] 48 slow requests, 1 included > below; oldest blocked for > 
50.829393 secs
> 2017-10-14 02:31:13.879172 osd.15 [WRN] slow request 30.191760 seconds old, received at 2017-10-14 02:30:43.687351: osd_op(client.6198346.0:932120 0.19e 0.3e2de99e (undecoded) ondisk+read+known_if_redirected e18866) currently queued_for_pg
> 2017-10-14 02:31:14.879389 osd.15 [WRN] 48 slow requests, 4 included below; oldest blocked for > 50.970509 secs
> 2017-10-14 02:31:14.879396 osd.15 [WRN] slow request 30.787035 seconds old, received at 2017-10-14 02:30:44.092271: osd_op(client.6210820.0:962790 0.4f 0.925ece4f (undecoded) ondisk+write+known_if_redirected e18866) currently queued_for_pg
> 2017-10-14 02:31:14.879401 osd.15 [WRN] slow request 30.461304 seconds old, received at 2017-10-14 02:30:44.418002: osd_op(client.6198346.0:932145 0.4f 0.2882a64f (undecoded) ondisk+write+known_if_redirected e18866) currently queued_for_pg
> 2017-10-14 02:31:14.879404 osd.15 [WRN] slow request 30.461200 seconds old, received at 2017-10-14 02:30:44.418106: osd_op(client.6198346.0:932146 0.19e 0.53a7419e (undecoded) ondisk+write+known_if_redirected e18866) currently queued_for_pg
> 2017-10-14 02:31:14.879408 osd.15 [WRN] slow request 30.787593 seconds old, received at 2017-10-14 02:30:44.091713: osd_op(client.6210820.0:962779 0.167 0.79816d67 (undecoded) ondisk+write+known_if_redirected e18866) currently queued_for_pg
>
> Even if it were possible to restrict the scrubs to low-usage hours, running scrubs for 8 hours a day at 45 minutes each with 512 PGs would take about 48 days to cycle through.
>
> There’s no evidence of any kind supporting a hardware issue.
>
> So, what’s going on here? There’s fairly obviously something seriously wrong with this cluster (rather than with Ceph generally), but where should I look to run this down further?
>
> Thanks for any advice!
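The 48-day figure quoted above checks out; a quick sketch of the arithmetic:

```python
# Back-of-envelope: full deep-scrub cycle time under a one-at-a-time,
# low-usage-hours-only policy (numbers from the quoted message).
pgs = 512                  # placement groups in the pool
minutes_per_scrub = 45     # observed deep-scrub duration
scrub_hours_per_day = 8    # allowed low-usage window

total_hours = pgs * minutes_per_scrub / 60
days = total_hours / scrub_hours_per_day
print(f"{total_hours:.0f} scrub-hours -> {days:.0f} days per full cycle")
# -> 384 scrub-hours -> 48 days per full cycle
```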
> _______________________________________________ > ceph-users mailing list > [email protected] > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >
