> On Dec 5, 2016, at 9:42 PM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> On Tue, 6 Dec 2016 03:37:32 +0100 Christian Theune wrote:
>
>> Hi Christian (heh),
>>
>> thanks for picking this up. :)
>>
>> This has become a rather long post as I added more details and gave our history, but if we make progress then maybe this can help others in the future. I find slow requests extremely hard to debug and, as I said, aside from scratching my own itch I'd be happy to help future travellers.
>>
>>> On 6 Dec 2016, at 00:59, Christian Balzer <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
>>>
>>>> Hi,
>>>>
>>>> we're currently expanding our cluster to grow the number of IOPS we can provide to clients. We're still on Hammer but in the process of upgrading to Jewel.
>>>
>>> You might want to wait until the next Jewel release, given the current number of issues.
>>
>> Our issues or Jewel's? (This is only half a joke; a friend's Ceph cluster's issues on Jewel are making me quite nervous and I'm already re-evaluating whether to postpone.)
>>
> Jewel issues, like the most recent one with scrub sending OSDs to neverland.
>
>>>> We started adding pure-SSD OSDs in the last few days (based on MICRON S610DC-3840) and the slow requests we've seen in the past have started to show a different pattern.
>>>>
>>> I looked in the archives and can't find a full description of your cluster hardware in any posts from you or the other Christian (hint, hint). Slow requests can nearly always be traced back to HW issues/limitations.
>>
>> We're currently running on 42% Christians. ;)
>>
>> We currently have hosts of 2 generations in our cluster. They're both SuperMicro, sold by Thomas Krenn.
>>
>> Type 1 (4 Hosts)
>>
>> SuperMicro X9DR3-F
>> 64 GiB RAM
>> 2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>
> A bit on the weak side if you'd be doing lots of small IOPS, as you found out later. Not likely a real problem here, though.
>
>> LSI MegaRAID SAS 9271-8i
>
> And as you already found out, not a good match for Ceph (unless you actually do RAID, like me with Areca).
> Also, LSI tends to have issues unless you're on the latest drivers (firmware and kernel side).
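
For anyone who wants to check the firmware/driver side quickly, something along these lines should do it. The storcli syntax is from memory and /c0 is just an example controller number, so treat this as a sketch rather than gospel:

# storcli64 /c0 show all | grep -i 'fw package'
# modinfo megaraid_sas | grep '^version'
# dmesg | grep -i megaraid

The first two show the controller firmware package and the kernel driver version; the dmesg grep is only there to spot any logged controller resets or timeouts.
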
We currently have our OSDs behind an LSI 3108 RoC, in individual RAID 0 virtual disks. Curious to see whether this turns out to be an issue in this case, as we also see slow requests. We are journaling entirely on-disk for now, though we are evaluating NVMe journaling, possibly as soon as today. While not ideal, the virtual disks are BBU-backed, which is the result of being burned by a power outage during the PoC stage, when we used the on-device write cache and corrupted the leveldb beyond repair. So while an HBA would be 'ideal', the BBU-backed RAID has added a level of resiliency that could be helpful to some (verified with further destructive testing, with the disk cache both enabled and disabled, to confirm reproducibility). A storcli sketch for checking these cache settings follows below, after the quoted hardware details.

>
>> Dual 10-Gigabit X540-AT2
>> OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
>> Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those should be DC P3700, maybe my inventory is wrong)
>>
> That needs clarification, which ones are they now?
>
>> Pool "rbd.ssd"
>>
>> 2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>>
> You're clearly a much braver man than me or have a very light load (but at least the 3510s have 1 DWPD endurance, unlike the 0.3 of the 3500s).
> Please clarify, journals inline (as you suggest below) or also on the NVMe?
>
>> Pool "rbd.hdd"
>>
>> 5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
>> 1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
>
> Same here, journal inline or also on the NVMe?
>
>>
>> Type 2 (2 Hosts)
>>
>> SuperMicro X9SRL-F
>> 32 GiB RAM
>> 2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>> LSI MegaRAID SAS 9260-4i
>> Dual 10-Gigabit X540-AT2
>> OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
>> Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those should be DC P3700, maybe my inventory is wrong)
>>
>> Pool "rbd.ssd"
>>
>> 2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>>
>> Pool "rbd.hdd"
>>
>> 7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
>> 1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
>>
>> Network
>>
>> 10 GE copper on Brocade VDX 6740-T switches w/ 20 GBit interconnect
>> Storage access and cluster network on 2 physically separated ports
>> Some KVM servers are still running 1 GE interfaces for access, some 10 GE
>>
>> Software
>>
>> Ceph: 0.94.7
>> Filesystems: OS on ext4, OSDs on xfs
>> Kernel: 4.4.27 (one host of type 2 still waiting for a reboot to upgrade from 4.1.16 in the next days)
>> Qemu: 2.7 with librbd
>>
>> Overall we've been happier with the second type of hosts (the 15k disks obviously don't reach their limit in a mixed environment), but generally we've been unhappy with having RAID controllers in there at all. We're going for pure HBA setups with newer machines.
>>
> Nods.
>
>> This cluster setup has been with us for a few years now and has seen various generations of configurations, ranging from an initial huge OSD on RAID 5, over a temporary 2-disk RAID 0, to using CacheCade (until it died ungracefully). Now it's just single-disk jbod/raid 0 with journals on NVMe SSD. We added a pure SSD pool with inline journals a while ago, which has not shown any slow requests for the last 14 days or so.
>
> I'd be impressed to see slow requests from that pool, even with 3510s. Because that level of activity should cause your HDD pool to come to a total standstill.
>
>> We added 10GE about 18 months ago and migrated that from a temporary solution to Brocade in September.
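
Coming back to those cache settings: this is the kind of storcli check/change I had in mind above (again, syntax from memory; /c0/v0 is only an example VD, so adjust controller and VD numbers for your setup):

# storcli64 /c0/vall show all | grep -i cache
# storcli64 /c0/v0 set wrcache=wb
# storcli64 /c0/v0 set pdcache=off

wrcache=wb only makes sense with a healthy BBU; without one you would want wt. pdcache=off matches the lesson from the power-outage/leveldb incident mentioned above.
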
>>
>>> Have you tested these SSDs for their suitability as a Ceph journal (I assume inline journals)?
>>> This has been discussed here countless times, google.
>>
>> We're testing them right now. We stopped testing this in our development environment, as any learning aside from "works generally"/"doesn't work at all" has not been portable to our production load at all. All our experiments trying to reliably replicate production traffic in a reasonable way have failed.
>>
>> I've FIOed them and was able to see their spec'd write IOPS (18k random 4k IO at QD 32) and throughput. I wasn't able to see 100k read IOPS but maxed out around 30k.
>>
> Neither FIO (the one you cite below) nor IOPS are of relevance to journals; sequential sync writes are. Find the most recent discussions here, and for slightly dated info with dubious data/results see:
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>>> If those SSDs aren't up to snuff, they may be worse than plain disks in some cases.
>>>
>>> In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when factoring in journal and FS overheads and write amplification scenarios. I'd be worried about wearing these out long before 5 years are over.
>>
>> I reviewed the endurance of our existing hard drives. The hard drives have seen less than 0.05 DWPD since we added them. I wasn't able to get data from the NVMe drives about how much data they actually wrote, to be able to compute endurance. The Intel tool currently shows an endurance analyzer of 14.4 years. This might be a completely useless number due to aggressive over-provisioning of the SSD.
>
> No SMART interface for the NVMes? I don't have any, so I wouldn't know.

Smartmontools says they added basic NVMe support in 6.5, but I haven't jumped from 6.2 yet. Intel's SSD Data Center Tool (isdct) is not entirely intuitive, but provides some SMART statistics, including NAND Bytes Written:

# isdct show -o json -smart F4 -intelssd
{
    "SMART Attributes PHMD54560053400AGN":
    {
        "F4":
        {
            "Description":"NAND Bytes Written",
            "ID":"F4",
            "Normalized":100,
            "Raw":1594960
        }
    }
}

There are some other statistics worth checking, including the 'SmartHealthInfo' nvmelog.

Reed

> Well, at 0.05 DWPD you're probably fine, so much more IOPS than bandwidth bound.
>
> Some average/peak IOPS and bandwidth numbers (ceph -s) from your cluster would give us some idea whether you're trying to do the impossible or whether your HW is to blame, to some extent at least.
>
>>
>> However, I'll be monitoring this more closely on the Microns and not worry about the full 5 years just yet.
>>
>>>> I'm currently seeing those:
>>>>
>>>> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 : cluster [WRN] 5 slow requests, 1 included below; oldest blocked for > 31.675358 secs
>>>> 2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 : cluster [WRN] slow request 31.674886 seconds old, received at 2016-12-05 15:13:05.852525: osd_op(client.518589944.0:2734750 rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1892352~4096] 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently waiting for rw locks
>>>>
>>> Is osd.60 one of these SSDs?
>>
>> It is.
>> However, I can also show any other of the OSDs from the same pool - this isn't restricted to the Micron SSDs but also shows up for the 1.8 TB SATA drives. I reviewed all the logs I got from the last two weeks and out of all complaints those have always made up ~30% of the slow requests we see. Others include:
>>
> Again, seeing slow requests from an SSD-based OSD strikes me as odd; aside from LSI issues (which should show up in kernel logs), the main suspect would be the journal writes totally overwhelming it.
> Unlike the Intel 3xxx, which are known to handle things (sync writes) according to their specs.
>
>> currently waiting for subops from … (~35%)
>> currently waiting for rw (~30%)
>> currently no flag points reached (~20%)
>> currently commit_sent (<5%)
>> currently reached_pg (<3%)
>> currently started (<1%)
>> currently waiting for scrub (extremely rare)
>>
>> The statistics for waiting for sub operations include OSD 60 (~1886) more than any others, even though all of the others do show a significant amount (~100-400).
>
> Those are not relevant (related) to this OSD.
>
>> My current guess is that OSD 60 temporarily got more than its fair share of PGs, and a lot of the movement on the cluster just affected it much more. I added 3 of the Micron SSDs at the same time and the two others are well in the normal range. I don't have distribution over time, so I can't prove that right now. The Microns do have a) higher primary affinity and b) double the size of the normal SATA drives, so that should result in more requests ending up there.
>>
>
> Naturally.
> I'm not convinced that mixing HDDs and SSDs is the way forward: either you can handle your load with HDDs (enough of them) and fast journals, or you can't and are better off going all SSD, unless your use case is a match for cache-tiering.
>
>> Looking at iostat I don't see them creating large backlogs - at least I can't spot them interactively.
>>
>
> You should see high latencies and utilization in there when the slow requests happen.
> Also look at it with atop in a larger terminal and with at least a 5s refresh interval.
>
>> Looking at the specific traffic that Ceph with inline journals puts on the device: during backfill it appears to be saturated from a bandwidth perspective. I see around 100MB/s going in with around 1k write IOPS. During this time, iostat shows an average queue length of around 200 requests and an average wait of around 150ms, which is quite high.
>
> Precisely.
> The journal/sync write plot thickens.
>
>> Adding a 4k FIO [1] at the same time doesn't change throughput or IOPS significantly. Stopping the OSD then shows IOPS going up to 12k r/w, but of course bandwidth is much lower at that point.
>>
>> Something we have looked at again, but weren't able to find anything on, is whether the controller might be limiting us in some way. Then again: the Intel SSDs are attached to the same controller and have not shown *any* slow requests at all. All the slow requests are confined to the rbd.hdd pool (which now includes the Microns as part of a migration strategy).
>>
>> I'm also wondering whether we may be CPU bound, but the stats do not indicate that. We're running the "ondemand" governor and I've seen references to people preferring the "performance" governor. However, I would expect that we'd see significantly reduced idle times when this becomes an issue.
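
On the governor question: it is cheap to test at runtime before committing to anything. A minimal sketch, assuming a cpufreq-enabled kernel (the cpupower line needs that tool installed; the sysfs echo works without it):

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# cpupower frequency-set -g performance

The sysfs change does not survive a reboot, so it is easy to compare latencies under "ondemand" vs. "performance" and revert if it makes no difference.
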
> The performance governor will help with (Ceph induced) latencies foremost.
>
>>> Tools like atop and iostat can give you good insight into how your storage subsystem is doing.
>>
>> Right. We have collectd running, including the Ceph plugin, which helps with finding historical data. At the moment I'm not getting much out of the interactive tools as any issue will have resolved by the time I'm logged in. ;)
>>
> I suppose you could always force things with a fio from a VM or a "rados bench".
>
> For now, concentrate on validating those Microns.
>
> Christian
>
>>> The moment you see slow requests, your cluster has issues you need to address.
>>> Especially if this happens when no scrubs are going on (restrict scrubs to off-peak hours).
>>
>> I've seen this happen when no scrubs are around. I currently cannot pinpoint performance issues to ongoing scrubs. We have limited scrubs to a low load threshold but run them regularly during the day.
>>
>>> The 30 seconds is an arbitrary value and the WRN would warrant an ERR in my book, as some applications take a very dim view of being blocked for more than a few seconds.
>>
>> Absolutely: we see adverse effects within VMs much earlier and have lowered the complaint time to 15 seconds because of that.
>>
>> Here's the Ceph config we're running with. Some of the values are temporarily tuned at runtime to support operations (like max backfills, etc.):
>>
>> [global]
>> fsid = ...
>> public network = 172.22.4.0/24
>> cluster network = 172.22.8.0/24
>> pid file = /run/ceph/$type-$id.pid
>> # should correspond to the setting in /etc/init.d/ceph
>> max open files = 262144
>> mon host = <3 mons>
>> mon osd nearfull ratio = .9
>> osd pool default min size = 1
>> osd pool default size = 2
>> # recovery sleep becomes available in infernalis.
>> # osd recovery sleep = 0.005
>> osd snap trim sleep = 0.05
>> osd scrub sleep = 0.05
>>
>> [client]
>> log file = /var/log/ceph/client.log
>> rbd cache = true
>> rbd default format = 2
>> admin socket = /run/ceph/rbd-$pid-$cctid.asok
>>
>> [mon]
>> admin socket = /run/ceph/$cluster-$name.asok
>> mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
>> mon data = /srv/ceph/mon/$cluster-$id
>> mon data avail crit = 2
>> mon data avail warn = 5
>> mon osd allow primary affinity = true
>> mon pg warn max per osd = 3000
>>
>> [mon.cartman07]
>> host = cartman07
>> mon addr = 172.22.4.43:6789
>> public addr = 172.22.4.43:6789
>> cluster addr = 172.22.8.13:6789
>>
>> [osd]
>> host = cartman07
>> admin socket = /run/ceph/$cluster-$name.asok
>> public addr = 172.22.4.43
>> cluster addr = 172.22.8.13
>> filestore fiemap = true
>> filestore max sync interval = 5
>> # slightly more than 1 week to avoid hitting the same day at the same time
>> osd deep scrub interval = 700000
>> # place this between "nearfull" and "full"
>> osd backfill full ratio = 0.92
>> # 2 * 5 seconds * 200MB/s disk throughput = 2000MB
>> osd journal size = 2000
>> osd max backfills = 1
>> osd op complaint time = 15
>> osd recovery max active = 1
>> osd recovery op priority = 1
>> osd scrub load threshold = 2
>>
>> Something I started playing around with today was looking up specific slow requests from the admin sockets. Here's what happened a bit earlier. Something I see is that the request flaps between "reached pg" and "waiting for rw locks". I'll provide a set of 3 dumps that are correlated.
>> I started with OSD 30 (Micron) and pulled the ones from OSD 33 and 22 as they are referenced from the "waiting for subops from".
>>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
