> On Dec 5, 2016, at 9:42 PM, Christian Balzer <[email protected]> wrote:
> 
> 
> Hello,
> 
> On Tue, 6 Dec 2016 03:37:32 +0100 Christian Theune wrote:
> 
>> Hi Christian (heh),
>> 
>> thanks for picking this up. :)
>> 
>> This has become a rather long post as I added more details and gave our
>> history, but if we make progress then maybe this can help others in the
>> future. I find slow requests extremely hard to debug and, as I said,
>> aside from scratching my own itch I’d be happy to help future
>> travellers.
>> 
>>> On 6 Dec 2016, at 00:59, Christian Balzer <[email protected]> wrote:
>>> 
>>> Hello,
>>> 
>>> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
>>> 
>>>> Hi,
>>>> 
>>>> we’re currently expanding our cluster to grow the number of IOPS we
>>>> can provide to clients. We’re still on Hammer but in the process of
>>>> upgrading to Jewel. 
>>> 
>>> You might want to wait until the next Jewel release, given the current
>>> number of issues.
>> 
>> Our issues or Jewel’s? (This is only half a joke; a friend’s Ceph
>> cluster’s issues on Jewel are making me quite nervous and I’m already
>> re-evaluating whether to postpone.)
>> 
> Jewel issues, like the most recent one with scrub sending OSDs to
> neverland.
> 
>>>> We started adding pure-SSD OSDs over the last few days (based on MICRON
>>>> S610DC-3840), and the slow requests we’ve seen in the past have started
>>>> to show a different pattern.
>>>> 
>>> I looked in the archives and can't find a full description of your
>>> cluster hardware in any posts from you or the other Christian (hint,
>>> hint). Slow requests can nearly always be traced back to HW
>>> issues/limitations.
>> 
>> We’re currently running on 42% Christians. ;)
>> 
>> We currently have hosts of 2 generations in our cluster. They’re both
>> SuperMicro, sold by Thomas Krenn.
>> 
>> Type 1 (4 Hosts)
>> 
>>    SuperMicro X9DR3-F
>>    64 GiB RAM
>>    2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> 
> A bit on the weak side if you're doing lots of small IOPS, as you found
> out later. Not likely a real problem here, though.
> 
>>    LSI MegaRAID SAS 9271-8i
> And as you already found out, not a good match for Ceph (unless you
> actually do RAID, like me with Areca).
> Also, LSI tends to have issues unless you're on the latest drivers
> (firmware and kernel side).

We currently have our OSDs behind an LSI 3108 RoC, in individual RAID 0 virtual 
disks. Curious to see whether this turns out to be an issue in this case, as we 
also see slow requests. We are doing purely on-disk journaling at the moment, 
though we are evaluating NVMe journaling, possibly as soon as today.

While not ideal, the virtual disks are BBU backed. That is the result of being 
hurt by a power outage during the PoC stage, when we used the on-device write 
cache and corrupted the leveldb beyond repair.
So while an HBA would be ‘ideal’, the BBU-backed RAID has added a level of 
resiliency that could be helpful to some (confirmed with further destructive 
testing, with the disk cache both enabled and disabled, to check 
reproducibility).
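
For anyone else on a 3108 who wants to check the same thing, a rough sketch of 
how to inspect and set the per-VD cache policy with storcli. This assumes the 
utility is installed as storcli64 and the controller is /c0; the exact policy 
is a judgement call, this just mirrors the BBU-backed write-back plus disabled 
disk cache described above:

# show the cache-related settings for every RAID 0 virtual disk
storcli64 /c0/vall show all | grep -i cache

# BBU-protected controller write-back, on-disk write cache off
storcli64 /c0/vall set wrcache=wb
storcli64 /c0/vall set pdcache=off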

> 
>>    Dual 10-Gigabit X540-AT2
>>    OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
>>    Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those
>> should be DC P3700, maybe my inventory is wrong)
>> 
> That needs clarification, which ones are they now?
> 
>>    Pool “rbd.ssd”
>> 
>>        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>> 
> You're clearly a much braver man than me or have a very light load (but at
> least the 3510s have 1 DWPD endurance, unlike the 0.3 of the 3500s).
> Please clarify: journals inline (as you suggest below) or also on the NVMe?
> 
>>    Pool “rbd.hdd”
>> 
>>        5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
>>        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> Same here, journal inline or also on the NVME?
> 
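One quick way to answer the journal question above on a filestore host is to 
look at where each OSD's journal path points; a small sketch, assuming the 
default /var/lib/ceph/osd layout:

# inline journals show up as a regular file inside the OSD directory,
# external ones as a symlink to the NVMe partition
for j in /var/lib/ceph/osd/ceph-*/journal; do
    echo "$j -> $(readlink -f "$j")"
done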
>> 
>> Type 2 (2 Hosts)
>> 
>>    SuperMicro X9SRL-F
>>    32 GiB RAM
>>    2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>>    LSI MegaRAID SAS 9260-4i
>>    Dual 10-Gigabit X540-AT2
>>    OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
>>    Journal on 400G Intel MLC NVMe PCI-E 3.0 (DC P3600) (I thought those
>> should be DC P3700, maybe my inventory is wrong)
>> 
>>    Pool “rbd.ssd”
>> 
>>        2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>> 
>>    Pool “rbd.hdd”
>> 
>>        7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
>>        1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
>> 
>> Network
>> 
>>    10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
>>     Storage Access and Cluster network on 2 physically separated ports
>>     Some KVM servers are still running 1 GE interfaces for access, some
>> 10 GE
>> 
>> Software
>> 
>>    Ceph: 0.94.7
>>    Filesystems: OS on ext4, OSDs on xfs
>>    Kernel: 4.4.27 (one host of type 2 still waiting for reboot to
>> upgrade from 4.1.16 in the next days) Qemu: 2.7 with librbd
>> 
>> Overall we’ve been more happy with the second type of hosts (the 15k
>> disks obviously don’t reach their limit in a mixed environment), but
>> generally we’ve been unhappy with having RAID controllers in there at
>> all. We’re going for pure HBA setups with newer machines.
>> 
> Nods.
> 
>> This cluster setup has been with us for a few years now and has seen
>> various generations of configurations ranging from initial huge OSD on
>> RAID 5, over temporary 2-disk RAID 0, and using CacheCade (until it died
>> ungracefully). Now it’s just single disk jbod/raid 0 with journals on
>> NVMe SSD. We added a pure SSD Pool with inline journals a while ago
>> which does not show any slow requests for the last 14 days or so. 
> 
> I'd be impressed to see slow requests from that pool, even with 3510s.
> Because that level of activity should cause your HDD pool to come to a
> total standstill. 
> 
>> We added 10GE about 18 months ago and migrated that from a temporary
>> solution to Brocade in September.
>> 
>>> Have you tested these SSDs for their suitability as a Ceph journal (I
>>> assume inline journals)? 
>>> This has been discussed here countless times, google.
>> 
>> We’re testing them right now. We stopped testing this in our development
>> environment, as any learning beyond “works generally”/“doesn’t work
>> at all” has not been portable to our production load. All our
>> experiments trying to reliably replicate production traffic in a
>> reasonable way have failed.
>> 
>> I’ve run fio against them and was able to see their spec’d write IOPS
>> (18k random 4k IO at QD 32) and throughput. I wasn’t able to see 100k
>> read IOPS but maxed out at around 30k.
>> 
> Neither fio (the one you cite below) nor IOPS figures are what matters for
> journals; sequential sync writes are. Find the most recent discussions here
> and, for slightly dated info with dubious data/results, see:
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
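For reference, a minimal sketch of the kind of test that post describes, i.e.
direct sequential 4k writes with sync semantics at queue depth 1 (/dev/sdX is
a placeholder; this is destructive, so point it at a spare SSD or partition,
not a live OSD):

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based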
> 
>>> If those SSDs aren't up to snuff, they may be worse than plain disks in
>>> some cases.
>>> 
>>> In addition, these SSDs have an endurance of 1 DWPD, less than 0.5 when
>>> factoring in journal and FS overheads and write-amplification scenarios.
>>> I'd be worried about wearing these out long before 5 years are over.
>> 
>> I reviewed the endurance from our existing hard drives. The hard drives
>> had less than 0.05 DWPD since we’ve added them. I wasn’t able to get
>> data from the NVMe drives about how much data they actually wrote to be
>> able to compute endurance. The Intel tool currently shows an endurance
>> analyzer of 14.4 years. This might be a completely useless number due to
>> aggressive over-provisioning of the SSD.
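
As a rough sanity check for readers, the DWPD arithmetic can be done from the
host-writes counter most drives expose via SMART; the attribute name and units
vary by vendor, and the numbers below are made up for illustration:

# host-writes counter (Intel SATA DC drives report Host_Writes_32MiB,
# many others report Total_LBAs_Written; the units differ)
smartctl -A /dev/sdX | grep -iE 'host_writes|total_lbas_written'

# DWPD so far = bytes written / capacity / days in service,
# e.g. 50 TB written to a 3.84 TB drive over 400 days:
echo "50 / 3.84 / 400" | bc -l    # ~0.033 drive writes per day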
> 
> No SMART interface for the NVMes? I don't have any, so I wouldn't know.

Smartmontools says they added basic NVMe support in 6.5, but I haven’t jumped 
from 6.2 yet.

Intel’s SSD Data Center Tool (isdct) is not entirely intuitive, but provides 
some SMART statistics, including NAND Bytes Written:

> # isdct show -o json -smart F4 -intelssd
> {
>       "SMART Attributes PHMD54560053400AGN":
>       {
>               "F4":
>               {
>                       "Description":"NAND Bytes Written",
>                       "ID":"F4",
>                       "Normalized":100,
>                       "Raw":1594960
>               }
>       }
> }


There are some other statistics worth checking, including the ‘SmartHealthInfo’ 
nvmelog.
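
For completeness, two alternatives for pulling similar wear and host-writes
data without isdct, assuming a recent enough toolchain:

# smartmontools >= 6.5 has basic NVMe support
smartctl -a /dev/nvme0

# nvme-cli exposes the SMART / health information log
# (data units written, percentage used, etc.)
nvme smart-log /dev/nvme0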

Reed

> 
> Well, at 0.05 DWPD you're probably fine; you're much more IOPS than
> bandwidth bound.
> 
> Some average/peak IOPS and bandwidth numbers (ceph -s) from your cluster
> would give us some idea of whether you're trying to do the impossible or
> whether your HW is to blame, to some extent at least.
> 
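In case it helps, the quickest sources for those numbers are the cluster
status line and the per-pool stats (pool names as used earlier in this
thread):

ceph -s                              # one-shot summary incl. client IO
ceph -w                              # same, streaming, to catch peaks
ceph osd pool stats rbd.hdd rbd.ssd  # per-pool client/recovery IO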
>> 
>> However, I’ll be monitoring this more closely on the Microns and not
>> worry about the full 5 years just yet. 
>> 
>>>> I’m currently seeing those: 
>>>> 
>>>> 2016-12-05 15:13:37.527469 osd.60 172.22.4.46:6818/19894 8080 :
>>>> cluster [WRN] 5 slow requests, 1 included below; oldest blocked for >
>>>> 31.675358 secs
>>>> 2016-12-05 15:13:37.527478 osd.60 172.22.4.46:6818/19894 8081 :
>>>> cluster [WRN] slow request 31.674886 seconds old, received at
>>>> 2016-12-05 15:13:05.852525: osd_op(client.518589944.0:2734750
>>>> rbd_data.1e2b40f879e2a9e3.00000000000000a2 [stat,set-alloc-hint
>>>> object_size 4194304 write_size 4194304,write 1892352~4096]
>>>> 277.ceaf1c22 ack+ondisk+write+known_if_redirected e1107736) currently
>>>> waiting for rw locks
>>>> 
>>> Is osd.60 one of these SSDs?
>> 
>> It is. However, I can also show any other of the OSDs from the same pool
>> - this isn’t restricted to the Micron SSDs but also shows up for the
>> 1.8 TB SATA drives. I reviewed all the logs I got from the last two
>> weeks, and out of all complaints these have always made up ~30% of the
>> slow requests we see. Others include:
>> 
> Again, seeing slow requests from an SSD-based OSD strikes me as odd; aside
> from LSI issues (which should show up in kernel logs), the main suspect
> would be the journal writes totally overwhelming it.
> Unlike the Intel 3xxx, which are known to handle things (sync writes)
> according to their specs.
> 
>> currently waiting for subops from … (~35%)
>> currently waiting for rw (~30%)
>> currently no flag points reached (~20%)
>> currently commit_sent (<5%)
>> currently reached_pg (<3%)
>> currently started (<1%)
>> currently waiting for scrub (extremely rare)
>> 
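A rough way to produce a breakdown like the one above is to aggregate the
"currently ..." suffix of the slow request lines in the cluster log (path
assumes the default /var/log/ceph/ceph.log on a mon host):

grep 'slow request' /var/log/ceph/ceph.log \
    | grep -o 'currently [a-z_ ]*' | sort | uniq -c | sort -rn
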
>> The statistics for “waiting for subops” show OSD 60 (~1886 occurrences)
>> more often than any other, even though all of the others do show a
>> significant amount (~100-400). 
> Those are not relevant (related) to this OSD.
> 
>> My current guess is that OSD 60 has gotten more than its fair share of
>> PGs temporarily and a lot of the movement on the cluster just affected it
>> much more. I added 3 of the Micron SSDs at the same time and the two
>> others are well in the normal range. I don’t have distribution over time,
>> so I can’t prove that right now. The Microns do have a) higher primary
>> affinity and b) twice the size of the normal SATA drives, so that should
>> result in more requests ending up there.
>> 
> 
> Naturally.
> I'm not convinced that mixing HDDs and SSDs is the way forward: either you
> can handle your load with HDDs (enough of them) and fast journals, or you
> can't and are better off going all SSD, unless your use case is a match
> for cache-tiering. 
> 
> 
>> Looking at IOstat I don’t see them creating large backlogs - at least I
>> can’t spot them interactively.
>> 
> 
> You should see high latencies and utilization in there when the slow
> requests happen.
> Also look at it with atop in a larger terminal and with at least a 5s
> refresh interval.
> 
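Concretely, something along these lines (the 5-second interval matches the
advice above):

iostat -xmt 5      # extended per-device stats: await, avgqu-sz, %util
atop 5             # press 'd' to order processes by disk activity
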
>> Looking at the specific traffic that Ceph with an inline journal puts on
>> the device, it appears to be saturated from a bandwidth perspective
>> during backfill. I see around 100MB/s going in with around 1k write
>> IOPS. During this time, iostat shows an average queue length of around
>> 200 requests and an average wait of around 150ms, which is quite high.
> Precisely.
> The journal/sync write plot thickens.
> 
>> Adding a 4k fio run [1] at the same time doesn’t change throughput or
>> IOPS significantly. Stopping the OSD then shows IOPS going up to 12k r/w,
>> but of course bandwidth is much lower at that point.
>> 
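The exact fio job referenced as [1] is not quoted here; purely as an
illustration, a 4k random-write run against a raw device of the kind
described might look like the following (destructive against anything
carrying an OSD, so only for scratch devices):

fio --name=4k-randwrite --filename=/dev/sdX --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based
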
>> Something we have looked into again, without finding anything conclusive,
>> is whether the controller might be limiting us in some way.
>> Then again: the Intel SSDs are attached to the same controller and have
>> not shown *any* slow requests at all. All the slow requests are confined
>> to the rbd.hdd pool (which now includes the Microns as part of a
>> migration strategy).
>> 
>> I’m also wondering whether we may be CPU bound, but the stats do not
>> indicate that. We’re running the “ondemand” governor and I’ve seen
>> references to people preferring the “performance” governor. However, I
>> would expect that we’d see significantly reduced idle times when this
>> becomes an issue.
>> 
> The performance governor will help with (Ceph induced) latencies foremost. 
> 
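Checking and switching the governor is cheap enough to just try (cpupower
ships with the kernel tools; tool and path names may differ per distro):

# what each core is currently running
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# switch everything to the performance governor
cpupower frequency-set -g performance
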
>>> Tools like atop and iostat can give you a good insight on how your
>>> storage subsystem is doing.
>> 
>> Right. We have collectd running, including the Ceph plugin, which helps
>> with finding historical data. At the moment I’m not getting much out of
>> the interactive tools, as any issue will have resolved itself by the time
>> I’m logged in. ;)
>> 
> I suppose you could always force things with a fio from a VM or a 
> "rados bench".
> 
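For example, a small-block write bench against the HDD pool (this generates
real load and real objects, so run it off-peak):

rados bench -p rbd.hdd 60 write -b 4096 -t 16
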
> For now, concentrate on validating those Microns.
> 
> Christian
> 
>>> The moment you see slow requests, your cluster has issues you need to
>>> address.
>>> Especially if this happens when no scrubs are going on (restrict scrubs
>>> to off-peak hours).
>> 
>> I’ve seen this happen when no scrubs are running. I currently cannot
>> pinpoint performance issues to ongoing scrubs. We have limited scrubs to
>> a low load threshold but run them regularly during the day.
>> 
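If restricting scrubs to off-peak hours is the goal and the running release
lacks the scrub begin/end hour options, a crude but workable approach is
toggling the cluster flags from cron on one admin host, e.g. in
/etc/cron.d/ceph-scrub-window (already running scrubs will still finish):

# no new scrubbing between 08:00 and 22:00
0 8  * * *  root  ceph osd set noscrub;   ceph osd set nodeep-scrub
0 22 * * *  root  ceph osd unset noscrub; ceph osd unset nodeep-scrub
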
>>> The 30 seconds is an arbitrary value and the WRN would warrant an ERR
>>> in my book, as some applications take a very dim view of being blocked
>>> for more than a few seconds.
>> 
>> Absolutely: we see adverse effects within VMs much earlier and have
>> lowered the complaint time to 15 seconds due to that. 
>> 
>> Here’s the Ceph config we’re running with. Some of the values are
>> temporarily tuned at runtime to support operations (like max backfills,
>> etc.):
>> 
>> [global]
>> fsid = ...
>> public network = 172.22.4.0/24
>> cluster network = 172.22.8.0/24
>> pid file = /run/ceph/$type-$id.pid
>> # should correspond to the setting in /etc/init.d/ceph
>> max open files = 262144
>> mon host = <3 mons>
>> mon osd nearfull ratio = .9
>> osd pool default min size = 1
>> osd pool default size = 2
>> # recovery sleep becomes available in infernalis.
>> # osd recovery sleep = 0.005
>> osd snap trim sleep = 0.05
>> osd scrub sleep = 0.05
>> 
>> [client]
>> log file = /var/log/ceph/client.log
>> rbd cache = true
>> rbd default format = 2
>> admin socket = /run/ceph/rbd-$pid-$cctid.asok
>> 
>> [mon]
>> admin socket = /run/ceph/$cluster-$name.asok
>> mon addr = 172.22.4.42:6789,172.22.4.43:6789,172.22.4.54:6789
>> mon data = /srv/ceph/mon/$cluster-$id
>> mon data avail crit = 2
>> mon data avail warn = 5
>> mon osd allow primary affinity = true
>> mon pg warn max per osd = 3000
>> 
>> [mon.cartman07]
>> host = cartman07
>> mon addr = 172.22.4.43:6789
>> public addr = 172.22.4.43:6789
>> cluster addr = 172.22.8.13:6789
>> 
>> [osd]
>> host = cartman07
>> admin socket = /run/ceph/$cluster-$name.asok
>> public addr = 172.22.4.43
>> cluster addr = 172.22.8.13
>> filestore fiemap = true
>> filestore max sync interval = 5
>> # slightly more than 1 week to avoid hitting the same day at the same time
>> osd deep scrub interval = 700000
>> # place this between "nearfull" and "full"
>> osd backfill full ratio = 0.92
>> # 2* 5 seconds * 200MB/s disk throughput = 2000MB
>> osd journal size = 2000
>> osd max backfills = 1
>> osd op complaint time = 15
>> osd recovery max active = 1
>> osd recovery op priority = 1
>> osd scrub load threshold = 2
>> 
>> Something I started playing around with today was looking up specific
>> slow requests from the admin sockets. Here’s what happened a bit
>> earlier. Something I see is that the request flaps between “reached pg”
>> and “waiting for rw locks”. I’ll provide a set of 3 dumps that are
>> correlated. I started with OSD 30 (Micron) and pulled the ones from OSD
>> 33 and 22 as they are referenced from the “waiting for subops from”.
>> 
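For readers wanting to do the same, the in-flight and recent-slow op dumps
come straight off the OSD admin socket (osd.30 as in the example above; the
socket path follows the admin socket setting in the [osd] section):

ceph daemon osd.30 dump_ops_in_flight
ceph daemon osd.30 dump_historic_ops

# equivalent, addressing the socket directly:
ceph --admin-daemon /run/ceph/ceph-osd.30.asok dump_historic_ops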
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> [email protected]          Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
