On 10/21/2015 11:25 AM, Jan Schermer wrote:
> 
>> On 21 Oct 2015, at 09:11, Wido den Hollander <w...@42on.com> wrote:
>>
>> On 10/20/2015 09:45 PM, Martin Millnert wrote:
>>> The thing that worries me with your next-gen design (actually your
>>> current design as well) is SSD wear. If you use an Intel SSD at 10 DWPD,
>>> that's 12TB/day per 64TB total. I guess it is use-case dependent, and
>>> perhaps a 1:4 write:read ratio is quite high in terms of writes as-is.
>>> You're also throughput-limiting yourself to the PCIe bandwidth of the
>>> NVMe device (regardless of NVRAM/SSD). Compared to a traditional
>>> interface that may be OK of course, in relative terms. NVRAM vs SSD here
>>> is simply a choice between wear (NVRAM as journal, at minimum) and
>>> cache hit probability (size).
>>> Interesting thought experiment anyway for me, thanks for sharing Wido.
>>> /M
>>
>> We are looking at the PC 3600DC 1.2TB, which according to the specs from
>> Intel is rated at 10.95PBW.
>>
>> Like I mentioned in my reply to Mark, we are still running on 1Gbit and
>> heading towards 10Gbit.
>>
>> Bandwidth isn't really an issue in our cluster. During peak moments we
>> average about 30k IOPS through the cluster, but the TOTAL client I/O is
>> just 1Gbit read and write. Sometimes a bit higher, but mainly small I/O.
>>
>> Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
>> lower latency and thus more IOPS.
>>
>> Currently our S3700 SSDs are peaking at 50% utilization according to
>> iostat.
>>
>> After 2 years of operation the lowest Media_Wearout_Indicator we see is
>> 33. On Intel SSDs this starts at 100 and counts down to 0, with 0
>> indicating that the SSD is worn out.
>>
>> So in 24 months we wore through 67% of the SSD. A quick calculation
>> tells me we still have 12 months left on that SSD before it dies.
> 
> Could you maybe run isdct and compare what it says about the expected
> lifetime? I think isdct will report a much longer lifetime than you expect.
> 
> For comparison, one of my drives (S3610, 1.2TB) - this drive has a 3 DWPD
> rating (~6.5PB written):
> 
> 241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1487714  <-- units of 32MB, that translates to ~47TB
> 233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0  (maybe my smartdb needs updating, but this is what it says)
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1008
> 
> If I extrapolate this blindly I would expect the SSD to reach its TBW of
> 6.5PB in about 15 years.
> 
> But isdct says:
> 
> EnduranceAnalyzer: 46.02 Years
> 
> If I reverse it and calculate the endurance based on the SMART values,
> that would give over 18PB (which is not impossible at all), but isdct is
> a bit smarter and looks at what the current use pattern is. It's clearly
> not only about discarding the initial bursts from when the drive was
> filled during backfilling, because that's not that much, and all my S3610
> drives indicate a similar endurance of 40 years (+-10).
> 
> I'd trust isdct over extrapolated SMART values - I think the SSD will
> actually switch to a different calculation scheme when it reaches a
> certain point in its life (when all the reserve blocks are used, or when
> the first cells start to die...), which is why there's a discrepancy.
> 
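For anyone who wants to redo that extrapolation, below is a minimal
back-of-the-envelope sketch (plain shell + awk) using exactly the values
quoted above. It assumes perfectly linear wear, which is the naive
assumption isdct improves on; swap in the raw values from 'smartctl -A'
for your own drive.

#!/bin/sh
# Naive SSD lifetime extrapolation from SMART, assuming linear wear.
# The numbers are the S3610 figures quoted above.
lbas_written=1487714    # attr 241 Total_LBAs_Written (units of 32MB on this drive)
power_on_hours=1008     # attr 9 Power_On_Hours
rated_pbw=6.5           # endurance rating in PB written

awk -v l="$lbas_written" -v h="$power_on_hours" -v pbw="$rated_pbw" 'BEGIN {
    tb = l * 32 / 1e6                 # 32MB units -> TB written
    printf "written: %.1f TB, naive lifetime: %.1f years\n", tb, (pbw * 1000 / tb) * h / 8760
}'

# Same linear assumption for a Media_Wearout_Indicator that went from
# 100 to 33 in 24 months, as on the S3700s mentioned above:
awk 'BEGIN { printf "remaining: %.0f months\n", 33 / (100 - 33) * 24 }'

That reproduces the ~47TB written and ~15 years for the S3610, and the
~12 months remaining for the most worn S3700. Here is what isdct says for
one of the journal/bcache SSDs in this cluster: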
$ ./isdct show -intelssd
$ ./isdct set -intelssd 1 enduranceanalyzer=reset

<< WAIT >60 MIN >>

$ ./isdct show -a -intelssd 1

Told me:

DevicePath:        /dev/sg3
DeviceStatus:      Healthy
EnduranceAnalyzer: 0.38 Years
ErrorString:

So this SSD has a little over 6 months left before it dies. However, the
cluster has been in a backfill state for over 24 hours now and I took this
sample while a backfill was happening, so probably not the best time.

IF the 0.38 years is true, then the SSD had a lifespan of 2 years and 8
months as a journaling/bcache SSD.
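To redo that sample outside of a backfill window, something like the
following should work (a minimal sketch; it assumes the same isdct binary
and drive index 1 as above, and that the drive sees more than 60 minutes
of representative I/O after the reset):

#!/bin/sh
# Reset the endurance analyzer, let it observe normal I/O for >60 minutes,
# then read the new estimate back. Check 'isdct show -intelssd' for the
# right drive index first.
./isdct set -intelssd 1 enduranceanalyzer=reset
sleep 4000
./isdct show -a -intelssd 1 | grep EnduranceAnalyzer

Repeating that on a few quiet days should show how much the backfill is
skewing the 0.38 year figure.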
Wido

> Jan
> 
> 
>> 
>> But this is the lowest; other SSDs which were taken into production at
>> the same moment are ranging between 36 and 61.
>>
>> Also, when buying the 1.2TB SSD we'll probably allocate only 1TB of the
>> SSD and leave 200GB of cells spare so the wear-leveling inside the SSD
>> has some spare cells.
>>
>> Wido
>>
>>>
>>> -------- Original message --------
>>> From: Wido den Hollander <w...@42on.com>
>>> Date: 20/10/2015 16:00 (GMT+01:00)
>>> To: ceph-users <ceph-us...@ceph.com>
>>> Subject: [ceph-users] Ceph OSDs with bcache experience
>>>
>>> Hi,
>>>
>>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>>> bcache in production, and Mark Nelson asked me to share some details.
>>>
>>> Bcache is now running in two clusters that I manage, but I'll keep this
>>> information to one of them (the one at PCextreme behind CloudStack).
>>>
>>> This cluster has been running for over 2 years now:
>>>
>>> epoch 284353
>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>> created 2013-09-23 11:06:11.819520
>>> modified 2015-10-20 15:27:48.734213
>>>
>>> The system consists of 39 hosts:
>>>
>>> 2U SuperMicro chassis:
>>> * 80GB Intel SSD for OS
>>> * 240GB Intel S3700 SSD for Journaling + Bcache
>>> * 6x 3TB disk
>>>
>>> This isn't the newest hardware. The next batch of hardware will have
>>> more disks per chassis, but this is it for now.
>>>
>>> All systems were installed with Ubuntu 12.04, but they are all running
>>> 14.04 now with bcache.
>>>
>>> The Intel S3700 SSD is partitioned with a GPT label:
>>> - 5GB journal for each OSD
>>> - 200GB partition for bcache
>>>
>>> root@ceph11:~# df -h|grep osd
>>> /dev/bcache0    2.8T  1.1T  1.8T  38%  /var/lib/ceph/osd/ceph-60
>>> /dev/bcache1    2.8T  1.2T  1.7T  41%  /var/lib/ceph/osd/ceph-61
>>> /dev/bcache2    2.8T  930G  1.9T  34%  /var/lib/ceph/osd/ceph-62
>>> /dev/bcache3    2.8T  970G  1.8T  35%  /var/lib/ceph/osd/ceph-63
>>> /dev/bcache4    2.8T  814G  2.0T  30%  /var/lib/ceph/osd/ceph-64
>>> /dev/bcache5    2.8T  915G  1.9T  33%  /var/lib/ceph/osd/ceph-65
>>> root@ceph11:~#
>>>
>>> root@ceph11:~# lsb_release -a
>>> No LSB modules are available.
>>> Distributor ID: Ubuntu
>>> Description:    Ubuntu 14.04.3 LTS
>>> Release:        14.04
>>> Codename:       trusty
>>> root@ceph11:~# uname -r
>>> 3.19.0-30-generic
>>> root@ceph11:~#
>>>
>>> "apply_latency": {
>>>     "avgcount": 2985023,
>>>     "sum": 226219.891559000
>>> }
>>>
>>> What did we notice?
>>> - Fewer spikes on the disks
>>> - Lower commit latencies on the OSDs
>>> - Almost no 'slow requests' during backfills
>>> - Cache-hit ratio of about 60%
>>>
>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>
>>> For the next generation of hardware we are looking into using 3U chassis
>>> with 16x 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we haven't
>>> tested those yet, so nothing to say about them.
>>>
>>> The current setup is 200GB of cache for 18TB of disks. The new setup
>>> will be 1200GB for 64TB; curious to see what that does.
>>>
>>> Our main conclusion however is that it does smooth out the I/O pattern
>>> towards the disks, and that gives an overall better response from the
>>> disks.
>>>
>>> Wido
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
> 

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com