On 10/21/2015 11:25 AM, Jan Schermer wrote:
> 
>> On 21 Oct 2015, at 09:11, Wido den Hollander <w...@42on.com> wrote:
>>
>> On 10/20/2015 09:45 PM, Martin Millnert wrote:
>>> The thing that worries me with your next-gen design (actually your
>>> current design as well) is SSD wear. If you use an Intel SSD at 10 DWPD,
>>> that's 12TB/day per 64TB total. I guess it is use-case dependent, and
>>> perhaps a 1:4 write:read ratio is quite high in terms of writes as-is.
>>> You're also throughput-limiting yourself to the PCIe bandwidth of the
>>> NVMe device (regardless of NVRAM/SSD). Compared to a traditional
>>> interface that may be OK of course, in relative terms. NVRAM vs SSD here
>>> is simply a choice between wear (NVRAM as journal, at minimum) and
>>> cache hit probability (size).
>>> Interesting thought experiment anyway for me, thanks for sharing Wido.
>>> /M
>>
>> We are looking at the PC 3600DC 1.2TB, which according to the specs from
>> Intel is rated at 10.95PBW.
>>
>> Like I mentioned in my reply to Mark, we are still running on 1Gbit and
>> heading towards 10Gbit.
>>
>> Bandwidth isn't really an issue in our cluster. During peak moments we
>> average about 30k IOPS through the cluster, but the TOTAL client I/O is
>> just 1Gbit read and write. Sometimes a bit higher, but mainly small I/O.
>>
>> Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
>> lower latency and thus more IOPS.
>>
>> Currently our S3700 SSDs are peaking at 50% utilization according to
>> iostat.
>>
>> After 2 years of operation the lowest Media_Wearout_Indicator we see is
>> 33. On Intel SSDs this starts at 100 and counts down to 0, with 0
>> indicating that the SSD is worn out.
>>
>> So in 24 months we wore through 67% of the SSD. A quick calculation
>> tells me we still have 12 months left on that SSD before it dies.
> 
> Could you maybe run isdct and compare what it says about the expected
> lifetime? I think isdct will report a much longer lifetime than you expect.
> 
> For comparison, one of my drives (S3610, 1.2TB) - this drive has a 3 DWPD
> rating (~6.5PB written):
> 
> 241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1487714  <-- units of 32MB, that translates to ~47TB
> 233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0  (maybe my smartdb needs updating, but this is what it says)
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1008
> 
> If I extrapolate this blindly I would expect the SSD to reach its TBW of
> 6.5PB in about 15 years.
> 
> But isdct says:
> 
> EnduranceAnalyzer: 46.02 Years
> 
> If I reverse it and calculate the endurance based on the SMART values,
> that would give over 18PB (which is not impossible at all), but isdct is
> a bit smarter and looks at what the current use pattern is. It's clearly
> not only about discarding the initial bursts from when the drive was
> filled during backfilling, because that's not that much, and all my S3610
> drives indicate a similar endurance of 40 years (+-10).
> 
> I'd trust isdct over extrapolated SMART values - I think the SSD will
> actually switch to a different calculation scheme when it reaches a
> certain point in its life (when all the reserve blocks are used, or when
> the first cells start to die...), which is why there's a discrepancy.
> 
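For anyone who wants to redo that extrapolation, below is a minimal
back-of-the-envelope sketch (plain shell + awk) using exactly the values
quoted above. It assumes perfectly linear wear, which is the naive
assumption isdct improves on; swap in the raw values from 'smartctl -A'
for your own drive.

#!/bin/sh
# Naive SSD lifetime extrapolation from SMART, assuming linear wear.
# The numbers are the S3610 figures quoted above.
lbas_written=1487714    # attr 241 Total_LBAs_Written (units of 32MB on this drive)
power_on_hours=1008     # attr 9 Power_On_Hours
rated_pbw=6.5           # endurance rating in PB written

awk -v l="$lbas_written" -v h="$power_on_hours" -v pbw="$rated_pbw" 'BEGIN {
    tb = l * 32 / 1e6                 # 32MB units -> TB written
    printf "written: %.1f TB, naive lifetime: %.1f years\n", tb, (pbw * 1000 / tb) * h / 8760
}'

# Same linear assumption for a Media_Wearout_Indicator that went from
# 100 to 33 in 24 months, as on the S3700s mentioned above:
awk 'BEGIN { printf "remaining: %.0f months\n", 33 / (100 - 33) * 24 }'

That reproduces the ~47TB written and ~15 years for the S3610, and the
~12 months remaining for the most worn S3700. Here is what isdct says for
one of the journal/bcache SSDs in this cluster: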
$ ./isdct show -intelssd
$ ./isdct set -intelssd 1 enduranceanalyzer=reset

<< WAIT >60 MIN >>

$ ./isdct show -a -intelssd 1

Told me:

DevicePath:        /dev/sg3
DeviceStatus:      Healthy
EnduranceAnalyzer: 0.38 Years
ErrorString:

So this SSD has a little over 6 months left before it dies. However, the
cluster has been in a backfill state for over 24 hours now and I took this
sample while a backfill was happening, so probably not the best time.

IF the 0.38 years is true, then the SSD had a lifespan of 2 years and 8
months as a journaling/bcache SSD.
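To redo that sample outside of a backfill window, something like the
following should work (a minimal sketch; it assumes the same isdct binary
and drive index 1 as above, and that the drive sees more than 60 minutes
of representative I/O after the reset):

#!/bin/sh
# Reset the endurance analyzer, let it observe normal I/O for >60 minutes,
# then read the new estimate back. Check 'isdct show -intelssd' for the
# right drive index first.
./isdct set -intelssd 1 enduranceanalyzer=reset
sleep 4000
./isdct show -a -intelssd 1 | grep EnduranceAnalyzer

Repeating that on a few quiet days should show how much the backfill is
skewing the 0.38 year figure.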
Wido

> Jan
> 
> 
>> 
>> But this is the lowest; other SSDs which were taken into production at
>> the same moment are ranging between 36 and 61.
>>
>> Also, when buying the 1.2TB SSD we'll probably allocate only 1TB of the
>> SSD and leave 200GB of cells spare so the wear-leveling inside the SSD
>> has some spare cells.
>>
>> Wido
>>
>>>
>>> -------- Original message --------
>>> From: Wido den Hollander <w...@42on.com>
>>> Date: 20/10/2015 16:00 (GMT+01:00)
>>> To: ceph-users <ceph-us...@ceph.com>
>>> Subject: [ceph-users] Ceph OSDs with bcache experience
>>>
>>> Hi,
>>>
>>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>>> bcache in production, and Mark Nelson asked me to share some details.
>>>
>>> Bcache is now running in two clusters that I manage, but I'll keep this
>>> information to one of them (the one at PCextreme behind CloudStack).
>>>
>>> This cluster has been running for over 2 years now:
>>>
>>> epoch 284353
>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>> created 2013-09-23 11:06:11.819520
>>> modified 2015-10-20 15:27:48.734213
>>>
>>> The system consists of 39 hosts:
>>>
>>> 2U SuperMicro chassis:
>>> * 80GB Intel SSD for OS
>>> * 240GB Intel S3700 SSD for Journaling + Bcache
>>> * 6x 3TB disk
>>>
>>> This isn't the newest hardware. The next batch of hardware will have
>>> more disks per chassis, but this is it for now.
>>>
>>> All systems were installed with Ubuntu 12.04, but they are all running
>>> 14.04 now with bcache.
>>>
>>> The Intel S3700 SSD is partitioned with a GPT label:
>>> - 5GB journal for each OSD
>>> - 200GB partition for bcache
>>>
>>> root@ceph11:~# df -h|grep osd
>>> /dev/bcache0    2.8T  1.1T  1.8T  38%  /var/lib/ceph/osd/ceph-60
>>> /dev/bcache1    2.8T  1.2T  1.7T  41%  /var/lib/ceph/osd/ceph-61
>>> /dev/bcache2    2.8T  930G  1.9T  34%  /var/lib/ceph/osd/ceph-62
>>> /dev/bcache3    2.8T  970G  1.8T  35%  /var/lib/ceph/osd/ceph-63
>>> /dev/bcache4    2.8T  814G  2.0T  30%  /var/lib/ceph/osd/ceph-64
>>> /dev/bcache5    2.8T  915G  1.9T  33%  /var/lib/ceph/osd/ceph-65
>>> root@ceph11:~#
>>>
>>> root@ceph11:~# lsb_release -a
>>> No LSB modules are available.
>>> Distributor ID: Ubuntu
>>> Description:    Ubuntu 14.04.3 LTS
>>> Release:        14.04
>>> Codename:       trusty
>>> root@ceph11:~# uname -r
>>> 3.19.0-30-generic
>>> root@ceph11:~#
>>>
>>> "apply_latency": {
>>>     "avgcount": 2985023,
>>>     "sum": 226219.891559000
>>> }
>>>
>>> What did we notice?
>>> - Fewer spikes on the disks
>>> - Lower commit latencies on the OSDs
>>> - Almost no 'slow requests' during backfills
>>> - Cache-hit ratio of about 60%
>>>
>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>
>>> For the next generation of hardware we are looking into using 3U chassis
>>> with 16x 4TB SATA drives and a 1.2TB NVMe SSD for bcache, but we haven't
>>> tested those yet, so nothing to say about them.
>>>
>>> The current setup is 200GB of cache for 18TB of disks. The new setup
>>> will be 1200GB for 64TB; curious to see what that does.
>>>
>>> Our main conclusion however is that it does smooth out the I/O pattern
>>> towards the disks, and that gives an overall better response from the
>>> disks.
>>>
>>> Wido
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
> 

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com