I can attest to this. I had a cluster that used S3510s for the first
rack and then switched to S3710s after that. We had 3TB drives, and
every single S3510 ran out of writes after 1.5 years. We noticed
because we tracked incredibly slow performance down to a subset of
OSDs, and each time they shared a common journal. This went on for
about 2 weeks and 4 journals; that was when we realized they were all
S3510 journals, and SMART showed that not only the journals we had
tracked down but all of the S3510s were out of writes. Replacing all
of your journals every 1.5 years is far more expensive than the
increased cost of the S3710s. That was our use case and experience,
but I'm fairly sure that any cluster large enough to fill at least
most of a rack will run into this sooner rather than later.
On Mon, May 1, 2017 at 11:15 AM Maxime Guyot <[email protected]> wrote:
Hi,
Lots of good info on SSD endurance in this thread.
For Ceph journals you should also consider the size of the backing
OSDs: the SSD journal won't last as long backing 5x8TB OSDs as it
will backing 5x1TB OSDs.

For example, a 480GB S3510 (275TB of endurance) backing 5x8TB (40TB
raw) OSDs provides very little headroom: assuming triple replication,
you will be able to fill the OSDs about twice and that's it
(275/(5x8)/3 ≈ 2.3).

At the other end of the scale, a 1.2TB S3710 backing 5x1TB OSDs will
be able to fill them 1620 times before running out of endurance
(24300/(5x1)/3).

Ultimately it depends on your workload. Some people can get away
with the S3510 as a journal if the workload is read-intensive, but in
most cases the higher endurance is a safe bet (S3710 or S3610).
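[Editor's note: the back-of-the-envelope arithmetic above is easy to
rerun for other drive and OSD sizes. This is a minimal sketch under the
same assumptions (every OSD write passes through its journal once,
replication multiplies client writes); `journal_fills` is a made-up
helper name, not anything from Ceph.]

```python
# Rough journal-SSD endurance estimate, using the assumptions from
# this thread: all writes to an OSD pass through its journal SSD, and
# replication multiplies the writes generated by client data.

def journal_fills(endurance_tb, osd_count, osd_size_tb, replication=3):
    """Times the backing OSDs can be filled with client data before
    the journal SSD exhausts its rated write endurance."""
    return endurance_tb / (osd_count * osd_size_tb * replication)

# 480GB S3510 (275TB endurance) backing 5x8TB OSDs:
print(round(journal_fills(275, 5, 8), 1))   # 2.3 fills
# 1.2TB S3710 (24300TB endurance) backing 5x1TB OSDs:
print(round(journal_fills(24300, 5, 1)))    # 1620 fills
```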
Cheers,
Maxime
On Mon, 1 May 2017 at 11:04 Jens Dueholm Christensen <[email protected]> wrote:
Sorry for top-posting, but...
The Intel 35xx drives are rated for a much lower DWPD
(drive-writes-per-day) than the 36xx or 37xx models.
Keep in mind that a single SSD that acts as journal for 5 OSDs
will receive ALL writes for those 5 OSDs before the data is
moved off to the OSDs' actual data drives.

Combined with the consumer/enterprise advice others have written
about, this means your SSD journal devices will see a very large
write volume over time.
The S3510 is rated for 0.3 DWPD for 5 years
(http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3510-spec.html)
The S3610 is rated for 3 DWPD for 5 years
(http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3610-spec.html)
The S3710 is rated for 10 DWPD for 5 years
(http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html)
A 480GB S3510 has no endurance left once you have written 0.275PB to it.
A 480GB S3610 has no endurance left once you have written 3.7PB to it.
A 400GB S3710 has no endurance left once you have written 8.3PB to it.
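[Editor's note: those endurance figures follow roughly from
DWPD x capacity x warranty period. A quick sketch of that product;
the official datasheet TBW figures differ somewhat from this simple
calculation, so treat these as ballpark numbers only.]

```python
# Approximate rated endurance from DWPD (drive writes per day) over
# the 5-year warranty. Datasheets quote somewhat different TBW
# figures, so these are ballpark estimates, not official specs.

def endurance_pb(capacity_gb, dwpd, years=5):
    """Petabytes written over the warranty at the rated DWPD."""
    return capacity_gb * dwpd * 365 * years / 1_000_000

print(round(endurance_pb(480, 0.3), 2))  # 0.26 (S3510 rated 0.275PB)
print(round(endurance_pb(480, 3), 2))    # 2.63 (S3610 rated 3.7PB)
print(round(endurance_pb(400, 10), 2))   # 7.3  (S3710 rated 8.3PB)
```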
This makes for quite a lot of difference over time - even if an
S3510 will only act as journal for 1 or 2 OSDs, it will wear
out much, much faster than the others.
And I know I've used the xx10 models above, but the xx00
models have all been replaced by those newer models now.
And yes, the xx10 models use MLC NAND, but so did the
xx00 models, which have a proven track record and delivered
what Intel promised in the datasheets.
You could try taking a look at some of the enterprise SSDs
that Samsung has launched. Price-wise they are very competitive
with Intel's, but I want to see (or at least hear from others)
whether they can deliver what their datasheets promise.
Samsung's consumer SSDs (840/850 Pro) did not, so I'm only
using S3710s in my cluster.
Before I created our own cluster some time ago, I found these
threads on the mailing list about the exact same disks we had
been expecting to use (Samsung 840/850 Pro); we quickly changed
to Intel S3710s:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
https://www.mail-archive.com/[email protected]/msg17369.html
A longish thread about Samsung consumer drives:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000572.html
Highlights from that thread:
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000610.html
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000611.html
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000798.html
Regards,
Jens Dueholm Christensen
Rambøll Survey IT
-----Original Message-----
From: ceph-users [mailto:[email protected]] On Behalf Of Adam Carheden
Sent: Wednesday, April 26, 2017 5:54 PM
To: [email protected]
Subject: Re: [ceph-users] Sharing SSD journals and SSD drive choice
Thanks everyone for the replies.

I will be avoiding TLC drives; they were just something easy to
benchmark with existing equipment. I hadn't thought of unscrupulous
data durability claims or performance suddenly tanking in
unpredictable ways. I guess it all comes down to trusting the vendor,
since it would be expensive in time and $$ to test for such things.

Any thoughts on multiple Intel 35XX vs a single 36XX/37XX? All have
"DC" prefixes and are listed in the Data Center section of Intel's
marketing pages, so I assume they all have the same quality of
underlying NAND.
--
Adam Carheden
On 04/26/2017 09:20 AM, Chris Apsey wrote:
> Adam,
>
> Before we deployed our cluster, we did extensive testing on all kinds
> of SSDs, from consumer-grade TLC SATA all the way to enterprise PCI-E
> NVMe drives. We ended up going with a ratio of 1x Intel P3608 PCI-E
> 1.6TB to 12x HGST 10TB SAS3 HDDs. It provided the best
> price/performance/density balance for us overall. As a frame of
> reference, we have 384 OSDs spread across 16 nodes.
>
> A few (anecdotal) notes:
>
> 1. Consumer SSDs have unpredictable performance under load; write
> latency can go from normal to unusable with almost no warning.
> Enterprise drives generally show much less load sensitivity.
> 2. Write endurance: while it may appear that having several
> consumer-grade SSDs backing a smaller number of OSDs will yield
> better longevity than an enterprise-grade SSD backing a larger number
> of OSDs, the reality is that enterprise drives that use SLC or eMLC
> are generally an order of magnitude more reliable when all is said
> and done.
> 3. Power loss protection (PLP): consumer drives generally don't do
> well when power is suddenly lost. Yes, we should all have UPS, etc.,
> but things happen. Enterprise drives are much more tolerant of
> environmental failures. Recovering from misplaced objects while also
> attempting to serve clients is no fun.
>
> ---
> v/r
>
> Chris Apsey
> [email protected]
> https://www.bitskrieg.net
>
> On 2017-04-26 10:53, Adam Carheden wrote:
>> What I'm trying to get from the list is /why/ the "enterprise"
>> drives are important. Performance? Reliability? Something else?
>>
>> The Intel was the only one I was seriously considering. The others
>> were just ones I had for other purposes, so I thought I'd see how
>> they fared in benchmarks.
>>
>> The Intel was the clear winner, but my tests did show that
>> throughput tanked with more threads. Hypothetically, if I were
>> throwing 16 OSDs at it, all with osd op threads = 2, do the
>> benchmarks below not show that the Hynix would be a better choice
>> (at least for performance)?
>>
>> Also, 4x Intel DC S3520 cost as much as 1x Intel DC S3610. Obviously
>> the single drive leaves more bays free for OSD disks, but is there
>> any other reason a single S3610 is preferable to 4 S3520s? Wouldn't
>> 4x S3520s mean:
>>
>> a) fewer OSDs go down if the SSD fails
>>
>> b) better throughput (I'm speculating that the S3610 isn't 4 times
>> faster than the S3520)
>>
>> c) load spread across 4 SATA channels (I suppose this doesn't really
>> matter since the drives can't saturate the SATA bus)?
>>
>> --
>> Adam Carheden
>>
>> On 04/26/2017 01:55 AM, Eneko Lacunza wrote:
>>> Adam,
>>>
>>> What David said before about SSD drives is very important. I will
>>> put it another way: use enterprise-grade SSD drives, not
>>> consumer-grade. Also, pay attention to endurance.
>>>
>>> The only suitable drive for Ceph I see in your tests is the
>>> SSDSC2BB150G7, and it probably isn't even the most suitable SATA
>>> SSD from Intel; better to use the S3610 or S3710 series.
>>>
>>> Cheers
>>> Eneko
>>>
>>>> On 25/04/17 at 21:02, Adam Carheden wrote:
>>>>> On 04/25/2017 11:57 AM, David wrote:
>>>>> On 19 Apr 2017 18:01, "Adam Carheden" <[email protected]> wrote:
>>>>>
>>>>>     Does anyone know if XFS uses a single thread to write to its
>>>>>     journal?
>>>>>
>>>>> You probably know this but just to avoid any confusion: the
>>>>> journal in this context isn't the metadata journaling in XFS,
>>>>> it's a separate journal written to by the OSD daemons.
>>>> Ha! I didn't know that.
>>>>
>>>>> I think the number of threads per OSD is controlled by the 'osd
>>>>> op threads' setting, which defaults to 2.
>>>> So the ideal (for performance) Ceph cluster would be one SSD per
>>>> HDD with 'osd op threads' set to whatever value fio shows as the
>>>> optimal number of threads for that drive, then?
>>>>
>>>>> I would avoid the SanDisk and Hynix. The S3500 isn't too bad.
>>>>> Perhaps consider going up to a 37xx and putting more OSDs on it.
>>>>> Of course with the caveat that you'll lose more OSDs if it goes
>>>>> down.
>>>> Why would you avoid the SanDisk and Hynix? Reliability (I think
>>>> those two are both TLC)? Brand trust? If it's my benchmarks in my
>>>> previous email, why not the Hynix? It's slower than the Intel, but
>>>> sort of decent, at least compared to the SanDisk.
>>>>
>>>> My final numbers are below, including an older Samsung Evo (MLC,
>>>> I think), which did horribly, though not as badly as the SanDisk.
>>>> The Seagate is a 10kRPM SAS "spinny" drive I tested as a
>>>> control/SSD-to-HDD comparison.
>>>>
>>>> SanDisk SDSSDA240G,          fio  1 jobs:   7.0 MB/s (5 trials)
>>>> SanDisk SDSSDA240G,          fio  2 jobs:   7.6 MB/s (5 trials)
>>>> SanDisk SDSSDA240G,          fio  4 jobs:   7.5 MB/s (5 trials)
>>>> SanDisk SDSSDA240G,          fio  8 jobs:   7.6 MB/s (5 trials)
>>>> SanDisk SDSSDA240G,          fio 16 jobs:   7.6 MB/s (5 trials)
>>>> SanDisk SDSSDA240G,          fio 32 jobs:   7.6 MB/s (5 trials)
>>>> SanDisk SDSSDA240G,          fio 64 jobs:   7.6 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  1 jobs:   4.2 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  2 jobs:   0.6 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  4 jobs:   7.5 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  8 jobs:  17.6 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio 16 jobs:  32.4 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio 32 jobs:  64.4 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio 64 jobs:  71.6 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio  1 jobs:   2.2 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio  2 jobs:   3.9 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio  4 jobs:   7.1 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio  8 jobs:  12.0 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio 16 jobs:  18.3 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio 32 jobs:  25.4 MB/s (5 trials)
>>>> SAMSUNG SSD,                 fio 64 jobs:  26.5 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio  1 jobs:  91.2 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio  2 jobs: 132.4 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio  4 jobs: 138.2 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio  8 jobs: 116.9 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio 16 jobs:  61.8 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio 32 jobs:  22.7 MB/s (5 trials)
>>>> INTEL SSDSC2BB150G7,         fio 64 jobs:  16.9 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio  1 jobs:   0.7 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio  2 jobs:   0.9 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio  4 jobs:   1.6 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio  8 jobs:   2.0 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio 16 jobs:   4.6 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio 32 jobs:   6.9 MB/s (5 trials)
>>>> SEAGATE ST9300603SS,         fio 64 jobs:   0.6 MB/s (5 trials)
>>>>
>>>> For those who come across this and are looking at drives for
>>>> purposes other than Ceph: those are all sequential write numbers
>>>> with caching disabled, a very Ceph-journal-specific test. The
>>>> SanDisk held its own against the Intel in some benchmarks on
>>>> Windows that didn't disable caching. It may very well be a
>>>> perfectly good drive for other purposes.
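[Editor's note: the original numbers came from fio; this is only a
hedged sketch of the same idea, sequential small writes where each
write is flushed to stable media, which mimics how a Ceph journal
writes. `sync_write_mbps`, the path, and the sizes are made up for
illustration, and absolute results will differ from fio's.]

```python
# Sequential synchronous-write throughput sketch: open with O_DSYNC so
# every write() returns only after reaching stable storage, similar to
# a Ceph journal's write pattern. POSIX-only (O_DSYNC/O_SYNC).
import os
import time

def sync_write_mbps(path, block_size=4096, total_mb=4):
    """Return MB/s for sequential synchronous writes to `path`."""
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)  # fall back if needed
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o600)
    buf = b"\0" * block_size
    blocks = total_mb * 1024 * 1024 // block_size
    start = time.perf_counter()
    for _ in range(blocks):
        os.write(fd, buf)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.remove(path)
    return total_mb / elapsed

print(f"{sync_write_mbps('/tmp/journal-test.bin'):.1f} MB/s")
```

A drive with no power-loss-protected cache must flush NAND on every
synced write, which is why the consumer drives above collapse under
this pattern while posting fine numbers in cached benchmarks.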
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> [email protected]
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>