On 07.09.2017 10:44, Christian Balzer wrote:
> 
> Hello,
> 
> On Thu, 7 Sep 2017 08:03:31 +0200 Stefan Priebe - Profihost AG wrote:
> 
>> Hello,
>> On 07.09.2017 03:53, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote:
>>>   
>>>> We are planning a Jewel filestore-based cluster for a
>>>> performance-sensitive healthcare client, and the conservative OSD
>>>> choice is the Samsung SM863A.
>>>>  
>>>
>>> While I totally see where you're coming from, and despite having stated
>>> that I'll give Luminous and Bluestore some time to mature, I'd also be
>>> looking into it if I were in the planning phase now, with about 3 months
>>> before deployment.
>>> The inherent performance increase with Bluestore (and having something
>>> that hopefully won't need touching/upgrading for a while) shouldn't be
>>> ignored.   
>>
>> Yes and that's the point where i'm currently as well. Thinking about how
>> to design a new cluster based on bluestore.
>>
>>> The SSDs are fine; I've started using those recently (though not with
>>> Ceph yet), as Intel DC S36xx or 37xx are impossible to get.
>>> They're a bit slower in the write IOPS department, but good enough for me.
>>
>> I've never used the Intel DC ones, only the Samsungs. Are the Intel
>> ones really faster?
> I don't have any configuration right now where I could directly compare
> them (different HW, controllers, kernel and fio versions), but at least on
> paper a 200GB DC S3700 (unobtainium) with 32K random 4k IOPS (confirmed in
> tests) looks a lot better than the 240GB SM863A with 10K IOPS.
> 
> I dug into my archives, and for a wheezy (3.16 kernel) system I found
> results that had the aforementioned DC S3700 at 32K IOPS as per the specs,
> and an 845DC EVO 960GB at 12K write IOPS, also as expected from the specs.
> 
> On newer HW with recent kernels (4.9) and fio (I suspect the latter)
> things have changed to the point that the same fio command line as in the
> old tests now gives me results of over 70K IOPS for both 400GB DC S3710s
> and 960GB SM863As, both way higher than the specs.
> It seems to be basically CPU/IRQ bound at that point, leading me to
> believe that "--direct=1" no longer means the same thing.
> Adding "--sync=1" to the fio command, things become more sane, but are
> still odd and partially higher than expected.

OK, but the 845DC EVO is pretty old and also TLC rather than MLC, so I
don't think you can compare them.
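
For reference, the kind of sync write test being discussed is usually
something along these lines (just a sketch; the device path, runtime and
job name are placeholders, and writing to a raw device is destructive, so
only do it on a scratch disk):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=randwrite --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
      --name=journal-test

With --sync=1 fio opens the target with O_SYNC, which is a lot closer to
the synchronous journal-style writes Ceph does than --direct=1 alone.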


>> Have you disabled the FLUSH command for the Samsung ones?
> How would one do that?
> And since they supposedly have full power loss protection, why wouldn't
> that be the default?

Intel does this by default, but Samsung does not. I think it's a
different philosophy of handling it.

Intel says: hey, we have a capacitor, just ignore the flush command.
Samsung says: hey, we got a flush command, so do what the user wants and
flush the whole cache.
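
As far as I know, the usual way to get the same behaviour on drives with
full power loss protection is to tell the kernel the device cache is
write-through, so the block layer stops sending flushes. A sketch (sdX is
a placeholder, it needs a reasonably recent kernel, and it doesn't survive
a reboot, so it would go into a udev rule or rc.local):

  echo "write through" > /sys/block/sdX/queue/write_cache

Obviously only safe on SSDs that really have working power loss protection.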

>> They don't skip the command automatically like the Intel ones do. Sadly
>> the Samsung SM863 got more expensive over the last months. They were a
>> lot cheaper in the first month of 2016. Maybe the 2.5" Intel Optane SSDs
>> will change the game.
>>
> The Optane offerings right now leave me rather unimpressed at 65K write
> IOPS and 290MB/s write speed for their best (32GB) model.
> Not a fit for filestore journals given the write speed, and not so much
> for the DB part of Bluestore either.

Maybe the Intel DC S4600 SSD will be interesting; I always use the 2TB
models. I don't know the pricing yet.

>>>> but was wondering if anyone has seen a positive
>>>> impact from also using PCIe journals (e.g. Intel P3700 or even the
>>>> older 910 series) in front of such SSDs?
>>>>  
>>> NVMe journals (or WAL and DB space for Bluestore) are nice and can
>>> certainly help, especially if Ceph is tuned accordingly.
>>> Avoid non-DC NVMes; I doubt you can still get 910s, they are officially
>>> EOL.
>>> You want to match capabilities and endurance: a DC P3700 800GB would be
>>> an OK match for 3-4 SM863a 960GB drives, for example.
>>
>> That's a good point, but it makes the cluster more expensive. Currently,
>> while using filestore, I use one SSD for both journal and data, which
>> works fine.
>>
> Inline is fine if it fits your use case and the reduction in endurance is
> also calculated in and/or compensated for.
> I do the same with DC S3610s (very similar to the SM863As) on my
> cache-tier nodes. 

Currently I'm always observing a much longer lifetime with Ceph than the
manufacturers specify. The wear-out indicators or lifetime-remaining SMART
values are much higher than expected.
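
(I'm just reading those off smartctl; the attribute names differ per
vendor, e.g. Wear_Leveling_Count on the Samsungs and
Media_Wearout_Indicator on the Intels, so something like the following,
with sdX as a placeholder:

  smartctl -A /dev/sdX | egrep -i 'wear|percent'
)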

>> With bluestore we have block, DB and WAL, so we need 3 block devices per
>> OSD. If we need one PCIe or NVMe device per 3-4 devices, it gets much
>> more expensive per host - currently running 10 OSDs / SSDs per node.
>>
> Well, the OP was asking for performance, so price obviously goes up.
> If you're running SSD OSDs you can put all 3 on the same device and should
> be no worse off than before with filestore.
> Keep in mind that small writes also get "journaled" on the DB part, so
> double writes and endurance may not improve depending on your write
> patterns.

Sure, but it's always nice to get more performance for the same price, and
I was wondering whether this is possible with bluestore.
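
For the all-on-one-SSD case that would just be the default layout, i.e.
something like this (a sketch with placeholder device names; flags as of
Luminous' ceph-disk, so double-check against your release):

  # block, DB and WAL all on the same SSD
  ceph-disk prepare --bluestore /dev/sdX

  # or with DB and WAL split out onto a shared NVMe
  ceph-disk prepare --bluestore /dev/sdX \
      --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1

The question is whether the second variant buys enough extra performance
to justify one NVMe per 3-4 OSDs.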

> Something really fast for the WAL would likely help, but I have zero
> experience and very few written reports here to base that on.
>> Have you already done tests on how the performance changes with bluestore
>> while putting all 3 block devices on the same SSD?
>>
> Nope, and given my test clusters, it's likely going to be a while before I
> do anything with Bluestore on SSDs, never mind NVMes (of which I have none
> as nothing we do requires them at this point). 

Yes, I'm just thinking about a test cluster, or designing the next one
optimized for bluestore.

But I'm still not sure whether Luminous is the right release to start with,
or if I should wait for the next major release before trying bluestore.

Greets,
Stefan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
