We are starting to use 18 TB spindles and have loads of cold data with only a
thin layer of hot data. One 4-8 TB NVMe drive as a cache in front of 6x18 TB
should provide close to, or even matching, SSD performance for the hot data at
a reasonable extra cost per TB of storage. My plan is to wait 1-2 more years
for PCIe NVMe prices to drop and then start using this method. The second
advantage is that one can continue to deploy colocated HDD OSDs, as the WAL/DB
will almost certainly land and stay in the cache. The cache can be added to
existing OSDs without redeployment. In addition, dm-cache uses a hit-count
method for deciding promotion to cache, which works very differently from
promotion to Ceph cache pools; dm-cache can afford that due to its local
nature. In particular, it does not promote on a single access, which means
that a weekly or monthly backup will not flush the entire cache every time.
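For reference, attaching a dm-cache to an existing LVM OSD could look roughly
like the sketch below. The VG name "vg0", the LV name "osd0" and the device
paths are hypothetical placeholders, and note that with writeback mode a
failure of the cache device takes the OSD down with it:

```shell
# Add the NVMe to the VG that holds the HDD-backed OSD LV (names hypothetical)
pvcreate /dev/nvme0n1
vgextend vg0 /dev/nvme0n1

# Carve out a cache pool on the NVMe; the default smq policy does the
# hit-count-based promotion described above
lvcreate --type cache-pool -L 700G -n osd0-cache vg0 /dev/nvme0n1

# Attach the cache pool to the existing OSD LV -- data stays in place,
# so no OSD redeployment is needed
lvconvert --type cache --cachepool vg0/osd0-cache \
          --cachemode writeback vg0/osd0
```

Should the cache ever need to be removed again, `lvconvert --splitcache
vg0/osd0` flushes and detaches it without touching the origin LV.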

All-SSD pools for this data (CephFS in an EC pool on HDD) will be unaffordable
for us for a long time. Not to mention that these large SSDs are almost
certainly QLC, which has much lower sustained throughput than the 18 TB
helium drives (QLC does have higher IOP/s, but that is not so relevant for our
FS workloads). The cache method will provide at least the additional IOP/s
that WAL/DB devices would, but due to its size it also caches data. We need to
go NVMe because the servers we plan to use (R740xd2) offer the largest
capacity configuration as 24xHDD + 4xPCIe NVMe: you can choose either 2 extra
drives or 4 PCIe NVMe, but not both. So the NVMe devices cannot be exchanged
for fast SSDs, as those would eat drive slots.

There were a few threads over the past 1-2 years in which people dropped in
some of these observations, and I took note. The approach is already used in
production, and from what I gather people are happy with it. It is much easier
than WAL/DB partitions, and all the sizing problems for L0/L1/... are sorted
trivially. With NVMe sizes growing rapidly beyond what WAL/DB devices can
utilize, and since LVM is now the standard OSD device, using LVM dm-cache
seems to be the way forward for me.
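To check whether the hit-count promotion actually keeps the hot data (and the
WAL/DB) resident, the cache counters can be read back through LVM or
device-mapper. A sketch, again with the hypothetical names "vg0"/"osd0":

```shell
# Report cache occupancy and hit/miss counters for the cached LV
lvs -a -o name,size,cache_total_blocks,cache_used_blocks,\
cache_read_hits,cache_read_misses vg0

# Or query device-mapper directly for the raw cache status line
dmsetup status vg0-osd0
```

A read-hit ratio that stays high across a backup run would confirm that the
promotion policy is not flushing the cache on sequential scans.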

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <[email protected]>
Sent: 16 November 2020 03:00:38
To: Frank Schilder
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

Thanks.  I’m curious how the economics for that compare with just using all 
SSDs:

* HDDs are cheaper
* But colo SSDs are operationally simpler
* And depending on configuration you can provision a cheaper HBA


> On Nov 14, 2020, at 2:04 AM, Frank Schilder <[email protected]> wrote:
>
> My plan is to use at least 500GB NVMe per HDD OSD. I have not started that 
> yet, but there are threads of other people sharing their experience. If you 
> go beyond 300GB per OSD, apparently the WAL/DB options cannot really use the 
> extra capacity. With dm-cache or the like you would additionally start 
> holding hot data in cache.
>
> Ideally, I can split a 4TB or even a 8TB NVMe over 6 OSDs.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <[email protected]>
> Sent: 14 November 2020 10:57:57
> To: Frank Schilder
> Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
> servers osd better?
>
> Guten Tag.
>
>> My plan for the future is to use dm-cache for LVM OSDs instead of WAL/DB 
>> device.
>
> Do you have any insights into the benefits of that approach instead of 
> WAL/DB, and of dm-cache vs bcache vs dm-writecache vs … ?  And any for sizing 
> the cache device and handling failures?  Presumably the DB will be active 
> enough that it will persist in the cache, so sizing should be at a minimum 
> that to hold 2 copies of the DB to accommodate compaction?
>
> I have an existing RGW cluster on HDDs that utilizes a cache tier; the high 
> water mark is set fairly low so that it doesn’t fill up, something that 
> apparently happened last Christmas.  I’ve been wanting to get a feel for OSD 
> cache as an alternative to deprecated and fussy cache tiering, as well as 
> something like a Varnish cache on RGW load balancers to short-circuit small 
> requests.
>
> — Anthony
>
>
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]