Is c) the bcache solution? Real-life experience: unless you are really
hammering an enterprise SSD with writes, they last very, very long, and
even when a failure does happen you can typically see it coming in the
SMART wear levels months in advance.
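For reference, the wear counters are easy to watch with smartmontools
or nvme-cli (device name is just an example):

  # NVMe health log - watch the "Percentage Used" endurance estimate
  smartctl -a /dev/nvme0

  # same counter via nvme-cli ("percentage_used")
  nvme smart-log /dev/nvme0

Alerting once that estimate climbs towards 80-90% leaves plenty of
lead time to schedule a replacement.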
I would go for c), but if possible add one more NVMe to each host - we
have a 9-HDD + 3-SSD scenario here.

Jesper

Monday, 18 November 2019, 07.49 +0100 from kristof.cou...@gmail.com
<kristof.cou...@gmail.com>:

>Hi all,
>
>Thanks for the feedback. Though, just to be sure:
>
>1. There is no 30GB limit for the RocksDB size, if I understand
>correctly. If the metadata crosses that barrier, will the L4 part
>spill over to the primary device? Or will the whole RocksDB move?
>Or will it just stop and report that it is full?
>2. Since the WAL will also be written to that device, I assume a few
>additional GBs are still useful...
>
>With my setup (13x 14TB + 2x 1.6TB NVMe per host, 10 hosts) I have
>multiple possible scenarios:
>- Assigning 35GB of NVMe space per OSD (30GB for the DB, 5GB spare)
>would result in only 455GB being used (13 x 35GB). This is a pity,
>since I have 3.2TB of NVMe disk space...
>
>Options line-up:
>
>Option a: Not using the NVMe for block.db storage, but as RGW
>metadata pool.
>Advantages:
>- Impact of 1 defective NVMe is limited.
>- Fast storage for the metadata pool.
>Disadvantage:
>- RocksDB for each OSD is on the primary disk, resulting in slower
>performance of each OSD.
>
>Option b: Hardware mirror of the NVMe drives.
>Advantages:
>- Impact of 1 defective NVMe is limited.
>- Fast KV lookups for each OSD.
>Disadvantages:
>- I/O to the NVMe is serialized for all OSDs on a host. Though the
>NVMes are fast, I imagine there is still an impact.
>- 1TB of NVMe per host is not used.
>
>Option c: Split the NVMes across the OSDs.
>Advantage:
>- Fast RocksDB access - up to L3 (assuming spillover does its job).
>Disadvantages:
>- 1 defective NVMe impacts up to 7 OSDs (1 NVMe assigned to 7 or 6
>OSD daemons per host).
>- 2.7TB of NVMe space per host is not used.
>
>Option d: 1 NVMe disk for the OSDs, 1 for the RGW metadata pool.
>Advantages:
>- Fast RocksDB access - up to L3.
>- Fast RGW metadata pool (though limited to 5.3TB: the raw pool size
>will be 16TB, divided by 3 due to replication). I assume this
>already gives some possibilities.
>Disadvantages:
>- 1 defective NVMe might impact a complete host (all its OSDs might
>be using it for RocksDB storage).
>- 1TB of NVMe is not used.
>
>A tough menu to choose from, each with its possibilities... The
>initial idea was to assign 200GB of NVMe space per OSD, but this
>would result in a lot of unused space. I don't know if there is
>anything on the roadmap to adapt the RocksDB sizing to make better
>use of the available NVMe disk space.
>With all this information, I would assume the best option is option
>a. Since we will be using erasure coding for the RGW data pool (k=6,
>m=3), the impact of a defective NVMe would be too significant. The
>other alternative would be option b, but then we would be dealing
>with HW RAID, which is against all Ceph design rules.
>
>Any other options or (dis)advantages that I missed? Or any other
>opinions on which option to choose?
>
>Regards,
>
>Kristof
>
>On Fri, 15 Nov 2019 at 18:22, <vita...@yourcmc.ru> wrote:
>>Use 30 GB for all OSDs. Other values are pointless - see
>>https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>>
>>You can use the rest of the free NVMe space for bcache - it's much
>>better than just allocating it for block.db.
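PS: if you do go for c), the per-OSD provisioning would look roughly
like this (VG/LV and device names are just placeholders - one ~30GB
db LV per HDD, so 6 or 7 per NVMe):

  # carve a db LV per OSD out of the NVMe
  vgcreate ceph-db-nvme0 /dev/nvme0n1
  lvcreate -L 30G -n db-sdb ceph-db-nvme0

  # create the OSD with the HDD as data and the LV as block.db
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db ceph-db-nvme0/db-sdb

ceph-volume co-locates the WAL on the block.db device automatically
when only --block.db is given, so no separate WAL partition is needed.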