Is c) the bcache solution? Real-life experience: unless you are really
hammering an enterprise SSD with writes, they last very, very long, and
even when a failure does happen you can typically see it coming in the
SMART wear levels months in advance.
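For reference, the wear counters are easy to watch with smartmontools
or nvme-cli (device name is just an example):

  # NVMe health log - watch the "Percentage Used" endurance estimate
  smartctl -a /dev/nvme0

  # same counter via nvme-cli ("percentage_used")
  nvme smart-log /dev/nvme0

Alerting once that estimate climbs towards 80-90% leaves plenty of
lead time to schedule a replacement.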
I would go for c), but if possible add one more NVMe to each host - we
have a 9-HDD + 3-SSD scenario here.

Jesper

Monday, 18 November 2019, 07.49 +0100 from kristof.cou...@gmail.com
<kristof.cou...@gmail.com>:

>Hi all,
>
>Thanks for the feedback. Though, just to be sure:
>
>1. There is no 30GB limit for the RocksDB size, if I understand
>correctly. If the metadata crosses that barrier, will the L4 part
>spill over to the primary device? Or will the whole RocksDB move?
>Or will it just stop and report that it is full?
>2. Since the WAL will also be written to that device, I assume a few
>additional GBs are still useful...
>
>With my setup (13x 14TB + 2x 1.6TB NVMe per host, 10 hosts) I have
>multiple possible scenarios:
>- Assigning 35GB of NVMe space per OSD (30GB for the DB, 5GB spare)
>would result in only 455GB being used (13 x 35GB). This is a pity,
>since I have 3.2TB of NVMe disk space...
>
>Options line-up:
>
>Option a: Not using the NVMe for block.db storage, but as RGW
>metadata pool.
>Advantages:
>- Impact of 1 defective NVMe is limited.
>- Fast storage for the metadata pool.
>Disadvantage:
>- RocksDB for each OSD is on the primary disk, resulting in slower
>performance of each OSD.
>
>Option b: Hardware mirror of the NVMe drives.
>Advantages:
>- Impact of 1 defective NVMe is limited.
>- Fast KV lookups for each OSD.
>Disadvantages:
>- I/O to the NVMe is serialized for all OSDs on a host. Though the
>NVMes are fast, I imagine there is still an impact.
>- 1TB of NVMe per host is not used.
>
>Option c: Split the NVMes across the OSDs.
>Advantage:
>- Fast RocksDB access - up to L3 (assuming spillover does its job).
>Disadvantages:
>- 1 defective NVMe impacts up to 7 OSDs (1 NVMe assigned to 7 or 6
>OSD daemons per host).
>- 2.7TB of NVMe space per host is not used.
>
>Option d: 1 NVMe disk for the OSDs, 1 for the RGW metadata pool.
>Advantages:
>- Fast RocksDB access - up to L3.
>- Fast RGW metadata pool (though limited to 5.3TB: the raw pool size
>will be 16TB, divided by 3 due to replication). I assume this
>already gives some possibilities.
>Disadvantages:
>- 1 defective NVMe might impact a complete host (all its OSDs might
>be using it for RocksDB storage).
>- 1TB of NVMe is not used.
>
>A tough menu to choose from, each with its possibilities... The
>initial idea was to assign 200GB of NVMe space per OSD, but this
>would result in a lot of unused space. I don't know if there is
>anything on the roadmap to adapt the RocksDB sizing to make better
>use of the available NVMe disk space.
>With all this information, I would assume the best option is option
>a. Since we will be using erasure coding for the RGW data pool (k=6,
>m=3), the impact of a defective NVMe would be too significant. The
>other alternative would be option b, but then we would be dealing
>with HW RAID, which is against all Ceph design rules.
>
>Any other options or (dis)advantages that I missed? Or any other
>opinions on which option to choose?
>
>Regards,
>
>Kristof
>
>On Fri, 15 Nov 2019 at 18:22, <vita...@yourcmc.ru> wrote:
>>Use 30 GB for all OSDs. Other values are pointless - see
>>https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>>
>>You can use the rest of the free NVMe space for bcache - it's much
>>better than just allocating it for block.db.
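PS: if you do go for c), the per-OSD provisioning would look roughly
like this (VG/LV and device names are just placeholders - one ~30GB
db LV per HDD, so 6 or 7 per NVMe):

  # carve a db LV per OSD out of the NVMe
  vgcreate ceph-db-nvme0 /dev/nvme0n1
  lvcreate -L 30G -n db-sdb ceph-db-nvme0

  # create the OSD with the HDD as data and the LV as block.db
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db ceph-db-nvme0/db-sdb

ceph-volume co-locates the WAL on the block.db device automatically
when only --block.db is given, so no separate WAL partition is needed.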