Re: [ceph-users] NVMe disk - size

2019-11-17 Thread Lars Täuber
Hi Kristof,

may I add another choice?
I configured my SSDs this way.

Every OSD host has two fast and durable SSDs.
Both SSDs are in one RAID1, which is then split up into LVs.

I took 58 GB per OSD for DB & WAL (including some headroom for a special
action of the DB - compaction, I believe).
That left a few hundred GB on the RAID1, which I used to form an
additional, faster SSD-OSD.
This OSD is put into its own device class.
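
Roughly, it looks like this (a sketch only - the md device, VG/LV names and
the data disk are placeholders, and ceph-volume keeps the WAL next to the DB
when only --block.db is given):

  # one VG on the RAID1, one 58G DB/WAL LV per HDD-OSD,
  # and the remainder becomes its own SSD-backed OSD
  vgcreate vg_fast /dev/md0
  lvcreate -n db_osd0 -L 58G vg_fast
  lvcreate -n ssd_osd -l 100%FREE vg_fast
  ceph-volume lvm create --data /dev/sda --block.db vg_fast/db_osd0
  ceph-volume lvm create --data vg_fast/ssd_osd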

So I have (slower) pools put onto OSDs of class "hdd" and (faster) pools put 
onto OSDs of class "ssd".
The faster pools are used for metadata of CephFS.
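
The class and rule part is plain CRUSH device classes, something like this
(the OSD id, rule name and metadata pool name are only examples):

  # pin the leftover-SSD OSD to the "ssd" class if autodetection got it wrong
  ceph osd crush rm-device-class osd.42
  ceph osd crush set-device-class ssd osd.42
  # replicated rule limited to that class, used by the CephFS metadata pool
  ceph osd crush rule create-replicated fast_ssd default host ssd
  ceph osd pool set cephfs_metadata crush_rule fast_ssd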

Good luck,
Lars




Re: [ceph-users] NVMe disk - size

2019-11-17 Thread jesper

Is c) the bcache solution?

Real-life experience: unless you are really beating an enterprise SSD with
writes, they last very, very long - and even when a failure happens, you can
typically see it coming in the SMART wear levels months before.
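
For example, the wear indicators are easy to watch (device names are just
examples):

  # NVMe: "percentage_used" creeps up long before the drive dies
  nvme smart-log /dev/nvme0 | grep percentage_used
  # SATA SSDs expose similar attributes (names vary per vendor)
  smartctl -A /dev/sda | grep -i -e wear -e used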

I would go for c), but if possible add one more NVMe to each host - we have
a 9-HDD + 3-SSD setup here.

Jesper



Sent from myMail for iOS




Re: [ceph-users] NVMe disk - size

2019-11-17 Thread Kristof Coucke
Hi all,

Thanks for the feedback.
Though, just to be sure:

1. If I understand correctly, there is no hard 30 GB limit for the RocksDB
size. If the metadata crosses that barrier, will the L4 part spill over to
the primary device (see the quick check below)? Or will the whole RocksDB
move? Or will it just stop and report that it is full?
2. Since the WAL will also be written to that device, I assume a few
additional GBs are still useful...
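
(If I read the wiki link below correctly, the ~30 GB figure is simply where
RocksDB's default level sizes end up: roughly 0.25 GB for L1, 2.5 GB for L2
and 25 GB for L3 with the default x10 multiplier, so about 28 GB fits before
the next level would need ~256 GB. I assume I can at least watch for
spillover with something like the following - osd.0 is just an example:)

  # Nautilus flags spillover as BLUESTORE_SPILLOVER in "ceph health detail";
  # the raw numbers live in the bluefs perf counters
  ceph daemon osd.0 perf dump bluefs | grep -E 'db_(total|used)_bytes|slow_used_bytes'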

With my setup (13x 14 TB HDDs + 2 NVMe drives of 1.6 TB per host, 10 hosts)
I have multiple possible scenarios:
- Assigning 35 GB of NVMe space per OSD (30 GB for DB, 5 GB spare) would
result in only 455 GB being used (13 x 35 GB). This is a pity, since I have
3.2 TB of NVMe disk space...

Options line-up:

*Option a*: Not using the NVMe for block.db storage, but as an RGW metadata
pool.
Advantages:
- Impact of 1 defective NVMe is limited.
- Fast storage for the metadata pool.
Disadvantage:
- RocksDB for each OSD is on the primary disk, resulting in slower
performance of each OSD.

*Option b*: Hardware mirror of the NVMe drives
Advantages:
- Impact of 1 defective NVMe is limited
- Fast KV lookup for each OSD
Disadvantage:
- I/O to the NVMe is serialized for all OSDs on a host. Though the NVMes are
fast, I imagine there is still an impact.
- ~1 TB of NVMe is not used per host

*Option c*: Split the NVMes across the OSDs (see the ceph-volume sketch
below the option list)
Advantages:
- Fast RocksDB access - up to L3 (assuming spillover does its job)
Disadvantage:
- 1 defective NVMe impacts at most 7 OSDs (1 NVMe assigned to 6 or 7 OSD
daemons per host)
- 2.7 TB of NVMe space not used per host

*Option d*: 1 NVMe disk for the OSDs' block.db, 1 for an RGW metadata pool
Advantages:
- Fast RocksDB access - up to L3
- Fast RGW metadata pool (though limited to 5.3 TB: the raw pool size will
be 16 TB, divided by 3 due to replication); I assume this already gives
some possibilities
Disadvantages:
- 1 defective NVMe might impact a complete host (all OSDs might be using it
for the RocksDB storage)
- ~1 TB of NVMe is not used
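
For option c, the plumbing itself is simple with ceph-volume's batch mode;
a sketch (device names are placeholders, and --block-db-size needs a recent
ceph-volume):

  # 6-7 HDDs per NVMe, with explicit 30G DB LVs carved out by ceph-volume
  ceph-volume lvm batch --block-db-size 30G /dev/sd{b..h} --db-devices /dev/nvme0n1
  ceph-volume lvm batch --block-db-size 30G /dev/sd{i..n} --db-devices /dev/nvme1n1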

Quite a menu to choose from, each with its possibilities... The initial idea
was to assign 200 GB of NVMe space per OSD, but this would result in a lot
of unused space. I don't know if there is anything on the roadmap to adapt
the RocksDB sizing to make better use of the available NVMe disk space.
With all the information, I would assume that the best option would be
*option a*. Since we will be using erasure coding for the RGW data pool
(k=6, m=3), the impact of a defective NVMe taking down a whole set of OSDs
would be too significant. The other alternative would be option b, but then
again we would be dealing with HW RAID, which goes against all Ceph design
rules.

Any other options or (dis)advantages I missed? Or any other opinions on
which option to choose?

Regards,

Kristof

On Fri, 15 Nov 2019 at 18:22, vita...@yourcmc.ru wrote:

> Use 30 GB for all OSDs. Other values are pointless, because
> https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> You can use the rest of free NVMe space for bcache - it's much better
> than just allocating it for block.db.
>
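
For reference, the bcache route would look roughly like this (device names
are placeholders; the OSD is then created on the resulting bcache device):

  # format one HDD as backing device and an NVMe slice as its cache,
  # attached together; further HDDs can join the same cache set later
  make-bcache -B /dev/sdb -C /dev/nvme0n1p2
  ceph-volume lvm create --data /dev/bcache0
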
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com