Ah, our old friend the P5316.

A few things to remember about these:

* The 64KB IU (indirection unit) means that you'll burn through endurance if 
you do a lot of writes smaller than that.  The firmware will try to coalesce 
smaller writes, especially if they're sequential.  You probably want to keep 
your RGW / CephFS index / metadata pools on other media.


* With Quincy or later and a reasonably recent kernel you can set 
bluestore_use_optimal_io_size_for_min_alloc_size to true, and OSDs deployed on 
these should automatically be created with a 64KB min_alloc_size.  If you're 
writing a lot of objects smaller than, say, 256KB -- especially if using EC -- 
a more nuanced approach may be warranted.  ISTR that your data are large 
sequential files, so you can probably exploit this.  For sure you don't want 
these OSDs created with the default 4KB min_alloc_size; that would mean lower 
write performance and especially endurance burn.  min_alloc_size cannot be 
changed after an OSD is created; you'd have to destroy and redeploy.
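
Concretely, something along these lines before the OSDs are deployed -- the 
option names are per current releases, so double-check against yours; 
65536 == 64KB:

    # let BlueStore derive min_alloc_size from the device's advertised optimal IO size
    ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true

    # or pin it explicitly for SSD-backed OSDs (this hits every ssd OSD deployed
    # afterward, so scope it to the QLC hosts if that's too broad)
    ceph config set osd bluestore_min_alloc_size_ssd 65536

    # after deployment, verify what an OSD was actually created with
    # (IIRC recent releases report bluestore_min_alloc_size in the OSD metadata)
    ceph osd metadata osd.XX | grep min_alloc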

cf. https://github.com/ceph/ceph/pulls?q=is%3Apr+author%3Acurtbruns

https://www.youtube.com/watch?v=w91e0EjWD6E
Optimizing RGW Object Storage Mixed Media through Storage Classes and Lua 
Scripting

> On Oct 24, 2023, at 11:42, Matt Larson <[email protected]> wrote:
> 
> I am looking to create a new pool that would be backed by a particular set
> of drives that are larger nVME SSDs (Intel SSDPF2NV153TZ, 15TB drives).
> Particularly, I am wondering about what is the best way to move devices
> from one pool and to direct them to be used in a new pool to be created. In
> this case, the documentation suggests I could want to assign them to a new
> device-class and have a placement rule that targets that device-class in
> the new pool.

If you're using cephadm / ceph orch you can craft an OSD spec that uses or 
ignores drives based on size or model.
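
Something along these lines -- the service_id and filters are just examples, 
adjust to taste and check `ceph orch apply -i` against your release:

    service_type: osd
    service_id: qlc_osds
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        # match on capacity (10TB and up) ...
        size: '10T:'
        # ... or on the model string instead
        # model: 'SSDPF2NV153TZ'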

Multiple pools can share OSDs, though for your use case you probably don't want 
them to.

> 
> Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the
> larger 15TB drives were automatically assigned to the 'ssd' device class
> that is in use by a different pool. The `ssd` device classes are used in a
> placement rule targeting that class.

The names of device classes are actually semi-arbitrary.  The above distinction 
is made on the basis of whether or not the kernel believes a given device to 
rotate.


> The documentation describes that I could set a device class for an OSD with
> a command like:
> 
> `ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`
> 
> Class names can be arbitrary strings like 'big_nvme".  

or "qlc"

> Before setting a new
> device class to an OSD that already has an assigned device class, should
> use `ceph osd crush rm-device-class ssd osd.XX`.

Yep.  I suspect that's a guardrail against inadvertently trampling an existing 
assignment.
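
IIRC rm-device-class just takes the OSD IDs -- it removes whichever class is 
set -- so the dance per OSD is something like ("qlc" being an example name):

    ceph osd crush rm-device-class osd.XX
    ceph osd crush set-device-class qlc osd.XX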

> 
> Can I proceed to directly remove these OSDs from the current device class
> and assign to a new device class?

Carpe NAND!
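
Then point the new pool at them with a rule that targets the new class, 
roughly (names are placeholders):

    # replicated rule restricted to the new device class, host failure domain
    ceph osd crush rule create-replicated qlc_rule default host qlc

    # new pool pinned to that rule
    ceph osd pool create qlc_pool
    ceph osd pool set qlc_pool crush_rule qlc_rule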

> Should they be moved one by one? What is
> the way to safely protect data from the existing pool that they are mapped
> to?

Are there other SSDs in said existing pool?  If you reassign all of these, will 
there be enough survivors to meet replication policy and hold all the data?

One by one would be safe.  Doing more than one might be faster and more 
efficient, depending on your hardware and topology.  For sure you don't want to 
reassign more than one per CRUSH failure domain at a time (host, rack, depends 
on your setup).  If your topology, RAM, and clients are amenable, you could do 
all OSDs in a single failure domain at once, then proceed to the next only 
after all PGs are active+clean.
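
i.e. per failure domain, something like (pool name is a placeholder):

    # sanity check first: will the remaining ssd-class OSDs still satisfy the pool?
    ceph osd pool get <existing-pool> size
    ceph osd df tree

    # after reassigning one host's worth of OSDs (the two crush commands above),
    # let everything settle before touching the next failure domain
    ceph -s    # repeat until all PGs are active+clean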

> 
> Thanks,
>  Matt
> 
> -- 
> Matt Larson, PhD
> Madison, WI  53705 U.S.A.
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
