Re: [ceph-users] New CRUSH device class questions
On Wed, Aug 7, 2019 at 7:05 AM Paul Emmerich wrote:

> ~ is the internal implementation of device classes. Internally it's
> still using separate roots; that's how it stays compatible with older
> clients that don't know about device classes.

That makes sense.

> And since it wasn't mentioned here yet: consider upgrading to Nautilus
> to benefit from the new and improved accounting for metadata space.
> You'll be able to see how much space is used for metadata, and quotas
> should work properly for metadata usage.

I think I'm not explaining this well and it is confusing people. I don't want to limit the size of the metadata pool, and I also don't want to limit the size of the data pool, because the cluster's flexibility could make such a quota out of date at any time and therefore mostly useless (we want to use as much space as possible for data). I would like to reserve space for the metadata pool so that no other pool can touch it, much like thick-provisioning a VM disk file: the space is guaranteed to that entity and no one else can use it, even if it is mostly empty. So far people have only told me how to limit the space of a pool, which is not what I'm looking for.

Thank you,
Robert LeBlanc

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] New CRUSH device class questions
On Wed, Aug 7, 2019 at 9:30 AM Robert LeBlanc wrote:

>> # ceph osd crush rule dump replicated_racks_nvme
>> {
>>     "rule_id": 0,
>>     "rule_name": "replicated_racks_nvme",
>>     "ruleset": 0,
>>     "type": 1,
>>     "min_size": 1,
>>     "max_size": 10,
>>     "steps": [
>>         {
>>             "op": "take",
>>             "item": -44,
>>             "item_name": "default~nvme"    <--
>>         },
>>         {
>>             "op": "chooseleaf_firstn",
>>             "num": 0,
>>             "type": "rack"
>>         },
>>         {
>>             "op": "emit"
>>         }
>>     ]
>> }
>
> Yes, our HDD cluster is much like this, but not Luminous, so we created a
> separate root with SSD OSDs for the metadata and set up a CRUSH rule for the
> metadata pool to be mapped to SSD. I understand that the CRUSH rule should
> have a `step take default class ssd`, which I don't see in your rule unless
> the `~` in the item_name means device class.

~ is the internal implementation of device classes. Internally it's still using separate roots; that's how it stays compatible with older clients that don't know about device classes.

And since it wasn't mentioned here yet: consider upgrading to Nautilus to benefit from the new and improved accounting for metadata space. You'll be able to see how much space is used for metadata, and quotas should work properly for metadata usage.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

> Thanks
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] New CRUSH device class questions
On 8/7/19 2:30 PM, Robert LeBlanc wrote:

> ... plus 11 more hosts just like this

Interesting. Please paste your full `ceph osd df tree`. What are your NVMe models, exactly?

> Yes, our HDD cluster is much like this, but not Luminous, so we created a
> separate root with SSD OSDs for the metadata and set up a CRUSH rule for the
> metadata pool to be mapped to SSD. I understand that the CRUSH rule should
> have a `step take default class ssd`, which I don't see in your rule unless
> the `~` in the item_name means device class.

Indeed, this is a device class. A new CRUSH rule may be created like this:

`ceph osd crush rule create-replicated <rule_name> <root> <failure_domain> <device_class>`

For me it is:

`ceph osd crush rule create-replicated replicated_racks_nvme default rack nvme`

k
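To connect this back to the metadata-pool question, a minimal sketch of pointing the CephFS metadata pool at such a class-restricted rule; the pool names `fs_meta`/`fs_data` and the rule name are taken from elsewhere in the thread, so adjust to your own cluster:

```
# Rule restricted to the nvme device class, replicas spread across racks
# (Konstantin's example above).
ceph osd crush rule create-replicated replicated_racks_nvme default rack nvme

# Assign the metadata pool to that rule, then verify which rule each pool uses.
ceph osd pool set fs_meta crush_rule replicated_racks_nvme
ceph osd pool get fs_meta crush_rule
ceph osd pool get fs_data crush_rule
```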
Re: [ceph-users] New CRUSH device class questions
On Wed, Aug 7, 2019 at 12:08 AM Konstantin Shalygin wrote:

> On 8/7/19 1:40 PM, Robert LeBlanc wrote:
>
> > Maybe it's the lateness of the day, but I'm not sure how to do that.
> > Do you have an example where all the OSDs are of class ssd?
>
> Can't parse what you mean. You should always paste your `ceph osd tree`
> first.

Our `ceph osd tree` is like this:

ID  CLASS WEIGHT    TYPE NAME                STATUS REWEIGHT PRI-AFF
 -1       892.21326 root default
 -3        69.16382     host sun-pcs01-osd01
  0   ssd   3.49309         osd.0                up  1.0     1.0
  1   ssd   3.42329         osd.1                up  0.87482 1.0
  2   ssd   3.49309         osd.2                up  0.88989 1.0
  3   ssd   3.42329         osd.3                up  0.94989 1.0
  4   ssd   3.49309         osd.4                up  0.93993 1.0
  5   ssd   3.42329         osd.5                up  1.0     1.0
  6   ssd   3.49309         osd.6                up  0.89490 1.0
  7   ssd   3.42329         osd.7                up  1.0     1.0
  8   ssd   3.49309         osd.8                up  0.89482 1.0
  9   ssd   3.42329         osd.9                up  1.0     1.0
100   ssd   3.49309         osd.100              up  1.0     1.0
101   ssd   3.42329         osd.101              up  1.0     1.0
102   ssd   3.49309         osd.102              up  1.0     1.0
103   ssd   3.42329         osd.103              up  0.81482 1.0
104   ssd   3.49309         osd.104              up  0.87973 1.0
105   ssd   3.42329         osd.105              up  0.86485 1.0
106   ssd   3.49309         osd.106              up  0.79965 1.0
107   ssd   3.42329         osd.107              up  1.0     1.0
108   ssd   3.49309         osd.108              up  1.0     1.0
109   ssd   3.42329         osd.109              up  1.0     1.0
 -5        62.24744     host sun-pcs01-osd02
 10   ssd   3.49309         osd.10               up  1.0     1.0
 11   ssd   3.42329         osd.11               up  0.72473 1.0
 12   ssd   3.49309         osd.12               up  1.0     1.0
 13   ssd   3.42329         osd.13               up  0.78979 1.0
 14   ssd   3.49309         osd.14               up  0.98961 1.0
 15   ssd   3.42329         osd.15               up  1.0     1.0
 16   ssd   3.49309         osd.16               up  0.96495 1.0
 17   ssd   3.42329         osd.17               up  0.94994 1.0
 18   ssd   3.49309         osd.18               up  1.0     1.0
 19   ssd   3.42329         osd.19               up  0.80481 1.0
110   ssd   3.49309         osd.110              up  0.97998 1.0
111   ssd   3.42329         osd.111              up  1.0     1.0
112   ssd   3.49309         osd.112              up  1.0     1.0
113   ssd   3.42329         osd.113              up  0.72974 1.0
116   ssd   3.49309         osd.116              up  0.91992 1.0
117   ssd   3.42329         osd.117              up  0.96997 1.0
118   ssd   3.49309         osd.118              up  0.93959 1.0
119   ssd   3.42329         osd.119              up  0.94481 1.0
... plus 11 more hosts just like this

How do you single out one OSD from each host for metadata only and prevent data from landing on that OSD when all the device classes are the same? It seems that you would need one OSD to be a different class to do that. In a previous email the conversation was:

> > Is it possible to add a new device class like 'metadata'?
>
> Yes, but you don't need this. Just use your existing class with another
> crush ruleset.

So, I'm trying to figure out how you use the existing class of 'ssd' with another CRUSH ruleset to accomplish the above.

> > Yes, we can set quotas to limit space usage (or number of objects), but
> > you can not reserve some space that other pools can't use. The problem
> > is if we set a quota for the CephFS data pool to the equivalent of 95%,
> > there are at least two scenarios that make that quota useless.
>
> Of course. In 95% of CephFS deployments the metadata pool is on flash
> drives with enough space for this.
>
> > ```
> > pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash
> > rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool
> > stripe_width 0 application cephfs
> > pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash
> > rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool
> > stripe_width 0 application cephfs
> > ```
> >
> > ```
> > # ceph osd crush rule dump replicated_racks_nvme
> > {
> >     "rule_id": 0,
> >     "rule_name": "replicated_racks_nvme",
> >     "ruleset": 0,
> >     "type": 1,
> >     "min_size": 1,
> >     "max_size": 10,
> >     "steps": [
> >         {
> >             "op": "take",
> >             "item": -44,
> >             "item_name": "default~nvme"    <--
> >         },
> >         {
> >             "op": "chooseleaf_firstn",
> >             "num": 0,
> >             "type": "rack"
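One approach that would seem to answer this, assuming the point made elsewhere in the thread that device classes are arbitrary strings, is to reclassify one OSD per host into a dedicated class and restrict a rule to it. A rough sketch only: the 'metadata' class name, the rule name, and the OSD ids are illustrative (one OSD per host, 13 in total, matching the count Robert mentions):

```
# Move one OSD on each host out of 'ssd' and into a custom class
# (an OSD has a single class, so the old one is removed first).
ceph osd crush rm-device-class osd.0 osd.10
ceph osd crush set-device-class metadata osd.0 osd.10
# ... repeat for one OSD on each of the remaining 11 hosts

# Replicated rule that only places PGs on 'metadata'-class OSDs,
# one replica per host.
ceph osd crush rule create-replicated fs_meta_only default host metadata

# Point the CephFS metadata pool at the new rule; the data pool keeps
# using its rack-based rule over the remaining 'ssd' OSDs.
ceph osd pool set fs_meta crush_rule fs_meta_only
```

Because no other rule selects the 'metadata' class, data-pool PGs can never be placed on those OSDs, which effectively reserves their capacity for metadata.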
Re: [ceph-users] New CRUSH device class questions
On 8/7/19 1:40 PM, Robert LeBlanc wrote:

> Maybe it's the lateness of the day, but I'm not sure how to do that.
> Do you have an example where all the OSDs are of class ssd?

Can't parse what you mean. You should always paste your `ceph osd tree` first.

> Yes, we can set quotas to limit space usage (or number of objects), but
> you can not reserve some space that other pools can't use. The problem
> is if we set a quota for the CephFS data pool to the equivalent of 95%,
> there are at least two scenarios that make that quota useless.

Of course. In 95% of CephFS deployments the metadata pool is on flash drives with enough space for this.

> ```
> pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool stripe_width 0 application cephfs
> pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool stripe_width 0 application cephfs
> ```
>
> ```
> # ceph osd crush rule dump replicated_racks_nvme
> {
>     "rule_id": 0,
>     "rule_name": "replicated_racks_nvme",
>     "ruleset": 0,
>     "type": 1,
>     "min_size": 1,
>     "max_size": 10,
>     "steps": [
>         {
>             "op": "take",
>             "item": -44,
>             "item_name": "default~nvme"    <--
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "rack"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
> ```

k
Re: [ceph-users] New CRUSH device class questions
On Tue, Aug 6, 2019 at 7:56 PM Konstantin Shalygin wrote:

> > Is it possible to add a new device class like 'metadata'?
>
> Yes, but you don't need this. Just use your existing class with another
> crush ruleset.

Maybe it's the lateness of the day, but I'm not sure how to do that. Do you have an example where all the OSDs are of class ssd?

> > If I set the device class manually, will it be overwritten when the OSD
> > boots up?
>
> Nope. Classes are assigned automatically when the OSD is created, not when it boots.

That's good to know.

> > I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
> > mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
> > the question.
>
> But it's not magic. Sometimes a drive can be a SATA SSD, but in the kernel it
> is 'rotational'...

I see, so it's not looking to see if the device is in /sys/class/pci or something.

> > We will still have 13 OSDs, it will be overkill for space for metadata, but
> > since Ceph lacks a reserve-space feature, we don't have many options. This
> > cluster is so fast that it can fill up in the blink of an eye.
>
> Not true. You can always set a per-pool quota in bytes, for example:
>
> * your meta is 1G;
> * your raw space is 300G;
> * your data is 90G;
>
> Set a quota on your data pool: `ceph osd pool set-quota <data_pool> max_bytes 96636762000`

Yes, we can set quotas to limit space usage (or number of objects), but you can not reserve some space that other pools can't use. The problem is that if we set a quota for the CephFS data pool to the equivalent of 95%, there are at least two scenarios that make that quota useless.

1. A host fails and the cluster recovers. The quota is now past the capacity of the cluster, so if the data pool fills up, no pool can write.

2. The CephFS data pool is an erasure-coded pool, and it shares OSDs with an RGW data pool that is 3x replicated. If more writes happen to the RGW data pool, then the quota will be past the capacity of the cluster.

Both of these cause metadata operations to not be committed and cause lots of problems with CephFS (you can't list a directory with a broken inode in it). We would prefer to get a truncated file rather than a broken file system. I wrote a script that calculates 95% of the pool capacity and sets the quota if the current quota is 1% out of balance. This is run by cron every 5 minutes.

If there is a way to reserve some capacity for a pool that no other pool can use, please provide an example. Think of reserved inode space in ext4/XFS/etc.

Thank you.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
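For reference, a rough sketch of the kind of cron job Robert describes: compute 95% of the data pool's current effective capacity from `ceph df` and update the quota only when it has drifted by more than 1%. The pool name and the drift threshold come from his description; the JSON field names are what Luminous-era `ceph df -f json` / `get-quota -f json` report, so treat this as an illustration rather than his actual script:

```
#!/usr/bin/env bash
# Keep the fs_data quota at ~95% of what the pool could currently hold.
set -euo pipefail

POOL="fs_data"

# Effective capacity of the pool right now = bytes already stored + bytes
# still available to it (both reported per pool by `ceph df`).
read -r used avail < <(ceph df -f json |
    jq -r --arg p "$POOL" '.pools[] | select(.name == $p) |
        "\(.stats.bytes_used) \(.stats.max_avail)"')

target=$(( (used + avail) * 95 / 100 ))

current=$(ceph osd pool get-quota "$POOL" -f json | jq -r '.quota_max_bytes')

# Only touch the quota when it is unset or more than ~1% away from the target.
diff=$(( target > current ? target - current : current - target ))
if (( current == 0 || diff * 100 > target )); then
    ceph osd pool set-quota "$POOL" max_bytes "$target"
fi
```

Run from cron every 5 minutes, this keeps the quota tracking the cluster as it shrinks after a host failure or as other pools grow, which are exactly the two scenarios above.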
Re: [ceph-users] New CRUSH device class questions
> Is it possible to add a new device class like 'metadata'?

Yes, but you don't need this. Just use your existing class with another crush ruleset.

> If I set the device class manually, will it be overwritten when the OSD
> boots up?

Nope. Classes are assigned automatically when the OSD is created, not when it boots.

> I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
> mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
> the question.

But it's not magic. Sometimes a drive can be a SATA SSD, but in the kernel it is 'rotational'...

> We will still have 13 OSDs, it will be overkill for space for metadata, but
> since Ceph lacks a reserve space feature, we don't have many options. This
> cluster is so fast that it can fill up in the blink of an eye.

Not true. You can always set a per-pool quota in bytes, for example:

* your meta is 1G;
* your raw space is 300G;
* your data is 90G;

Set a quota on your data pool: `ceph osd pool set-quota <data_pool> max_bytes 96636762000`

k
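As a usage note on the command above (the `fs_data` pool name is taken from elsewhere in the thread), the quota can be inspected and removed again by setting it back to 0:

```
# Show current quotas on the data pool ("N/A" means unlimited).
ceph osd pool get-quota fs_data

# Set a byte quota, or clear it again.
ceph osd pool set-quota fs_data max_bytes 96636762000
ceph osd pool set-quota fs_data max_bytes 0
```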
Re: [ceph-users] New CRUSH device class questions
On Tue, Aug 6, 2019 at 11:11 AM Paul Emmerich wrote:

> On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc wrote:
>
> > We have a 12.2.8 Luminous cluster with all NVMe and we want to take some
> > of the NVMe OSDs and allocate them strictly to metadata pools (we have a
> > problem with filling up this cluster and causing lingering metadata
> > problems, and this will guarantee space for metadata operations).
>
> Depending on the workload and metadata size: this might be a bad idea
> as it reduces parallelism.

We will still have 13 OSDs; it will be overkill for space for metadata, but since Ceph lacks a reserve-space feature, we don't have many options. This cluster is so fast that it can fill up in the blink of an eye.

> > In the past, we have done this the old-school way of creating a separate
> > root, but I wanted to see if we could leverage the device class function
> > instead.
> >
> > Right now all our devices show as ssd rather than nvme, but that is the
> > only class in this cluster. None of the device classes were manually set,
> > so is there a reason they were not detected as nvme?
>
> Ceph only distinguishes rotational vs. non-rotational for device
> classes and device-specific configuration options.

I read https://ceph.com/community/new-luminous-crush-device-classes/ and it mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence the question.

> > Is it possible to add a new device class like 'metadata'?
>
> Yes, it's just a string, you can use whatever you want. For example,
> we run a few setups that distinguish 2.5" and 3.5" HDDs.
>
> > If I set the device class manually, will it be overwritten when the OSD
> > boots up?
>
> Not sure about this, we run all of our setups with auto-updating on
> start disabled.

Do you know if 'osd crush location hook' can specify the device class?

> > Is what I'm trying to accomplish better done by the old-school separate
> > root and the osd_crush_location_hook (potentially using a file with a list
> > of partition UUIDs that should be in the metadata pool)?
>
> Device classes are the way to go here.

Thanks!

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
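For reference on the hook Robert asks about: a CRUSH location hook is just an executable named in ceph.conf whose stdout supplies key=value CRUSH location pairs for the starting daemon. Whether it can also specify a device class is exactly the open question here, so this hypothetical sketch assumes the class would still be set separately with `ceph osd crush set-device-class`:

```
#!/bin/sh
# /usr/local/bin/crush-location-hook.sh
#
# Referenced from ceph.conf as:
#   [osd]
#   crush location hook = /usr/local/bin/crush-location-hook.sh
#
# Ceph invokes the hook as:  <hook> --cluster <name> --id <osd-id> --type osd
# and uses its stdout as the OSD's CRUSH location.

# Hypothetical layout: every OSD goes under its host in rack1 of the
# default root.
echo "host=$(hostname -s) rack=rack1 root=default"
```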
Re: [ceph-users] New CRUSH device class questions
On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc wrote:

> We have a 12.2.8 Luminous cluster with all NVMe and we want to take some of
> the NVMe OSDs and allocate them strictly to metadata pools (we have a problem
> with filling up this cluster and causing lingering metadata problems, and
> this will guarantee space for metadata operations).

Depending on the workload and metadata size: this might be a bad idea as it reduces parallelism.

> In the past, we have done this the old-school way of creating a separate
> root, but I wanted to see if we could leverage the device class function
> instead.
>
> Right now all our devices show as ssd rather than nvme, but that is the only
> class in this cluster. None of the device classes were manually set, so is
> there a reason they were not detected as nvme?

Ceph only distinguishes rotational vs. non-rotational for device classes and device-specific configuration options.

> Is it possible to add a new device class like 'metadata'?

Yes, it's just a string, you can use whatever you want. For example, we run a few setups that distinguish 2.5" and 3.5" HDDs.

> If I set the device class manually, will it be overwritten when the OSD boots
> up?

Not sure about this, we run all of our setups with auto-updating on start disabled.

> Is what I'm trying to accomplish better done by the old-school separate root
> and the osd_crush_location_hook (potentially using a file with a list of
> partition UUIDs that should be in the metadata pool)?

Device classes are the way to go here.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

> Thank you,
> Robert LeBlanc
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
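Paul's "auto-updating on start disabled" presumably refers to the `osd crush update on start` option, which controls whether an OSD rewrites its own CRUSH location when it starts; a hedged ceph.conf sketch:

```
[osd]
# Do not let OSDs update their CRUSH location on startup.
osd crush update on start = false
```

Whether a manually assigned device class would otherwise be overwritten is the part Paul says he is not sure about; Konstantin's reply elsewhere in the thread indicates classes are only assigned when an OSD is created, not on boot.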
[ceph-users] New CRUSH device class questions
We have a 12.2.8 Luminous cluster with all NVMe and we want to take some of the NVMe OSDs and allocate them strictly to metadata pools (we have a problem with filling up this cluster and causing lingering metadata problems, and this will guarantee space for metadata operations). In the past, we have done this the old-school way of creating a separate root, but I wanted to see if we could leverage the device class function instead.

Right now all our devices show as ssd rather than nvme, but that is the only class in this cluster. None of the device classes were manually set, so is there a reason they were not detected as nvme?

Is it possible to add a new device class like 'metadata'?

If I set the device class manually, will it be overwritten when the OSD boots up?

Is what I'm trying to accomplish better done by the old-school separate root and the osd_crush_location_hook (potentially using a file with a list of partition UUIDs that should be in the metadata pool)?

Any other options I may not be considering?

Thank you,
Robert LeBlanc

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1