Re: [ceph-users] New CRUSH device class questions

2019-08-12 Thread Robert LeBlanc
On Wed, Aug 7, 2019 at 7:05 AM Paul Emmerich  wrote:

> ~ is the internal implementation of device classes. Internally it's
> still using separate roots, that's how it stays compatible with older
> clients that don't know about device classes.
>

That makes sense.


> And since it wasn't mentioned here yet: consider upgrading to Nautilus
> to benefit from the new and improved accounting for metadata space.
> You'll be able to see how much space is used for metadata and quotas
> should work properly for metadata usage.
>

I think I'm not explaining this well and it is confusing people. I don't
want to limit the size of the metadata pool, and I don't want to limit the
size of the data pool either, because changes in the cluster can make a
quota out of date at any time and therefore mostly useless (we want to use
as much space as possible for data). What I would like is to reserve space
for the metadata pool so that no other pool can touch it, much like thick
provisioning a VM disk file: the space is guaranteed for that entity and no
one else can use it, even if it is mostly empty. So far people have only
told me how to limit the space of a pool, which is not what I'm looking for.

Thank you,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Paul Emmerich
On Wed, Aug 7, 2019 at 9:30 AM Robert LeBlanc  wrote:

>> # ceph osd crush rule dump replicated_racks_nvme
>> {
>>  "rule_id": 0,
>>  "rule_name": "replicated_racks_nvme",
>>  "ruleset": 0,
>>  "type": 1,
>>  "min_size": 1,
>>  "max_size": 10,
>>  "steps": [
>>  {
>>  "op": "take",
>>  "item": -44,
>>  "item_name": "default~nvme"<
>>  },
>>  {
>>  "op": "chooseleaf_firstn",
>>  "num": 0,
>>  "type": "rack"
>>  },
>>  {
>>  "op": "emit"
>>  }
>>  ]
>> }
>> ```
>
>
> Yes, our HDD cluster is much like this, but it is not Luminous, so we created
> a separate root with SSD OSDs for the metadata and set up a CRUSH rule for the
> metadata pool to be mapped to SSD. I understand that the CRUSH rule should
> have a `step take default class ssd`, which I don't see in your rule unless
> the `~` in the item_name means device class.

~ is the internal implementation of device classes. Internally it's
still using separate roots, that's how it stays compatible with older
clients that don't know about device classes.
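You can see those internal shadow roots if you are curious, e.g.:

```
# the ~ssd / ~nvme entries are the per-class shadow hierarchies
ceph osd crush tree --show-shadow
```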

And since it wasn't mentioned here yet: consider upgrading to Nautilus
to benefit from the new and improved accounting for metadata space.
You'll be able to see how much space is used for metadata and quotas
should work properly for metadata usage.
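For example, on Nautilus with BlueStore OSDs something like this shows the
breakdown (a rough pointer, exact columns depend on the release):

```
ceph osd df tree   # adds OMAP/META columns per OSD on Nautilus
ceph df detail     # per-pool usage and quotas
```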


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> Thanks
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Konstantin Shalygin

On 8/7/19 2:30 PM, Robert LeBlanc wrote:

... plus 11 more hosts just like this


Interesting. Please paste the full `ceph osd df tree`. What are your actual
NVMe models?


Yes, our HDD cluster is much like this, but it is not Luminous, so we
created a separate root with SSD OSDs for the metadata and set up a
CRUSH rule for the metadata pool to be mapped to SSD. I understand
that the CRUSH rule should have a `step take default class ssd`, which
I don't see in your rule unless the `~` in the item_name means device
class.

Indeed, this is a device class.


And a new crush rule can be created like this: `ceph osd crush rule
create-replicated <rule_name> <root> <failure_domain_type> <device_class>`;
for me it is: `ceph osd crush rule create-replicated
replicated_racks_nvme default rack nvme`.
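And then the pool is pointed at that rule, e.g. (pool name from my dump above):

```
ceph osd pool set fs_meta crush_rule replicated_racks_nvme
```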




k




Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Robert LeBlanc
On Wed, Aug 7, 2019 at 12:08 AM Konstantin Shalygin  wrote:

> On 8/7/19 1:40 PM, Robert LeBlanc wrote:
>
> > Maybe it's the lateness of the day, but I'm not sure how to do that.
> > Do you have an example where all the OSDs are of class ssd?
> I can't parse what you mean. You should always paste your `ceph osd tree`
> first.
>

Our 'ceph osd tree' is like this:
ID  CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF
 -1   892.21326 root default
 -369.16382 host sun-pcs01-osd01
  0   ssd   3.49309 osd.0up  1.0 1.0
  1   ssd   3.42329 osd.1up  0.87482 1.0
  2   ssd   3.49309 osd.2up  0.88989 1.0
  3   ssd   3.42329 osd.3up  0.94989 1.0
  4   ssd   3.49309 osd.4up  0.93993 1.0
  5   ssd   3.42329 osd.5up  1.0 1.0
  6   ssd   3.49309 osd.6up  0.89490 1.0
  7   ssd   3.42329 osd.7up  1.0 1.0
  8   ssd   3.49309 osd.8up  0.89482 1.0
  9   ssd   3.42329 osd.9up  1.0 1.0
100   ssd   3.49309 osd.100  up  1.0 1.0
101   ssd   3.42329 osd.101  up  1.0 1.0
102   ssd   3.49309 osd.102  up  1.0 1.0
103   ssd   3.42329 osd.103  up  0.81482 1.0
104   ssd   3.49309 osd.104  up  0.87973 1.0
105   ssd   3.42329 osd.105  up  0.86485 1.0
106   ssd   3.49309 osd.106  up  0.79965 1.0
107   ssd   3.42329 osd.107  up  1.0 1.0
108   ssd   3.49309 osd.108  up  1.0 1.0
109   ssd   3.42329 osd.109  up  1.0 1.0
 -562.24744 host sun-pcs01-osd02
 10   ssd   3.49309 osd.10   up  1.0 1.0
 11   ssd   3.42329 osd.11   up  0.72473 1.0
 12   ssd   3.49309 osd.12   up  1.0 1.0
 13   ssd   3.42329 osd.13   up  0.78979 1.0
 14   ssd   3.49309 osd.14   up  0.98961 1.0
 15   ssd   3.42329 osd.15   up  1.0 1.0
 16   ssd   3.49309 osd.16   up  0.96495 1.0
 17   ssd   3.42329 osd.17   up  0.94994 1.0
 18   ssd   3.49309 osd.18   up  1.0 1.0
 19   ssd   3.42329 osd.19   up  0.80481 1.0
110   ssd   3.49309 osd.110  up  0.97998 1.0
111   ssd   3.42329 osd.111  up  1.0 1.0
112   ssd   3.49309 osd.112  up  1.0 1.0
113   ssd   3.42329 osd.113  up  0.72974 1.0
116   ssd   3.49309 osd.116  up  0.91992 1.0
117   ssd   3.42329 osd.117  up  0.96997 1.0
118   ssd   3.49309 osd.118  up  0.93959 1.0
119   ssd   3.42329 osd.119  up  0.94481 1.0
... plus 11 more hosts just like this

How do you single out one OSD from each host for metadata only and keep
data off that OSD when all the device classes are the same? It seems
that you would need one OSD to be a different class to do that. In a
previous email the conversation was:

Is it possible to add a new device class like 'metadata'?

Yes, but you don't need this. Just use your existing class with another
crush ruleset.


So, I'm trying to figure out how you use the existing class of 'ssd' with
another CRUSH ruleset to accomplish the above.
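The only way I can see to do it is to move one OSD per host into its own
class and point the metadata pool at it, roughly like this (untested; the
class, rule, and pool names are just placeholders):

```
# clear the auto-assigned class first, then tag one OSD on each host
ceph osd crush rm-device-class osd.0 osd.10
ceph osd crush set-device-class metadata osd.0 osd.10
# rule that only selects the 'metadata' class, one replica per host
ceph osd crush rule create-replicated meta_rule default host metadata
ceph osd pool set cephfs_metadata crush_rule meta_rule
```

But that is no longer using the existing 'ssd' class, which is why I'm
confused.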


> > Yes, we can set quotas to limit space usage (or number of objects), but
> > you cannot reserve space that other pools can't use. The problem
> > is that if we set a quota for the CephFS data pool to the equivalent of
> > 95%, there are at least two scenarios that make that quota useless.
>
> Of course. In 95% of CephFS deployments the metadata pool is on flash drives
> with enough space for this.
>
>
> ```
>
> pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool
> stripe_width 0 application cephfs
> pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool
> stripe_width 0 application cephfs
>
> ```
>
> ```
>
> # ceph osd crush rule dump replicated_racks_nvme
> {
>  "rule_id": 0,
>  "rule_name": "replicated_racks_nvme",
>  "ruleset": 0,
>  "type": 1,
>  "min_size": 1,
>  "max_size": 10,
>  "steps": [
>  {
>  "op": "take",
>  "item": -44,
>  "item_name": "default~nvme"<
>  },
>  {
>  "op": "chooseleaf_firstn",
>  "num": 0,
>  "type": "rack"
> 

Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Konstantin Shalygin

On 8/7/19 1:40 PM, Robert LeBlanc wrote:

Maybe it's the lateness of the day, but I'm not sure how to do that. 
Do you have an example where all the OSDs are of class ssd?
I can't parse what you mean. You should always paste your `ceph osd tree`
first.


Yes, we can set quotas to limit space usage (or number of objects), but
you cannot reserve space that other pools can't use. The problem
is that if we set a quota for the CephFS data pool to the equivalent of
95%, there are at least two scenarios that make that quota useless.


Of course. In 95% of CephFS deployments the metadata pool is on flash drives
with enough space for this.



```

pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool 
stripe_width 0 application cephfs
pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool 
stripe_width 0 application cephfs


```

```

# ceph osd crush rule dump replicated_racks_nvme
{
    "rule_id": 0,
    "rule_name": "replicated_racks_nvme",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
    {
    "op": "take",
    "item": -44,
    "item_name": "default~nvme"    <
    },
    {
    "op": "chooseleaf_firstn",
    "num": 0,
    "type": "rack"
    },
    {
    "op": "emit"
    }
    ]
}
```



k



Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Robert LeBlanc
On Tue, Aug 6, 2019 at 7:56 PM Konstantin Shalygin  wrote:

> Is it possible to add a new device class like 'metadata'?
>
>
> Yes, but you don't need this. Just use your existing class with another
> crush ruleset.
>

Maybe it's the lateness of the day, but I'm not sure how to do that. Do you
have an example where all the OSDs are of class ssd?

> If I set the device class manually, will it be overwritten when the OSD
> boots up?
>
>
> Nope. Classes are assigned automatically when the OSD is created, not when
> it boots.
>

That's good to know.

> I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
> mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
> the question.
>
> But it's not magic. Sometimes a drive can be a SATA SSD, but the kernel
> reports it as 'rotational'...
>
I see, so it's not looking to see if the device is in /sys/class/pci or
something.

> We will still have 13 OSDs, which will be overkill for metadata space, but
> since Ceph lacks a reserve-space feature, we don't have many options. This
> cluster is so fast that it can fill up in the blink of an eye.
>
>
> Not true. You can always set a per-pool quota in bytes, for example:
>
> * your meta is 1G;
>
> * your raw space is 300G;
>
> * your data is 90G;
>
> Set a quota on your data pool: `ceph osd pool set-quota <data_pool>
> max_bytes 96636762000`
>
Yes, we can set quotas to limit space usage (or number of objects), but you
cannot reserve space that other pools can't use. The problem is that if we
set a quota for the CephFS data pool to the equivalent of 95%, there are at
least two scenarios that make that quota useless.

1. A host fails and the cluster recovers. The quota is now larger than the
capacity of the cluster, so if the data pool fills up, no pool can write.
2. The CephFS data pool is an erasure-coded pool, and it shares the cluster
with an RGW data pool that is 3x replicated. If more writes happen to the
RGW data pool, the quota again ends up larger than the remaining capacity.

Both of these cause metadata operations to fail to commit and cause lots
of problems with CephFS (you can't list a directory with a broken inode in
it). We would rather get a truncated file than a broken file system.

I wrote a script that calculates 95% of the pool capacity and sets the
quota if the current quota is 1% out of balance. This is run by cron every
5 minutes.
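Roughly, the idea is something like this (a simplified sketch, not the
actual script; 'fs_data' is a placeholder pool name and the JSON field
names may differ between releases):

```
#!/bin/bash
# Keep the data pool quota at ~95% of what the pool could hold right now.
POOL=fs_data
DF=$(ceph df --format=json)
USED=$(echo "$DF"  | jq ".pools[] | select(.name==\"$POOL\") | .stats.bytes_used")
AVAIL=$(echo "$DF" | jq ".pools[] | select(.name==\"$POOL\") | .stats.max_avail")
TARGET=$(( (USED + AVAIL) * 95 / 100 ))
CURRENT=$(ceph osd pool get-quota "$POOL" --format=json | jq ".quota_max_bytes")
DIFF=$(( TARGET > CURRENT ? TARGET - CURRENT : CURRENT - TARGET ))
# only rewrite the quota when it has drifted more than ~1% from the target
if [ "$DIFF" -gt $(( TARGET / 100 )) ]; then
    ceph osd pool set-quota "$POOL" max_bytes "$TARGET"
fi
```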

If there is a way to reserve some capacity for a pool that no other pool
can use, please provide an example. Think of reserved inode space in
ext4/XFS/etc.

Thank you.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Konstantin Shalygin

Is it possible to add a new device class like 'metadata'?


Yes, but you don't need this. Just use your existing class with another 
crush ruleset.




If I set the device class manually, will it be overwritten when the OSD
boots up?


Nope. Classes are assigned automatically when the OSD is created, not when
it boots.



I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
the question.


But it's not magic. Sometimes a drive can be a SATA SSD, but the kernel
reports it as 'rotational'...
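You can check what the kernel thinks for a given disk, e.g.:

```
# 1 = rotational (HDD), 0 = non-rotational (SSD/NVMe)
cat /sys/block/sda/queue/rotational
```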




We will still have 13 OSDs, which will be overkill for metadata space, but
since Ceph lacks a reserve-space feature, we don't have many options. This
cluster is so fast that it can fill up in the blink of an eye.



Not true. You can always set a per-pool quota in bytes, for example:

* your meta is 1G;

* your raw space is 300G;

* your data is 90G;

Set a quota on your data pool: `ceph osd pool set-quota <data_pool>
max_bytes 96636762000`





k





Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Robert LeBlanc
On Tue, Aug 6, 2019 at 11:11 AM Paul Emmerich wrote:

> On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc wrote:
> > We have a 12.2.8 luminous cluster with all NVMe and we want to take some
> of the NVMe OSDs and allocate them strictly to metadata pools (we have a
> problem with filling up this cluster and causing lingering metadata
> problems, and this will guarantee space for metadata operations).
>
> Depending on the workload and metadata size: this might be a bad idea
> as it reduces parallelism.
>

We will still have 13 OSDs, which will be overkill for metadata space, but
since Ceph lacks a reserve-space feature, we don't have many options. This
cluster is so fast that it can fill up in the blink of an eye.

> In the past, we have done this the old-school way of creating a separate
> root, but I wanted to see if we could leverage the device class function
> instead.
> >
> > Right now all our devices show as ssd rather than nvme, but that is the
> only class in this cluster. None of the device classes were manually set,
> so is there a reason they were not detected as nvme?
>
> Ceph only distinguishes rotational vs. non-rotational for device
> classes and device-specific configuration options
>

I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
the question.

> Is it possible to add a new device class like 'metadata'?
>
> yes, it's just a string, you can use whatever you want. For example,
> we run a few setups that distinguish 2.5" and 3.5" HDDs.
>
>
> >
> > If I set the device class manually, will it be overwritten when the OSD
> boots up?
>
> not sure about this, we run all of our setups with auto-updating on
> start disabled
>

Do you know if 'osd crush location hook' can specify the device class?

> Is what I'm trying to accomplish better done by the old-school separate
> root and the osd_crush_location_hook (potentially using a file with a list
> of partition UUIDs that should be in the metadata pool)?
>
> device classes are the way to go here


Thanks!

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Paul Emmerich
On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc  wrote:
> We have a 12.2.8 luminous cluster with all NVMe and we want to take some of 
> the NVMe OSDs and allocate them strictly to metadata pools (we have a problem 
> with filling up this cluster and causing lingering metadata problems, and 
> this will guarantee space for metadata operations).

Depending on the workload and metadata size: this might be a bad idea
as it reduces parallelism.


> In the past, we have done this the old-school way of creating a separate 
> root, but I wanted to see if we could leverage the device class function 
> instead.
>
> Right now all our devices show as ssd rather than nvme, but that is the only 
> class in this cluster. None of the device classes were manually set, so is 
> there a reason they were not detected as nvme?

Ceph only distinguishes rotational vs. non-rotational for device
classes and device-specific configuration options

>
> Is it possible to add a new device class like 'metadata'?

yes, it's just a string, you can use whatever you want. For example,
we run a few setups that distinguish 2.5" and 3.5" HDDs.
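For example (hypothetical OSD id; the class name is an arbitrary label, and
an already-assigned class has to be removed before setting a new one):

```
ceph osd crush rm-device-class osd.12
ceph osd crush set-device-class hdd25 osd.12
ceph osd crush rule create-replicated rack_hdd25 default rack hdd25
```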


>
> If I set the device class manually, will it be overwritten when the OSD boots 
> up?

not sure about this, we run all of our setups with auto-updating on
start disabled
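i.e. something like this in ceph.conf (I believe these are the relevant
options, but double-check the defaults for your release):

```
[osd]
# don't move the OSD in the CRUSH map on startup
osd crush update on start = false
# don't re-apply the auto-detected device class on startup
osd class update on start = false
```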

> Is what I'm trying to accomplish better done by the old-school separate root
> and the osd_crush_location_hook (potentially using a file with a list of
> partition UUIDs that should be in the metadata pool)?

device classes are the way to go here

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> Thank you,
> Robert LeBlanc
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


[ceph-users] New CRUSH device class questions

2019-08-06 Thread Robert LeBlanc
We have a 12.2.8 luminous cluster with all NVMe and we want to take some of
the NVMe OSDs and allocate them strictly to metadata pools (we have a
problem with filling up this cluster and causing lingering metadata
problems, and this will guarantee space for metadata operations). In the
past, we have done this the old-school way of creating a separate root, but
I wanted to see if we could leverage the device class function instead.

Right now all our devices show as ssd rather than nvme, but that is the
only class in this cluster. None of the device classes were manually set,
so is there a reason they were not detected as nvme?

Is it possible to add a new device class like 'metadata'?

If I set the device class manually, will it be overwritten when the OSD
boots up?

Is what I'm trying to accomplish better done by the old-school separate
root and the osd_crush_location_hook (potentially using a file with a list
of partition UUIDs that should be in the metadata pool)?

Any other options I may not be considering?

Thank you,
Robert LeBlanc

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1