Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread David Turner
That's right. I didn't actually use Jewel for very long. I'm glad it worked
for you.

On Fri, May 11, 2018, 4:49 PM Webert de Souza Lima wrote:

> Thanks David.
> Although you mentioned this was introduced with Luminous, it's working
> with Jewel.
>
> ~# ceph osd pool stats
>
> Fri May 11 17:41:39 2018
>
> pool rbd id 5
>   client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr
>
> pool rbd_cache id 6
>   client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
>   cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing
>
> pool cephfs_metadata id 7
>   client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr
>
> pool cephfs_data_ssd id 8
>   client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr
>
> pool cephfs_data id 9
>   client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr
>
> pool cephfs_data_cache id 10
>   client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
>   cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
>
> On Fri, May 11, 2018 at 5:14 PM David Turner wrote:
>
>> `ceph osd pool stats` with the option to specify the pool you are
>> interested in should get you the breakdown of IO per pool.  This was
>> introduced with luminous.
>>
>> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> I think ceph doesn't have IO metrics with filters by pool, right? I see
>>> IO metrics from clients only:
>>>
>>> ceph_client_io_ops
>>> ceph_client_io_read_bytes
>>> ceph_client_io_read_ops
>>> ceph_client_io_write_bytes
>>> ceph_client_io_write_ops
>>>
>>> and pool "byte" metrics, but not "io":
>>>
>>> ceph_pool(write/read)_bytes(_total)
>>>
>>> Regards,
>>>
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> *Belo Horizonte - Brasil*
>>> *IRC NICK - WebertRLZ*
>>>
>>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>>
 Hey Jon!

 On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

 Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
 sure what I should be looking at).
 My current SSD disks have 2 partitions.
  - One is used for the cephfs cache tier pool,
  - The other is used for both the cephfs metadata pool and cephfs
 data-ssd (this is an additional cephfs data pool backed only by SSDs, with a
 file layout on a specific directory so that it gets used)

 Because of this, iostat shows me peaks of 12k IOPS in the metadata
 partition, but this could definitely be IO for the data-ssd pool.


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

 I have yet to measure it the right way, but I'd assume my metadata fits
 in RAM (a few hundred MB only).

 This is an email hosting cluster with tens of thousands of users, so
 there are a lot of random reads and writes, but not too many small files.
 Email messages are concatenated together into files up to 4MB in size
 (when a rotation happens).
 Most user operations are dovecot's INDEX operations, and I will keep the
 index directory in an SSD-dedicated pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

 This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

 Saturation will only happen at peak workloads, not often. By heavy
 write I mean there are many more writes than reads, yes.
 So I think I can start by sharing the OSDs; if I find this is impacting
 performance, I can just change the ruleset and move metadata to an SSD-only
 pool, right?


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall number of
> metadata ops is so low that the device doesn't matter.
Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
Thanks David.
Although you mentioned this was introduced with Luminous, it's working with
Jewel.

~# ceph osd pool stats

Fri May 11 17:41:39 2018

pool rbd id 5
  client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr

pool rbd_cache id 6
  client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
  cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing

pool cephfs_metadata id 7
  client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr

pool cephfs_data_ssd id 8
  client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr

pool cephfs_data id 9
  client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr

pool cephfs_data_cache id 10
  client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
  cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 5:14 PM David Turner  wrote:

> `ceph osd pool stats` with the option to specify the pool you are
> interested in should get you the breakdown of IO per pool.  This was
> introduced with luminous.
>
> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> I think ceph doesn't have IO metrics with filters by pool, right? I see IO
>> metrics from clients only:
>>
>> ceph_client_io_ops
>> ceph_client_io_read_bytes
>> ceph_client_io_read_ops
>> ceph_client_io_write_bytes
>> ceph_client_io_write_ops
>>
>> and pool "byte" metrics, but not "io":
>>
>> ceph_pool(write/read)_bytes(_total)
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Hey Jon!
>>>
>>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>>
 It depends on the metadata intensity of your workload.  It might be
 quite interesting to gather some drive stats on how many IOPS are
 currently hitting your metadata pool over a week of normal activity.

>>>
>>> Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
>>> sure what I should be looking at).
>>> My current SSD disks have 2 partitions.
>>>  - One is used for the cephfs cache tier pool,
>>>  - The other is used for both the cephfs metadata pool and cephfs
>>> data-ssd (this is an additional cephfs data pool backed only by SSDs, with
>>> a file layout on a specific directory so that it gets used)
>>>
>>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>>> partition, but this could definitely be IO for the data-ssd pool.
>>>
>>>
 If you are doing large file workloads, and the metadata mostly fits in
 RAM, then the number of IOPS from the MDS can be very, very low.  On
 the other hand, if you're doing random metadata reads from a small
 file workload where the metadata does not fit in RAM, almost every
 client read could generate a read operation, and each MDS could easily
 generate thousands of ops per second.

>>>
>>> I have yet to measure it the right way, but I'd assume my metadata fits
>>> in RAM (a few hundred MB only).
>>>
>>> This is an email hosting cluster with tens of thousands of users, so
>>> there are a lot of random reads and writes, but not too many small files.
>>> Email messages are concatenated together into files up to 4MB in size
>>> (when a rotation happens).
>>> Most user operations are dovecot's INDEX operations, and I will keep the
>>> index directory in an SSD-dedicated pool.
>>>
>>>
>>>
 Isolating metadata OSDs is useful if the data OSDs are going to be
 completely saturated: metadata performance will be protected even if
 clients are hitting the data OSDs hard.

>>>
>>> This seems to be the case.
>>>
>>>
 If "heavy write" means completely saturating the cluster, then sharing
 the OSDs is risky.  If "heavy write" just means that there are more
 writes than reads, then it may be fine if the metadata workload is not
 heavy enough to make good use of SSDs.

>>>
>>> Saturation will only happen at peak workloads, not often. By heavy write
>>> I mean there are many more writes than reads, yes.
>>> So I think I can start by sharing the OSDs; if I find this is impacting
>>> performance, I can just change the ruleset and move metadata to an SSD-only
>>> pool, right?
>>>
>>>
 The way I'd summarise this is: in the general case, dedicated SSDs are
 the safe way to go -- they're intrinsically better suited to metadata.
 However, in some quite common special cases, the overall number of
 metadata ops is so low that the device doesn't matter.
>>>
>>>
>>>
>>> Thank you very much John!
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> Belo Horizonte - Brasil
>>> IRC NICK - WebertRLZ
>>>

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread David Turner
`ceph osd pool stats` with the option to specify the pool you are
interested in should get you the breakdown of IO per pool.  This was
introduced with luminous.
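
For example (the pool name below is just a placeholder; substitute whichever
pool you care about):

~# ceph osd pool stats cephfs_metadata
~# watch -n 5 'ceph osd pool stats cephfs_metadata'

That prints the same per-pool block as the bare command, limited to the named
pool, so you get its client op/s and throughput on their own.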

On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima wrote:

> I think ceph doesn't have IO metrics with filters by pool, right? I see IO
> metrics from clients only:
>
> ceph_client_io_ops
> ceph_client_io_read_bytes
> ceph_client_io_read_ops
> ceph_client_io_write_bytes
> ceph_client_io_write_ops
>
> and pool "byte" metrics, but not "io":
>
> ceph_pool(write/read)_bytes(_total)
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima wrote:
>
>> Hey Jon!
>>
>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>
>>> It depends on the metadata intensity of your workload.  It might be
>>> quite interesting to gather some drive stats on how many IOPS are
>>> currently hitting your metadata pool over a week of normal activity.
>>>
>>
>> Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
>> sure what I should be looking at).
>> My current SSD disks have 2 partitions.
>>  - One is used for the cephfs cache tier pool,
>>  - The other is used for both the cephfs metadata pool and cephfs data-ssd
>> (this is an additional cephfs data pool backed only by SSDs, with a file
>> layout on a specific directory so that it gets used)
>>
>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>> partition, but this could definitely be IO for the data-ssd pool.
>>
>>
>>> If you are doing large file workloads, and the metadata mostly fits in
>>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>>> the other hand, if you're doing random metadata reads from a small
>>> file workload where the metadata does not fit in RAM, almost every
>>> client read could generate a read operation, and each MDS could easily
>>> generate thousands of ops per second.
>>>
>>
>> I have yet to measure it the right way, but I'd assume my metadata fits in
>> RAM (a few hundred MB only).
>>
>> This is an email hosting cluster with tens of thousands of users, so
>> there are a lot of random reads and writes, but not too many small files.
>> Email messages are concatenated together into files up to 4MB in size (when
>> a rotation happens).
>> Most user operations are dovecot's INDEX operations, and I will keep the
>> index directory in an SSD-dedicated pool.
>>
>>
>>
>>> Isolating metadata OSDs is useful if the data OSDs are going to be
>>> completely saturated: metadata performance will be protected even if
>>> clients are hitting the data OSDs hard.
>>>
>>
>> This seems to be the case.
>>
>>
>>> If "heavy write" means completely saturating the cluster, then sharing
>>> the OSDs is risky.  If "heavy write" just means that there are more
>>> writes than reads, then it may be fine if the metadata workload is not
>>> heavy enough to make good use of SSDs.
>>>
>>
>> Saturation will only happen at peak workloads, not often. By heavy write
>> I mean there are many more writes than reads, yes.
>> So I think I can start by sharing the OSDs; if I find this is impacting
>> performance, I can just change the ruleset and move metadata to an SSD-only
>> pool, right?
>>
>>
>>> The way I'd summarise this is: in the general case, dedicated SSDs are
>>> the safe way to go -- they're intrinsically better suited to metadata.
>>> However, in some quite common special cases, the overall number of
>>> metadata ops is so low that the device doesn't matter.
>>
>>
>>
>> Thank you very much John!
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>> IRC NICK - WebertRLZ
>>


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
I think ceph doesn't have IO metrics with filters by pool, right? I see IO
metrics from clients only:

ceph_client_io_ops
ceph_client_io_read_bytes
ceph_client_io_read_ops
ceph_client_io_write_bytes
ceph_client_io_write_ops

and pool "byte" metrics, but not "io":

ceph_pool(write/read)_bytes(_total)
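
A possible stopgap until per-pool op metrics show up would be scraping the CLI
directly, e.g. (treat this as a sketch; field names can differ between
releases):

~# ceph osd pool stats -f json-pretty

which reports per-pool client IO rates in machine-readable form.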

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima wrote:

> Hey Jon!
>
> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>
>> It depends on the metadata intensity of your workload.  It might be
>> quite interesting to gather some drive stats on how many IOPS are
>> currently hitting your metadata pool over a week of normal activity.
>>
>
> Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
> sure what I should be looking at).
> My current SSD disks have 2 partitions.
>  - One is used for the cephfs cache tier pool,
>  - The other is used for both the cephfs metadata pool and cephfs data-ssd
> (this is an additional cephfs data pool backed only by SSDs, with a file
> layout on a specific directory so that it gets used)
>
> Because of this, iostat shows me peaks of 12k IOPS in the metadata
> partition, but this could definitely be IO for the data-ssd pool.
>
>
>> If you are doing large file workloads, and the metadata mostly fits in
>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>> the other hand, if you're doing random metadata reads from a small
>> file workload where the metadata does not fit in RAM, almost every
>> client read could generate a read operation, and each MDS could easily
>> generate thousands of ops per second.
>>
>
> I have yet to measure it the right way, but I'd assume my metadata fits in
> RAM (a few hundred MB only).
>
> This is an email hosting cluster with tens of thousands of users, so
> there are a lot of random reads and writes, but not too many small files.
> Email messages are concatenated together into files up to 4MB in size (when
> a rotation happens).
> Most user operations are dovecot's INDEX operations, and I will keep the
> index directory in an SSD-dedicated pool.
>
>
>
>> Isolating metadata OSDs is useful if the data OSDs are going to be
>> completely saturated: metadata performance will be protected even if
>> clients are hitting the data OSDs hard.
>>
>
> This seems to be the case.
>
>
>> If "heavy write" means completely saturating the cluster, then sharing
>> the OSDs is risky.  If "heavy write" just means that there are more
>> writes than reads, then it may be fine if the metadata workload is not
>> heavy enough to make good use of SSDs.
>>
>
> Saturation will only happen at peak workloads, not often. By heavy write I
> mean there are many more writes than reads, yes.
> So I think I can start by sharing the OSDs; if I find this is impacting
> performance, I can just change the ruleset and move metadata to an SSD-only
> pool, right?
>
>
>> The way I'd summarise this is: in the general case, dedicated SSDs are
>> the safe way to go -- they're intrinsically better suited to metadata.
>> However, in some quite common special cases, the overall number of
>> metadata ops is so low that the device doesn't matter.
>
>
>
> Thank you very much John!
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
Hey Jon!

On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
sure what I should be looking at).
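
For instance (daemon names are placeholders, and these need to run on the host
that owns the daemon's admin socket), maybe something like:

~# ceph daemonperf mds.mds1        # live perf-counter columns for one MDS
~# ceph daemon osd.12 perf dump    # raw counter dump for one metadata-pool OSD
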
My current SSD disks have 2 partitions.
 - One is used for the cephfs cache tier pool,
 - The other is used for both the cephfs metadata pool and cephfs data-ssd
(this is an additional cephfs data pool backed only by SSDs, with a file
layout on a specific directory so that it gets used; see the sketch just below)
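
For reference, that per-directory layout is set roughly like this (mount point
and path are placeholders; the pool must already be attached to the filesystem,
e.g. with ceph fs add_data_pool):

~# setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/indexes
~# getfattr -n ceph.dir.layout /mnt/cephfs/indexes    # check it took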

Because of this, iostat shows me peaks of 12k IOPS in the metadata
partition, but this could definitely be IO for the data-ssd pool.


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

I have yet to measure it the right way, but I'd assume my metadata fits in
RAM (a few hundred MB only).

This is an email hosting cluster with tens of thousands of users, so there
are a lot of random reads and writes, but not too many small files.
Email messages are concatenated together into files up to 4MB in size (when a
rotation happens).
Most user operations are dovecot's INDEX operations, and I will keep the index
directory in an SSD-dedicated pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

Saturation will only happen at peak workloads, not often. By heavy write I
mean there are many more writes than reads, yes.
So I think I can start by sharing the OSDs; if I find this is impacting
performance, I can just change the ruleset and move metadata to an SSD-only
pool, right?
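
Concretely, the move I have in mind would be roughly the following (Luminous
device-class syntax; on Jewel the rule would have to go into the decompiled
CRUSH map and be applied via crush_ruleset instead, so treat this as a sketch,
not a tested recipe):

~# ceph osd crush rule create-replicated ssd-only default host ssd
~# ceph osd pool set cephfs_metadata crush_rule ssd-only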


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall number of
> metadata ops is so low that the device doesn't matter.



Thank you very much John!
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread John Spray
On Wed, May 9, 2018 at 3:32 PM, Webert de Souza Lima wrote:
> Hello,
>
> Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used for
> cephfs-metadata, and HDD-only pools for cephfs-data. The current
> metadata/data ratio is something like 0.25% (50GB metadata for 20TB data).
>
> Regarding bluestore architecture, assuming I have:
>
>  - SSDs for WAL+DB
>  - Spinning Disks for bluestore data.
>
> would you still recommend storing metadata on SSD-only OSD nodes?
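
For concreteness, that layout would be provisioned per OSD with something like
the following (device paths are placeholders, and ceph-volume only arrived with
Luminous; earlier releases used ceph-disk):

~# ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

one such command per data disk, each pointing --block.db at its own SSD
partition; the WAL co-locates with the DB unless given a separate --block.wal.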

It depends on the metadata intensity of your workload.  It might be
quite interesting to gather some drive stats on how many IOPS are
currently hitting your metadata pool over a week of normal activity.

The primary reason for using SSDs for metadata is the cost-per-IOP.
SSDs are generally cheaper per operation than HDDs, so if you've got
enough IOPS to occupy an SSD then it's a no-brainer cost saving to use
SSDs (performance benefits are just a bonus).

If you are doing large file workloads, and the metadata mostly fits in
RAM, then the number of IOPS from the MDS can be very, very low.  On
the other hand, if you're doing random metadata reads from a small
file workload where the metadata does not fit in RAM, almost every
client read could generate a read operation, and each MDS could easily
generate thousands of ops per second.
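
As a sizing sketch (option names differ by release and the MDS name is a
placeholder), the amount of metadata the MDS keeps in RAM is bounded by its
cache setting, e.g.:

~# ceph tell mds.mds1 injectargs '--mds_cache_memory_limit=4294967296'

on Luminous and later (a byte limit); pre-Luminous releases use the inode-count
knob mds_cache_size instead. So whether the metadata "mostly fits in RAM" is
partly a tuning decision.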

> If not, is it recommended to dedicate some OSDs (Spindle+SSD for WAL/DB) for
> cephfs-metadata?

Isolating metadata OSDs is useful if the data OSDs are going to be
completely saturated: metadata performance will be protected even if
clients are hitting the data OSDs hard.

However, if your OSDs outnumber clients such that the clients couldn't
possibly saturate the OSDs, then you don't have this issue.

> If I just have 2 pools (metadata and data) all sharing the same OSDs in the
> cluster, would it be enough for heavy-write cases?

If "heavy write" means completely saturating the cluster, then sharing
the OSDs is risky.  If "heavy write" just means that there are more
writes than reads, then it may be fine if the metadata workload is not
heavy enough to make good use of SSDs.

The way I'd summarise this is: in the general case, dedicated SSDs are
the safe way to go -- they're intrinsically better suited to metadata.
However, in some quite common special cases, the overall number of
metadata ops is so low that the device doesn't matter.

John

> Assuming min_size=2, size=3.
>
> Thanks for your thoughts.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com