[ceph-users] nfs-ganesha FSAL CephFS: nfs_health :DBUS :WARN :Health status is unhealthy

2018-09-10 Thread Kevin Olbrich
Hi!

Today one of our nfs-ganesha gateways experienced an outage and has since
crashed every time the client behind it tries to access the data.
This is a Ceph Mimic cluster with nfs-ganesha from ceph-repos:

nfs-ganesha-2.6.2-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.2-0.1.el7.x86_64

There were fixes for this problem in 2.6.3:
https://github.com/nfs-ganesha/nfs-ganesha/issues/339

Could the packages in the repos be rebuilt against this bugfix release?

Thank you very much.

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] omap vs. xattr in librados

2018-09-10 Thread Benjamin Cherian
Hi,

I'm interested in writing a relatively simple application that would use
librados for storage. Are there recommendations for when to use omap as
opposed to xattrs? In theory, you could use either a set of xattrs or an
omap as a kv store associated with a specific object. Are there
recommendations for what kind of data xattrs and omaps are intended to
store?

Just for background, I have some metadata I'd like to associate with each
object (the total size of all kv pairs in the object metadata is ~250 KB; some
values are a few bytes, while others are 10-20 KB). The object will store the
actual data (a relatively large FP array) as a binary blob (~3-5 MB).
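
(For what it's worth, a minimal sketch of the two mechanisms using the rados
CLI rather than librados proper; the pool and object names are made up, and
librados exposes the equivalent xattr and omap calls:)

rados -p testpool create obj1
rados -p testpool setxattr obj1 myattr myvalue     # xattr key/value on the object
rados -p testpool setomapval obj1 mykey myvalue    # omap key/value on the same object
rados -p testpool listxattr obj1
rados -p testpool listomapvals obj1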

Thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
> Nelson
> Sent: 10 September 2018 18:27
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore DB size and onode count
> 
> On 09/10/2018 12:22 PM, Igor Fedotov wrote:
> 
> > Hi Nick.
> >
> >
> > On 9/10/2018 1:30 PM, Nick Fisk wrote:
> >> If anybody has 5 minutes could they just clarify a couple of things
> >> for me
> >>
> >> 1. onode count, should this be equal to the number of objects stored
> >> on the OSD?
> >> Through reading several posts, there seems to be a general indication
> >> that this is the case, but looking at my OSD's the maths don't
> >> work.
> > onode_count is the number of onodes in the cache, not the total number
> > of onodes at an OSD.
> > Hence the difference...

Ok, thanks, that makes sense. I assume there isn't actually a counter which 
gives you the total number of objects on an OSD then?

> >>
> >> Eg.
> >> ceph osd df
> >> ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS
> >>   0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115
> >>
> >> So 3TB OSD, roughly half full. This is pure RBD workload (no
> >> snapshots or anything clever) so let's assume worst-case scenario of
> >> 4MB objects (Compression is on however, which would only mean more
> >> objects for given size)
> >> 1347000/4=~336750 expected objects
> >>
> >> sudo ceph daemon osd.0 perf dump | grep blue
> >>  "bluefs": {
> >>  "bluestore": {
> >>  "bluestore_allocated": 1437813964800,
> >>  "bluestore_stored": 2326118994003,
> >>  "bluestore_compressed": 445228558486,
> >>  "bluestore_compressed_allocated": 547649159168,
> >>  "bluestore_compressed_original": 1437773843456,
> >>  "bluestore_onodes": 99022,
> >>  "bluestore_onode_hits": 18151499,
> >>  "bluestore_onode_misses": 4539604,
> >>  "bluestore_onode_shard_hits": 10596780,
> >>  "bluestore_onode_shard_misses": 4632238,
> >>  "bluestore_extents": 896365,
> >>  "bluestore_blobs": 861495,
> >>
> >> 99022 onodes, anyone care to enlighten me?
> >>
> >> 2. block.db Size
> >> sudo ceph daemon osd.0 perf dump | grep db
> >>  "db_total_bytes": 8587829248,
> >>  "db_used_bytes": 2375024640,
> >>
> >> 2.3GB=0.17% of data size. This seems a lot lower than the 1%
> >> recommendation (10GB for every 1TB) or 4% given in the official docs. I
> >> know that different workloads will have differing overheads and
> >> potentially smaller objects. But am I understanding these figures
> >> correctly as they seem dramatically lower?
> > Just in case - is slow_used_bytes equal to 0? Some DB data might
> > reside on the slow device if spillover has happened, which doesn't
> > require the DB volume to be full - that's by RocksDB's design.
> >
> > And recommended numbers are a bit... speculative. So it's quite
> > possible that your numbers are absolutely adequate.
> 
> FWIW, these are the numbers I came up with after examining the SST files
> generated under different workloads:
> 
> https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing
> 

Thanks for your input, Mark and Igor. Mark, I can see your RBD figures aren't too
far off mine, so all looks to be as expected.

> >>
> >> Regards,
> >> Nick
> >>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Igor Fedotov




On 9/10/2018 8:26 PM, Mark Nelson wrote:

On 09/10/2018 12:22 PM, Igor Fedotov wrote:

Just in case - is slow_used_bytes equal to 0? Some DB data might
reside on the slow device if spillover has happened, which doesn't
require the DB volume to be full - that's by RocksDB's design.


And recommended numbers are a bit... speculative. So it's quite
possible that your numbers are absolutely adequate.


FWIW, these are the numbers I came up with after examining the SST 
files generated under different workloads:


https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing 



Sorry, Mark. 'Speculative' is a bit too strong a word... I meant that a
two-parameter sizing model describing such a complex system as Ceph
might tend to produce quite inaccurate results often enough...




Regards,
Nick



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd on CentOS

2018-09-10 Thread Ilya Dryomov
On Mon, Sep 10, 2018 at 7:46 PM David Turner  wrote:
>
> Now that you mention it, I remember those threads on the ML.  What happens if 
> you use --yes-i-really-mean-it to do those things and then later you try to 
> map an RBD with an older kernel for CentOS 7.3 or 7.4?  Will that mapping 
> fail because of the min-client-version of luminous set on the cluster while 
> allowing CentOS 7.5 clients to map RBDs?

Yes, more or less.

If you _just_ set the require-min-compat-client setting, nothing will
change.  It's there to prevent you from accidentally locking out older
clients by enabling some new feature.  You will continue to be able to
map images with both old and new kernels.

If you then go ahead and install an upmap exception (manually or via
the balancer module), you will no longer be able to map images with old
kernels.

This applies to all RADOS clients, not just the kernel client.
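
(For anyone following along, a rough sketch of the relevant commands on a
Luminous cluster; treat it as illustrative:)

ceph osd dump | grep require_min_compat_client   # current setting
ceph features                                    # release/feature bits of connected clients
ceph osd set-require-min-compat-client luminous  # needed before upmap entries can be installed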

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption issue with "rbd export-diff/import-diff"

2018-09-10 Thread Patrick.Mclean
On 2018-09-10 11:04:20-07:00 Jason Dillaman wrote:


On Mon, Sep 10, 2018 at 1:35 PM  wrote:
> We utilize Ceph RBDs for our users' storage and need to keep data 
> synchronized across data centres. For this we rely on 'rbd export-diff / 
> import-diff'. Lately we have been noticing cases in which the file system on 
> the 'destination RBD' is corrupt. We have been trying to isolate the issue, 
> which may or may not be due to Ceph. We suspect the problem could be in 'rbd 
> export-diff / import-diff' and are wondering if people have been seeing 
> issues with these tools. Let me explain our use case and issue in more detail.
> We have a number of data centres each with a Ceph cluster storing tens of 
> thousands of RBDs. We maintain extra copies of each RBD in other data 
> centres. After we are 'done' using a RBD, we create a snapshot and use 'rbd 
> export-diff' to create a diff between the most recent 'common' snapshot at 
> the other data center. We send the data over the network, and use 'rbd 
> import-diff' on the destination. When we apply a diff to a destination RBD we 
> can guarantee its 'HEAD' is clean. Of course we guarantee that an RBD is only 
> used in one data centre at a time.
> We noticed corruption at the destination RBD based on fsck failures, further 
> investigation showed that checksums on the RBD mismatch as well. Somehow the 
> data is sometimes getting corrupted either by our software or 'rbd 
> export-diff / import-diff'. Our investigation suggests that the problem
> is in 'rbd export-diff/import-diff'. The main evidence of this is that 
> occasionally we sync an RBD between multiple data centres. Each sync is a 
> separate job with its own 'rbd export-diff'. We noticed that both destination 
> locations have the same corruption (and the same checksum) and the source is 
> healthy.

Any chance you are using OSD tiering on your RBD pool? The
export-diffs from a cache tier pool are almost guaranteed to be
corrupt if that's the case since the cache tier provides incorrect
object diff stats [1].

No, we are not using any OSD tiering in our pools.

> In addition to this, we are seeing a similar type of corruption in another 
> use case when we migrate RBDs and snapshots across pools. In this case we 
> clone a version of an RBD (e.g. HEAD-3) to a new pool and rely on 'rbd 
> export-diff/import-diff' to restore the last 3 snapshots on top. Here too we 
> see cases of fsck and RBD checksum failures.
> We maintain various metrics and logs. Looking back at our data we have seen 
> the issue at a small scale for a while on Jewel, but the frequency increased 
> recently. The timing may have coincided with a move to Luminous, but this may 
> be coincidence. We are currently on Ceph 12.2.5.
> We are wondering if people are experiencing similar issues with 'rbd 
> export-diff / import-diff'. I'm sure many people use it to keep backups in 
> sync. Since it is backups, many people may not inspect the data often. In our 
> use case, we use this mechanism to keep data in sync and actually need the 
> data in the other location often. We are wondering if anyone else has 
> encountered any issues, it's quite possible that many people may have this 
> issue, but simply don't realize it. We are likely hitting it much more 
> frequently due to the scale of our operation (tens of thousands of syncs a 
> day).

If you are able to recreate this reliably without tiering, it would
assist in debugging if you could capture RBD debug logs during the
export along w/ the LBA of the filesystem corruption to compare
against.


We haven't been able to reproduce this reliably as of yet; we haven't actually
figured out the exact conditions that cause it, we have just been seeing it
happen on some percentage of export/import-diff operations.

We will investigate ways to capture debug logs of the export operations, and
will record the LBAs of the filesystem corruption when it occurs.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data corruption issue with "rbd export-diff/import-diff"

2018-09-10 Thread Jason Dillaman
On Mon, Sep 10, 2018 at 1:35 PM  wrote:
>
> Hi,
> We utilize Ceph RBDs for our users' storage and need to keep data 
> synchronized across data centres. For this we rely on 'rbd export-diff / 
> import-diff'. Lately we have been noticing cases in which the file system on 
> the 'destination RBD' is corrupt. We have been trying to isolate the issue, 
> which may or may not be due to Ceph. We suspect the problem could be in 'rbd 
> export-diff / import-diff' and are wondering if people have been seeing 
> issues with these tools. Let me explain our use case and issue in more detail.
> We have a number of data centres each with a Ceph cluster storing tens of 
> thousands of RBDs. We maintain extra copies of each RBD in other data 
> centres. After we are 'done' using a RBD, we create a snapshot and use 'rbd 
> export-diff' to create a diff between the most recent 'common' snapshot at 
> the other data center. We send the data over the network, and use 'rbd 
> import-diff' on the destination. When we apply a diff to a destination RBD we 
> can guarantee its 'HEAD' is clean. Of course we guarantee that an RBD is only 
> used in one data centre at a time.
> We noticed corruption at the destination RBD based on fsck failures, further 
> investigation showed that checksums on the RBD mismatch as well. Somehow the 
> data is sometimes getting corrupted either by our software or 'rbd 
> export-diff / import-diff'. Our investigation suggests that the problem
> is in 'rbd export-diff/import-diff'. The main evidence of this is that 
> occasionally we sync an RBD between multiple data centres. Each sync is a 
> separate job with its own 'rbd export-diff'. We noticed that both destination 
> locations have the same corruption (and the same checksum) and the source is 
> healthy.

Any chance you are using OSD tiering on your RBD pool? The
export-diffs from a cache tier pool are almost guaranteed to be
corrupt if that's the case since the cache tier provides incorrect
object diff stats [1].

> In addition to this, we are seeing a similar type of corruption in another 
> use case when we migrate RBDs and snapshots across pools. In this case we 
> clone a version of an RBD (e.g. HEAD-3) to a new pool and rely on 'rbd 
> export-diff/import-diff' to restore the last 3 snapshots on top. Here too we 
> see cases of fsck and RBD checksum failures.
> We maintain various metrics and logs. Looking back at our data we have seen 
> the issue at a small scale for a while on Jewel, but the frequency increased 
> recently. The timing may have coincided with a move to Luminous, but this may 
> be coincidence. We are currently on Ceph 12.2.5.
> We are wondering if people are experiencing similar issues with 'rbd 
> export-diff / import-diff'. I'm sure many people use it to keep backups in 
> sync. Since it is backups, many people may not inspect the data often. In our 
> use case, we use this mechanism to keep data in sync and actually need the 
> data in the other location often. We are wondering if anyone else has 
> encountered any issues, it's quite possible that many people may have this 
> issue, but simply don't realize it. We are likely hitting it much more 
> frequently due to the scale of our operation (tens of thousands of syncs a 
> day).

If you are able to recreate this reliably without tiering, it would
assist in debugging if you could capture RBD debug logs during the
export along w/ the LBA of the filesystem corruption to compare
against.
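
(One possible way to capture such a log, assuming the generic Ceph logging
options are accepted by the rbd CLI; snapshot names and paths below are
placeholders:)

rbd export-diff --from-snap sync1 pool/image@sync2 /tmp/delta \
    --debug-rbd=20 --debug-ms=1 --log-file=/tmp/rbd-export-diff.log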

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[1] http://tracker.ceph.com/issues/20896

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] data corruption issue with "rbd export-diff/import-diff"

2018-09-10 Thread Patrick.Mclean
Hi,
We utilize Ceph RBDs for our users' storage and need to keep data synchronized 
across data centres. For this we rely on 'rbd export-diff / import-diff'. 
Lately we have been noticing cases in which the file system on the 'destination 
RBD' is corrupt. We have been trying to isolate the issue, which may or may not 
be due to Ceph. We suspect the problem could be in 'rbd export-diff / 
import-diff' and are wondering if people have been seeing issues with these 
tools. Let me explain our use case and issue in more detail.
We have a number of data centres each with a Ceph cluster storing tens of 
thousands of RBDs. We maintain extra copies of each RBD in other data centres. 
After we are 'done' using a RBD, we create a snapshot and use 'rbd export-diff' 
to create a diff between the most recent 'common' snapshot at the other data 
center. We send the data over the network, and use 'rbd import-diff' on the 
destination. When we apply a diff to a destination RBD we can guarantee its 
'HEAD' is clean. Of course we guarantee that an RBD is only used in one data 
centre at a time.
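
(Roughly, the per-image sync pipeline looks like the following; the image,
snapshot and host names are illustrative, not our actual tooling:)

rbd snap create pool/image@sync2
rbd export-diff --from-snap sync1 pool/image@sync2 - | \
    ssh other-dc rbd import-diff - pool/image
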
We noticed corruption at the destination RBD based on fsck failures, further 
investigation showed that checksums on the RBD mismatch as well. Somehow the 
data is sometimes getting corrupted either by our software or 'rbd export-diff 
/ import-diff'. Our investigation suggests that the problem is in 'rbd 
export-diff/import-diff'. The main evidence of this is that occasionally we 
sync an RBD between multiple data centres. Each sync is a separate job with its 
own 'rbd export-diff'. We noticed that both destination locations have the same 
corruption (and the same checksum) and the source is healthy.
In addition to this, we are seeing a similar type of corruption in another use 
case when we migrate RBDs and snapshots across pools. In this case we clone a 
version of an RBD (e.g. HEAD-3) to a new pool and rely on 'rbd 
export-diff/import-diff' to restore the last 3 snapshots on top. Here too we 
see cases of fsck and RBD checksum failures.
We maintain various metrics and logs. Looking back at our data we have seen the 
issue at a small scale for a while on Jewel, but the frequency increased 
recently. The timing may have coincided with a move to Luminous, but this may 
be coincidence. We are currently on Ceph 12.2.5.
We are wondering if people are experiencing similar issues with 'rbd 
export-diff / import-diff'. I'm sure many people use it to keep backups in 
sync. Since it is backups, many people may not inspect the data often. In our 
use case, we use this mechanism to keep data in sync and actually need the data 
in the other location often. We are wondering if anyone else has encountered 
any issues; it's quite possible that many people have this issue but simply
don't realize it. We are likely hitting it much more frequently due to the 
scale of our operation (tens of thousands of syncs a day).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd on CentOS

2018-09-10 Thread David Turner
Now that you mention it, I remember those threads on the ML.  What happens
if you use --yes-i-really-mean-it to do those things and then later you try
to map an RBD with an older kernel for CentOS 7.3 or 7.4?  Will that
mapping fail because of the min-client-version of luminous set on the
cluster while allowing CentOS 7.5 clients to map RBDs?

On Mon, Sep 10, 2018 at 1:33 PM Ilya Dryomov  wrote:

> On Mon, Sep 10, 2018 at 7:19 PM David Turner 
> wrote:
> >
> > I haven't found any mention of this on the ML and Google's results are
> all about compiling your own kernel to use NBD on CentOS. Is everyone
> that's using rbd-nbd on CentOS honestly compiling their own kernels for the
> clients? This feels like something that shouldn't be necessary anymore.
> >
> > I would like to use the balancer module with upmap, but can't do that
> with kRBD because even the latest kernels still register as Jewel. What
> have y'all done to use rbd-nbd on CentOS? I'm hoping I'm missing something
> and not that I'll need to compile a kernel to use on all of the hosts that
> I want to map RBDs to.
>
> FWIW upmap is fully supported since 4.13 and RHEL 7.5:
>
>   https://www.spinics.net/lists/ceph-users/msg45071.html
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029105.html
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd-nbd on CentOS

2018-09-10 Thread Ilya Dryomov
On Mon, Sep 10, 2018 at 7:19 PM David Turner  wrote:
>
> I haven't found any mention of this on the ML and Google's results are all 
> about compiling your own kernel to use NBD on CentOS. Is everyone that's 
> using rbd-nbd on CentOS honestly compiling their own kernels for the clients? 
> This feels like something that shouldn't be necessary anymore.
>
> I would like to use the balancer module with upmap, but can't do that with 
> kRBD because even the latest kernels still register as Jewel. What have y'all 
> done to use rbd-nbd on CentOS? I'm hoping I'm missing something and not that 
> I'll need to compile a kernel to use on all of the hosts that I want to map 
> RBDs to.

FWIW upmap is fully supported since 4.13 and RHEL 7.5:

  https://www.spinics.net/lists/ceph-users/msg45071.html
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029105.html
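
(For completeness, turning the balancer on in upmap mode looks roughly like
this on a Luminous cluster; adjust to taste:)

ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on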

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Mark Nelson

On 09/10/2018 12:22 PM, Igor Fedotov wrote:


Hi Nick.


On 9/10/2018 1:30 PM, Nick Fisk wrote:
If anybody has 5 minutes could they just clarify a couple of things 
for me


1. onode count, should this be equal to the number of objects stored 
on the OSD?
Through reading several posts, there seems to be a general indication 
that this is the case, but looking at my OSDs the maths don't work.
onode_count is the number of onodes in the cache, not the total number 
of onodes at an OSD.

Hence the difference...


Eg.
ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL  %USE  VAR  PGS
  0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115

So 3TB OSD, roughly half full. This is pure RBD workload (no 
snapshots or anything clever) so let's assume worst-case scenario of
4MB objects (Compression is on however, which would only mean more 
objects for given size)

1347000/4=~336750 expected objects

sudo ceph daemon osd.0 perf dump | grep blue
 "bluefs": {
 "bluestore": {
 "bluestore_allocated": 1437813964800,
 "bluestore_stored": 2326118994003,
 "bluestore_compressed": 445228558486,
 "bluestore_compressed_allocated": 547649159168,
 "bluestore_compressed_original": 1437773843456,
 "bluestore_onodes": 99022,
 "bluestore_onode_hits": 18151499,
 "bluestore_onode_misses": 4539604,
 "bluestore_onode_shard_hits": 10596780,
 "bluestore_onode_shard_misses": 4632238,
 "bluestore_extents": 896365,
 "bluestore_blobs": 861495,

99022 onodes, anyone care to enlighten me?

2. block.db Size
sudo ceph daemon osd.0 perf dump | grep db
 "db_total_bytes": 8587829248,
 "db_used_bytes": 2375024640,

2.3GB=0.17% of data size. This seems a lot lower than the 1% 
recommendation (10GB for every 1TB) or 4% given in the official docs. I
know that different workloads will have differing overheads and 
potentially smaller objects. But am I understanding these figures

correctly as they seem dramatically lower?
Just in case - is slow_used_bytes equal to 0? Some DB data might
reside on the slow device if spillover has happened, which doesn't
require the DB volume to be full - that's by RocksDB's design.


And recommended numbers are a bit... speculative. So it's quite
possible that your numbers are absolutely adequate.


FWIW, these are the numbers I came up with after examining the SST files 
generated under different workloads:


https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing



Regards,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Igor Fedotov

Hi Nick.


On 9/10/2018 1:30 PM, Nick Fisk wrote:

If anybody has 5 minutes could they just clarify a couple of things for me

1. onode count, should this be equal to the number of objects stored on the OSD?
Through reading several posts, there seems to be a general indication that this 
is the case, but looking at my OSDs the maths don't
work.
onode_count is the number of onodes in the cache, not the total number 
of onodes at an OSD.

Hence the difference...


Eg.
ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS
  0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115

So 3TB OSD, roughly half full. This is pure RBD workload (no snapshots or 
anything clever) so let's assume worst-case scenario of
4MB objects (Compression is on however, which would only mean more objects for 
given size)
1347000/4=~336750 expected objects

sudo ceph daemon osd.0 perf dump | grep blue
 "bluefs": {
 "bluestore": {
 "bluestore_allocated": 1437813964800,
 "bluestore_stored": 2326118994003,
 "bluestore_compressed": 445228558486,
 "bluestore_compressed_allocated": 547649159168,
 "bluestore_compressed_original": 1437773843456,
 "bluestore_onodes": 99022,
 "bluestore_onode_hits": 18151499,
 "bluestore_onode_misses": 4539604,
 "bluestore_onode_shard_hits": 10596780,
 "bluestore_onode_shard_misses": 4632238,
 "bluestore_extents": 896365,
 "bluestore_blobs": 861495,

99022 onodes, anyone care to enlighten me?

2. block.db Size
sudo ceph daemon osd.0 perf dump | grep db
 "db_total_bytes": 8587829248,
 "db_used_bytes": 2375024640,

2.3GB=0.17% of data size. This seems a lot lower than the 1% recommendation 
(10GB for every 1TB) or 4% given in the official docs. I
know that different workloads will have differing overheads and potentially 
smaller objects. But am I understanding these figures
correctly as they seem dramatically lower?
Just in case - is slow_used_bytes equal to 0? Some DB data might reside
on the slow device if spillover has happened, which doesn't require the DB
volume to be full - that's by RocksDB's design.
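
(A quick way to check, assuming osd.0; a non-zero slow_used_bytes means
spillover has happened:)

ceph daemon osd.0 perf dump | grep -E '"(db|wal|slow)_(total|used)_bytes"'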


And recommended numbers are a bit... speculative. So it's quite possible
that your numbers are absolutely adequate.


Regards,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd-nbd on CentOS

2018-09-10 Thread David Turner
I haven't found any mention of this on the ML and Google's results are all
about compiling your own kernel to use NBD on CentOS. Is everyone that's
using rbd-nbd on CentOS honestly compiling their own kernels for the
clients? This feels like something that shouldn't be necessary anymore.

I would like to use the balancer module with upmap, but can't do that with
kRBD because even the latest kernels still register as Jewel. What have
y'all done to use rbd-nbd on CentOS? I'm hoping I'm missing something and
not that I'll need to compile a kernel to use on all of the hosts that I
want to map RBDs to.
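
(For reference, a quick check of whether the running kernel ships the nbd
module that rbd-nbd needs; purely illustrative:)

modinfo nbd     # errors out if no nbd module is available for this kernel
modprobe nbd    # loads it and creates /dev/nbd* devices when present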

Alternatively there's rbd-fuse, but in its current state it's too slow for
me. There's a [1] PR for an update to rbd-fuse that is promising. I have
seen the custom version of this rbd-fuse in action and it's really
impressive on speed. It can pretty much keep pace with the kernel client.
However, even if that does get merged, it'll be quite a while before it's
back-ported into a release.

[1] https://github.com/ceph/ceph/pull/23270
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tier monitoring

2018-09-10 Thread Fyodor Ustinov
Hi!

Does anyone have a recipe for monitoring of tiering pool?

Interested in such parameters as fullness, flush/evict/promote statistics and 
so on.
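
(Not a full recipe, but these all expose cache-tier numbers; the pool name and
OSD id are placeholders:)

ceph df detail                            # per-pool usage, including dirty objects in the cache pool
ceph osd pool stats cachepool             # the "cache tier io" line shows flush/evict/promote rates when active
ceph daemon osd.0 perf dump | grep tier_  # tier_promote / tier_flush / tier_evict counters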

WBR,
Fyodor.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Need a procedure for corrupted pg_log repair using ceph-kvstore-tool

2018-09-10 Thread Maks Kowalik
Can someone provide information about what to look for (and how to modify
the related leveldb keys) in case of an error like the following, which leads
to an OSD crash?

-5> 2018-09-10 14:46:30.896130 7efff657dd00 20 read_log_and_missing
712021'566147 (656569'562061) delete
28:b2d84df6:::rbd_data.423c863d6f7d13.071c:head by
client.442854982.0:26349 2018-08-22 12:45:48.366430 0
-4> 2018-09-10 14:46:30.896135 7efff657dd00 20 read_log_and_missing
712021'566148 (396232'430937) modify
28:b2a8dfc4:::rbd_data.1279a2016dd7ff07.1715:head by
client.375380018.0:66926373 2018-08-22 13:53:42.891543 0
-3> 2018-09-10 14:46:30.896140 7efff657dd00 20 read_log_and_missing
712021'566149 (455388'436624) modify
28:b2e5c03b:::rbd_data.c3b0cd3fe98040.0dd1:head by
client.357924238.0:32177266 2018-08-22 12:40:20.290431 0
-2> 2018-09-10 14:46:30.896145 7efff657dd00 20 read_log_and_missing
712021'566150 (455452'436627) modify
28:b2be4e96:::rbd_data.c3b0cd3fe98040.0e8e:head by
client.357924238.0:32178303 2018-08-22 13:51:03.149459 0
-1> 2018-09-10 14:46:30.896153 7efff657dd00 20 read_log_and_missing
714416'1 (0'0) error
28:b2b68805:::rbd_data.516e3914fdc210.1993:head by
client.441544789.0:109624 0.00 -2
 0> 2018-09-10 14:46:30.897918 7efff657dd00 -1
/build/ceph-12.2.7/src/osd/PGLog.h: In function 'static void
PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const
pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&,
bool, bool*, const DoutPrefixProvider*, std::set
>*, bool) [with missing_type = pg_missing_set; std::ostringstream =
std::basic_ostringstream]' thread 7efff657dd00 time 2018-09-10
14:46:30.896158
/build/ceph-12.2.7/src/osd/PGLog.h: 1354: FAILED
assert(last_e.version.version < e.version.version)


The ceph version is 12.2.7, and the current problem is a consequence of
multiple crashes of numerous OSDs due to some other ceph error.
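
(Before modifying any keys, one hedged first step is to dump the pg log from
the stopped OSD with ceph-objectstore-tool; the OSD id and pgid below are
placeholders:)

systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op log --pgid 28.xx > /tmp/pg-28.xx-log.json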
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help

2018-09-10 Thread John Spray
(adding list back)

The "clients failing to respond to capability release" messages can
sometimes indicate a bug in the client code, so it's a good idea to
make sure you've got the most recent fixes before investigating
further.  It's also useful to compare kernel vs. fuse clients to see
if the issue occurs in one but not the other.

The guidance on client choice and kernel versions is here:
http://docs.ceph.com/docs/master/cephfs/best-practices/#which-client

If you're happy running a non-LTS distro like Fedora, then I'd suggest
running the latest Fedora release (28).

John
On Mon, Sep 10, 2018 at 3:17 PM marc-antoine desrochers
 wrote:
>
> What are the advantages of using ceph-fuse? And if I stay on the kernel client,
> what kind of distro/kernel are you suggesting?
>
> -----Original Message-----
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: 10 September 2018 10:08
> To: marc-antoine.desroch...@sogetel.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Need help
>
> On Mon, Sep 10, 2018 at 1:40 PM marc-antoine desrochers 
>  wrote:
> >
> > Hi,
> >
> >
> >
> > I am currently running a ceph cluster running in CEPHFS with 3 nodes each 
> > have 6 osd’s except 1 who got 5. I got 3 mds : 2 active and 1 standby, 3 
> > mon.
> >
> >
> >
> >
> >
> > [root@ceph-n1 ~]# ceph -s
> >
> >   cluster:
> >
> > id: 1d97aa70-2029-463a-b6fa-20e98f3e21fb
> >
> > health: HEALTH_WARN
> >
> > 3 clients failing to respond to capability release
> >
> > 2 MDSs report slow requests
> >
> >
> >
> >   services:
> >
> > mon: 3 daemons, quorum ceph-n1,ceph-n2,ceph-n3
> >
> > mgr: ceph-n1(active), standbys: ceph-n2, ceph-n3
> >
> > mds: cephfs-2/2/2 up  {0=ceph-n1=up:active,1=ceph-n2=up:active}, 1
> > up:standby
> >
> > osd: 17 osds: 17 up, 17 in
> >
> >
> >
> >   data:
> >
> > pools:   2 pools, 1024 pgs
> >
> > objects: 541k objects, 42006 MB
> >
> > usage:   143 GB used, 6825 GB / 6969 GB avail
> >
> > pgs: 1024 active+clean
> >
> >
> >
> >   io:
> >
> > client:   32980 B/s rd, 77295 B/s wr, 5 op/s rd, 14 op/s wr
> >
> >
> >
> > I’m using the cephFs as a mail storage. I currently have 3500 mailbox
> > some of them are IMAP the others are POP3 the goal is to be able to
> > migrate all mailbox from my old
> >
> >
> >
> > infrastructure so around 30 000 mailbox.
> >
> >
> >
> > I’m now facing a problem :
> >
> > MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability
> > release
> >
> > mdsceph-n1(mds.0): Client mda3.sogetel.net failing to respond to
> > capability releaseclient_id: 1134426
> >
> > mdsceph-n1(mds.0): Client mda2.sogetel.net failing to respond to
> > capability releaseclient_id: 1172391
> >
> > mdsceph-n2(mds.1): Client mda3.sogetel.net failing to respond to
> > capability releaseclient_id: 1134426
> >
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> >
> > mdsceph-n1(mds.0): 112 slow requests are blocked > 30 sec
> >
> > mdsceph-n2(mds.1): 323 slow requests are blocked > 30 sec
> >
> >
> >
> > I can’t figure out how to fix this…
> >
> >
> >
> >
> > Here some information’s about my cluster :
> >
> > I’m running ceph luminous 12.2.5 on my 3 ceph nodes : ceph-n1, ceph-n2, 
> > ceph-n3.
> >
> >
> > I have 3 client identical :
> >
> > LSB Version::core-4.1-amd64:core-4.1-noarch
> >
> > Distributor ID: Fedora
> >
> > Description:Fedora release 25 (Twenty Five)
> >
> > Release:25
> >
> > Codename:   TwentyFive
> >
>
> I can't say for sure whether it would help, but I'd definitely suggest 
> upgrading those nodes to latest Fedora if you're using the kernel client -- 
> Fedora 25 hasn't received updates for quite some time.
>
> John
>
> >
> > My ceph nodes :
> >
> >
> >
> > CentOS Linux release 7.5.1804 (Core)
> >
> > NAME="CentOS Linux"
> >
> > VERSION="7 (Core)"
> >
> > ID="centos"
> >
> > ID_LIKE="rhel fedora"
> >
> > VERSION_ID="7"
> >
> > PRETTY_NAME="CentOS Linux 7 (Core)"
> >
> > ANSI_COLOR="0;31"
> >
> > CPE_NAME="cpe:/o:centos:centos:7"
> >
> > HOME_URL="https://www.centos.org/";
> >
> > BUG_REPORT_URL="https://bugs.centos.org/";
> >
> >
> >
> > CENTOS_MANTISBT_PROJECT="CentOS-7"
> >
> > CENTOS_MANTISBT_PROJECT_VERSION="7"
> >
> > REDHAT_SUPPORT_PRODUCT="centos"
> >
> > REDHAT_SUPPORT_PRODUCT_VERSION="7"
> >
> >
> >
> > CentOS Linux release 7.5.1804 (Core)
> >
> > CentOS Linux release 7.5.1804 (Core)
> >
> >
> >
> > ceph daemon mds.ceph-n1 perf dump mds :
> >
> >
> >
> >
> >
> > "mds": {
> >
> > "request": 21968558,
> >
> > "reply": 21954801,
> >
> > "reply_latency": {
> >
> > "avgcount": 21954801,
> >
> > "sum": 100879.560315258,
> >
> > "avgtime": 0.004594874
> >
> > },
> >
> > "forward": 13627,
> >
> > "dir_fetch": 3327,
> >
> > "dir_commit": 162830,
> >
> > "dir_split": 1,
> >
> > "dir_merge": 0,
> >
> > "inode_max": 2147483647,
> >
> > "inodes"

Re: [ceph-users] Upgrade to Infernalis: OSDs crash all the time

2018-09-10 Thread Kees Meijs
Hi list,

A little update: meanwhile we added a new node consisting of Hammer OSDs
to ensure sufficient cluster capacity.

The upgraded node with Infernalis OSDs is completely removed from the
CRUSH map and the OSDs removed (obviously we didn't wipe the disks yet).

At the moment we're still running using flags
noout,nobackfill,noscrub,nodeep-scrub. Although only Hammer OSDs remain,
we still experience OSD crashes on backfilling, so we're unable to reach
the HEALTH_OK state.

Using debug level 20 we're (mostly my coworker Willem Jan is) figuring
out exactly why the crashes happen. Hopefully we'll figure it out.
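
(For anyone wanting to do the same, raising the OSD debug level can be done
roughly like this; remember to lower it again afterwards:)

ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'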

To be continued...

Regards,
Kees

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help

2018-09-10 Thread John Spray
On Mon, Sep 10, 2018 at 1:40 PM marc-antoine desrochers
 wrote:
>
> Hi,
>
>
>
> I am currently running a ceph cluster running in CEPHFS with 3 nodes each 
> have 6 osd’s except 1 who got 5. I got 3 mds : 2 active and 1 standby, 3 mon.
>
>
>
>
>
> [root@ceph-n1 ~]# ceph -s
>
>   cluster:
>
> id: 1d97aa70-2029-463a-b6fa-20e98f3e21fb
>
> health: HEALTH_WARN
>
> 3 clients failing to respond to capability release
>
> 2 MDSs report slow requests
>
>
>
>   services:
>
> mon: 3 daemons, quorum ceph-n1,ceph-n2,ceph-n3
>
> mgr: ceph-n1(active), standbys: ceph-n2, ceph-n3
>
> mds: cephfs-2/2/2 up  {0=ceph-n1=up:active,1=ceph-n2=up:active}, 1 
> up:standby
>
> osd: 17 osds: 17 up, 17 in
>
>
>
>   data:
>
> pools:   2 pools, 1024 pgs
>
> objects: 541k objects, 42006 MB
>
> usage:   143 GB used, 6825 GB / 6969 GB avail
>
> pgs: 1024 active+clean
>
>
>
>   io:
>
> client:   32980 B/s rd, 77295 B/s wr, 5 op/s rd, 14 op/s wr
>
>
>
> I’m using the cephFs as a mail storage. I currently have 3500 mailbox some of 
> them are IMAP the others are POP3 the goal is to be able to migrate all 
> mailbox from my old
>
>
>
> infrastructure so around 30 000 mailbox.
>
>
>
> I’m now facing a problem :
>
> MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
>
> mdsceph-n1(mds.0): Client mda3.sogetel.net failing to respond to 
> capability releaseclient_id: 1134426
>
> mdsceph-n1(mds.0): Client mda2.sogetel.net failing to respond to 
> capability releaseclient_id: 1172391
>
> mdsceph-n2(mds.1): Client mda3.sogetel.net failing to respond to 
> capability releaseclient_id: 1134426
>
> MDS_SLOW_REQUEST 2 MDSs report slow requests
>
> mdsceph-n1(mds.0): 112 slow requests are blocked > 30 sec
>
> mdsceph-n2(mds.1): 323 slow requests are blocked > 30 sec
>
>
>
> I can’t figure out how to fix this…
>
>
>
>
> Here some information’s about my cluster :
>
> I’m running ceph luminous 12.2.5 on my 3 ceph nodes : ceph-n1, ceph-n2, 
> ceph-n3.
>
>
> I have 3 client identical :
>
> LSB Version::core-4.1-amd64:core-4.1-noarch
>
> Distributor ID: Fedora
>
> Description:Fedora release 25 (Twenty Five)
>
> Release:25
>
> Codename:   TwentyFive
>

I can't say for sure whether it would help, but I'd definitely suggest
upgrading those nodes to latest Fedora if you're using the kernel
client -- Fedora 25 hasn't received updates for quite some time.

John

>
> My ceph nodes :
>
>
>
> CentOS Linux release 7.5.1804 (Core)
>
> NAME="CentOS Linux"
>
> VERSION="7 (Core)"
>
> ID="centos"
>
> ID_LIKE="rhel fedora"
>
> VERSION_ID="7"
>
> PRETTY_NAME="CentOS Linux 7 (Core)"
>
> ANSI_COLOR="0;31"
>
> CPE_NAME="cpe:/o:centos:centos:7"
>
> HOME_URL="https://www.centos.org/";
>
> BUG_REPORT_URL="https://bugs.centos.org/";
>
>
>
> CENTOS_MANTISBT_PROJECT="CentOS-7"
>
> CENTOS_MANTISBT_PROJECT_VERSION="7"
>
> REDHAT_SUPPORT_PRODUCT="centos"
>
> REDHAT_SUPPORT_PRODUCT_VERSION="7"
>
>
>
> CentOS Linux release 7.5.1804 (Core)
>
> CentOS Linux release 7.5.1804 (Core)
>
>
>
> ceph daemon mds.ceph-n1 perf dump mds :
>
>
>
>
>
> "mds": {
>
> "request": 21968558,
>
> "reply": 21954801,
>
> "reply_latency": {
>
> "avgcount": 21954801,
>
> "sum": 100879.560315258,
>
> "avgtime": 0.004594874
>
> },
>
> "forward": 13627,
>
> "dir_fetch": 3327,
>
> "dir_commit": 162830,
>
> "dir_split": 1,
>
> "dir_merge": 0,
>
> "inode_max": 2147483647,
>
> "inodes": 68767,
>
> "inodes_top": 4524,
>
> "inodes_bottom": 56697,
>
> "inodes_pin_tail": 7546,
>
> "inodes_pinned": 62304,
>
> "inodes_expired": 1640159,
>
> "inodes_with_caps": 62192,
>
> "caps": 114126,
>
> "subtrees": 14,
>
> "traverse": 38309963,
>
> "traverse_hit": 37606227,
>
> "traverse_forward": 12189,
>
> "traverse_discover": 6634,
>
> "traverse_dir_fetch": 1769,
>
> "traverse_remote_ino": 6,
>
> "traverse_lock": 7731,
>
> "load_cent": 2196856701,
>
> "q": 0,
>
> "exported": 143,
>
> "exported_inodes": 291372,
>
> "imported": 125,
>
> "imported_inodes": 176509
>
>
>
>
>
> Thanks for your help…
>
>
>
> Regards
>
>
>
> Marc-Antoine
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help

2018-09-10 Thread Burkhard Linke

Hi,


On 09/10/2018 02:40 PM, marc-antoine desrochers wrote:

Hi,

  


I am currently running a ceph cluster running in CEPHFS with 3 nodes each
have 6 osd's except 1 who got 5. I got 3 mds : 2 active and 1 standby, 3
mon.

  

  


[root@ceph-n1 ~]# ceph -s

   cluster:

 id: 1d97aa70-2029-463a-b6fa-20e98f3e21fb

 health: HEALTH_WARN

 3 clients failing to respond to capability release

 2 MDSs report slow requests


*snipsnap*


I'm now facing a problem :

MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release

 mdsceph-n1(mds.0): Client mda3.sogetel.net failing to respond to
capability releaseclient_id: 1134426

 mdsceph-n1(mds.0): Client mda2.sogetel.net failing to respond to
capability releaseclient_id: 1172391

 mdsceph-n2(mds.1): Client mda3.sogetel.net failing to respond to
capability releaseclient_id: 1134426

MDS_SLOW_REQUEST 2 MDSs report slow requests

 mdsceph-n1(mds.0): 112 slow requests are blocked > 30 sec

 mdsceph-n2(mds.1): 323 slow requests are blocked > 30 sec
The messages indicate that clients do not release capabilities for 
opened/cached files. These files are either accessed by other clients 
(and thus these other clients need to acquire the capabilities), or the 
MDS runs out of memory and tries to reduce the number of capabilities in 
its bookkeeping to reduce the memory footprint. In both cases the 
client request to open a file is blocked.


In case of the second problem, you can increase the mds cache size to 
allow it to store more inode and capability entries 
(mds_cache_memory_limit in ceph.conf). You should also try to figure out 
why the clients do not release the capabilities, e.g. whether they 
really have a large number of open/cached files.
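
(As an illustration only; the cache limit goes into ceph.conf on the MDS nodes
and per-client caps can be inspected on a running MDS. The 8 GiB value is
arbitrary:)

[mds]
mds_cache_memory_limit = 8589934592   # 8 GiB, size to the available RAM

ceph daemon mds.ceph-n1 session ls    # shows num_caps per client session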


Do you use ceph-fuse or the kernel based implementation?

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help

2018-09-10 Thread Marc Roos
 
I guess good luck. Maybe you can ask these guys to hurry up and get 
something production ready.
https://github.com/ceph-dovecot/dovecot-ceph-plugin




-Original Message-
From: marc-antoine desrochers 
[mailto:marc-antoine.desroch...@sogetel.com] 
Sent: Monday, 10 September 2018 14:40
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Need help

Hi,

 

I am currently running a ceph cluster running in CEPHFS with 3 nodes 
each have 6 osd’s except 1 who got 5. I got 3 mds : 2 active and 1 
standby, 3 mon.

 

 

[root@ceph-n1 ~]# ceph -s

  cluster:

id: 1d97aa70-2029-463a-b6fa-20e98f3e21fb

health: HEALTH_WARN

3 clients failing to respond to capability release

2 MDSs report slow requests

 

  services:

mon: 3 daemons, quorum ceph-n1,ceph-n2,ceph-n3

mgr: ceph-n1(active), standbys: ceph-n2, ceph-n3

mds: cephfs-2/2/2 up  {0=ceph-n1=up:active,1=ceph-n2=up:active}, 1 
up:standby

osd: 17 osds: 17 up, 17 in

 

  data:

pools:   2 pools, 1024 pgs

objects: 541k objects, 42006 MB

usage:   143 GB used, 6825 GB / 6969 GB avail

pgs: 1024 active+clean

 

  io:

client:   32980 B/s rd, 77295 B/s wr, 5 op/s rd, 14 op/s wr



I’m using the cephFs as a mail storage. I currently have 3500 mailbox 
some of them are IMAP the others are POP3 the goal is to be able to 
migrate all mailbox from my old 

 

infrastructure so around 30 000 mailbox.

 

I’m now facing a problem : 

MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability 
release

mdsceph-n1(mds.0): Client mda3.sogetel.net failing to respond to 
capability releaseclient_id: 1134426

mdsceph-n1(mds.0): Client mda2.sogetel.net failing to respond to 
capability releaseclient_id: 1172391

mdsceph-n2(mds.1): Client mda3.sogetel.net failing to respond to 
capability releaseclient_id: 1134426

MDS_SLOW_REQUEST 2 MDSs report slow requests

mdsceph-n1(mds.0): 112 slow requests are blocked > 30 sec

mdsceph-n2(mds.1): 323 slow requests are blocked > 30 sec

 

I can’t figure out how to fix this… 

 

Here some information’s about my cluster :



I’m running ceph luminous 12.2.5 on my 3 ceph nodes : ceph-n1, ceph-n2, 
ceph-n3.


I have 3 client identical :

LSB Version::core-4.1-amd64:core-4.1-noarch

Distributor ID: Fedora

Description:Fedora release 25 (Twenty Five)

Release:25

Codename:   TwentyFive

 

My ceph nodes :

 

CentOS Linux release 7.5.1804 (Core)

NAME="CentOS Linux"

VERSION="7 (Core)"

ID="centos"

ID_LIKE="rhel fedora"

VERSION_ID="7"

PRETTY_NAME="CentOS Linux 7 (Core)"

ANSI_COLOR="0;31"

CPE_NAME="cpe:/o:centos:centos:7"

HOME_URL="https://www.centos.org/";

BUG_REPORT_URL="https://bugs.centos.org/";

 

CENTOS_MANTISBT_PROJECT="CentOS-7"

CENTOS_MANTISBT_PROJECT_VERSION="7"

REDHAT_SUPPORT_PRODUCT="centos"

REDHAT_SUPPORT_PRODUCT_VERSION="7"

 

CentOS Linux release 7.5.1804 (Core)

CentOS Linux release 7.5.1804 (Core)

 

ceph daemon mds.ceph-n1 perf dump mds :

 

 

"mds": {

"request": 21968558,

"reply": 21954801,

"reply_latency": {

"avgcount": 21954801,

"sum": 100879.560315258,

"avgtime": 0.004594874

},

"forward": 13627,

"dir_fetch": 3327,

"dir_commit": 162830,

"dir_split": 1,

"dir_merge": 0,

"inode_max": 2147483647,

"inodes": 68767,

"inodes_top": 4524,

"inodes_bottom": 56697,

"inodes_pin_tail": 7546,

"inodes_pinned": 62304,

"inodes_expired": 1640159,

"inodes_with_caps": 62192,

"caps": 114126,

"subtrees": 14,

"traverse": 38309963,

"traverse_hit": 37606227,

"traverse_forward": 12189,

"traverse_discover": 6634,

"traverse_dir_fetch": 1769,

"traverse_remote_ino": 6,

"traverse_lock": 7731,

"load_cent": 2196856701,

"q": 0,

"exported": 143,

"exported_inodes": 291372,

"imported": 125,

"imported_inodes": 176509

 

 

Thanks for your help…

 

Regards 

 

Marc-Antoine 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] upgrade jewel to luminous with ec + cache pool

2018-09-10 Thread David Turner
Yes, migrating to 12.2.8 is fine. Migrating away from the cache tier is as
simple as setting the EC pool to allow EC overwrites, changing the cache tier
mode to forward, flushing the tier, and removing it. Basically, once you have
EC overwrites, just follow the steps in the docs for removing a cache tier.
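
(A rough sketch of those steps; the pool names are placeholders, and
allow_ec_overwrites additionally requires BlueStore OSDs:)

ceph osd pool set ec-data allow_ec_overwrites true
ceph osd tier cache-mode cache-pool forward --yes-i-really-mean-it
rados -p cache-pool cache-flush-evict-all
ceph osd tier remove-overlay ec-data
ceph osd tier remove ec-data cache-pool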

On Mon, Sep 10, 2018, 7:29 AM Markus Hickel  wrote:

> Dear all,
>
> i am running a cephfs cluster (jewel 10.2.10) with a ec + cache pool.
> There is a thread in the ML that states skipping 10.2.11 and going to
> 11.2.8 is possible, does this work with ec + cache pool aswell ?
>
> I also wanted to ask if there is a recommended migration path from cephfs
> with ec + cache pool to cephfs with ec pool only ? Creating a second cephfs
> and moving the files would come to my mind, but maybe there is a smarter
> way ?
>
> Cheers,
> Markus
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Need help

2018-09-10 Thread marc-antoine desrochers
Hi,

 

I am currently running a Ceph cluster with CephFS on 3 nodes, each with 6
OSDs except one that has 5. I have 3 MDS daemons (2 active and 1 standby) and 3
mons.

 

 

[root@ceph-n1 ~]# ceph -s

  cluster:

id: 1d97aa70-2029-463a-b6fa-20e98f3e21fb

health: HEALTH_WARN

3 clients failing to respond to capability release

2 MDSs report slow requests

 

  services:

mon: 3 daemons, quorum ceph-n1,ceph-n2,ceph-n3

mgr: ceph-n1(active), standbys: ceph-n2, ceph-n3

mds: cephfs-2/2/2 up  {0=ceph-n1=up:active,1=ceph-n2=up:active}, 1
up:standby

osd: 17 osds: 17 up, 17 in

 

  data:

pools:   2 pools, 1024 pgs

objects: 541k objects, 42006 MB

usage:   143 GB used, 6825 GB / 6969 GB avail

pgs: 1024 active+clean

 

  io:

client:   32980 B/s rd, 77295 B/s wr, 5 op/s rd, 14 op/s wr



I'm using CephFS as mail storage. I currently have 3,500 mailboxes; some
of them are IMAP, the others POP3. The goal is to be able to migrate all
mailboxes from my old 

 

infrastructure, so around 30 000 mailboxes.

 

I'm now facing a problem : 

MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release

mdsceph-n1(mds.0): Client mda3.sogetel.net failing to respond to
capability releaseclient_id: 1134426

mdsceph-n1(mds.0): Client mda2.sogetel.net failing to respond to
capability releaseclient_id: 1172391

mdsceph-n2(mds.1): Client mda3.sogetel.net failing to respond to
capability releaseclient_id: 1134426

MDS_SLOW_REQUEST 2 MDSs report slow requests

mdsceph-n1(mds.0): 112 slow requests are blocked > 30 sec

mdsceph-n2(mds.1): 323 slow requests are blocked > 30 sec

 

I can't figure out how to fix this. 

 

Here some information's about my cluster :



I'm running ceph luminous 12.2.5 on my 3 ceph nodes : ceph-n1, ceph-n2,
ceph-n3.


I have 3 client identical :

LSB Version::core-4.1-amd64:core-4.1-noarch

Distributor ID: Fedora

Description:Fedora release 25 (Twenty Five)

Release:25

Codename:   TwentyFive

 

My ceph nodes :

 

CentOS Linux release 7.5.1804 (Core)

NAME="CentOS Linux"

VERSION="7 (Core)"

ID="centos"

ID_LIKE="rhel fedora"

VERSION_ID="7"

PRETTY_NAME="CentOS Linux 7 (Core)"

ANSI_COLOR="0;31"

CPE_NAME="cpe:/o:centos:centos:7"

HOME_URL="https://www.centos.org/";

BUG_REPORT_URL="https://bugs.centos.org/";

 

CENTOS_MANTISBT_PROJECT="CentOS-7"

CENTOS_MANTISBT_PROJECT_VERSION="7"

REDHAT_SUPPORT_PRODUCT="centos"

REDHAT_SUPPORT_PRODUCT_VERSION="7"

 

CentOS Linux release 7.5.1804 (Core)

CentOS Linux release 7.5.1804 (Core)

 

ceph daemon mds.ceph-n1 perf dump mds :

 

 

"mds": {

"request": 21968558,

"reply": 21954801,

"reply_latency": {

"avgcount": 21954801,

"sum": 100879.560315258,

"avgtime": 0.004594874

},

"forward": 13627,

"dir_fetch": 3327,

"dir_commit": 162830,

"dir_split": 1,

"dir_merge": 0,

"inode_max": 2147483647,

"inodes": 68767,

"inodes_top": 4524,

"inodes_bottom": 56697,

"inodes_pin_tail": 7546,

"inodes_pinned": 62304,

"inodes_expired": 1640159,

"inodes_with_caps": 62192,

"caps": 114126,

"subtrees": 14,

"traverse": 38309963,

"traverse_hit": 37606227,

"traverse_forward": 12189,

"traverse_discover": 6634,

"traverse_dir_fetch": 1769,

"traverse_remote_ino": 6,

"traverse_lock": 7731,

"load_cent": 2196856701,

"q": 0,

"exported": 143,

"exported_inodes": 291372,

"imported": 125,

"imported_inodes": 176509

 

 

Thanks for your help.

 

Regards 

 

Marc-Antoine 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tcmu-runner could not find handler

2018-09-10 Thread Jason Dillaman
On Mon, Sep 10, 2018 at 6:36 AM 展荣臻  wrote:
>
> Hi everyone,
>
> I want to export a Ceph RBD via iSCSI.
> The Ceph version is 10.2.11, CentOS 7.5, kernel 3.10.0-862.el7.x86_64,
> and I also installed
> tcmu-runner, targetcli-fb, python-rtslib, ceph-iscsi-config and ceph-iscsi-cli.
> But when I launch "create pool=rbd image=disk_1 size=10G" with gwcli it
> says "Failed : 500 INTERNAL SERVER ERROR".
> Below is the content of /var/log/tcmu-runner.log:
> 2018-09-10 09:29:38.856 14279 [INFO] dyn_config_start:425: event->mask: 0x800
> 2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x4
> 2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x400
> 2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x8000
> 2018-09-10 09:29:38.857 14279 [WARN] tcmu_conf_set_options:156: The logdir option is not supported by dynamic reloading for now!
> 2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x20
> 2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x1
> 2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x10
> 2018-09-10 10:22:38.449 14279 [DEBUG] handle_netlink:207: cmd 1. Got header version 2. Supported 2.
> 2018-09-10 10:22:38.450 14279 [ERROR] add_device:485: could not find handler for uio0
> 2018-09-10 18:05:23.720 14279 [DEBUG] handle_netlink:207: cmd 1. Got header version 2. Supported 2.
> 2018-09-10 18:05:23.721 14279 [ERROR] add_device:485: could not find handler for uio0
> 2018-09-10 18:18:24.393 14279 [DEBUG] handle_netlink:207: cmd 1. Got header version 2. Supported 2.
> 2018-09-10 18:18:24.393 14279 [ERROR] add_device:485: could not find handler for uio0
>
> In http://docs.ceph.com/docs/master/rbd/iscsi-overview/ it says Ceph Luminous
> or newer is required.
> Can someone tell me how to get LIO to support Jewel?

Jewel is not supported for iSCSI (it's actually EOLed as well). I
presume that you built your own tcmu-runner? I think it's basically
saying that it cannot find the "/usr/lib64/tcmu-runner/handler_rbd.so"
plugin for tcmu-runner, which would make sense if it failed to compile
in your build environment.

> I am from China, my English is not very good, I hope you can understand.
> Thanks for any help anyone can provide!!!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Sage Weil
I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the 
time.
- The messages coming in are mostly osd_failure, and half of those seem to 
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates 
to the messaging layer.  It might be worth looking at an OSD log for an 
osd that reported a failure and seeing what error code is coming up on the 
failed ping connection?  That might provide a useful hint (e.g., 
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...
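
(Something along these lines, with the OSD id and log path only as examples:)

ceph osd set nodown           # keep the osdmap stable while looking
grep -i 'heartbeat_check' /var/log/ceph/ceph-osd.39.log | tail
ceph osd unset nodown         # when done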

sage
 



On Mon, 10 Sep 2018, Kevin Hrpcek wrote:

> Update for the list archive.
> 
> I went ahead and finished the mimic upgrade with the osds in a fluctuating
> state of up and down. The cluster did start to normalize a lot easier after
> everything was on mimic since the random mass OSD heartbeat failures stopped
> and the constant mon election problem went away. I'm still battling with the
> cluster reacting poorly to host reboots or small map changes, but I feel like
> my current pg:osd ratio may be playing a factor in that since we are 2x normal
> pg count while migrating data to new EC pools.
> 
> I'm not sure of the root cause but it seems like the mix of luminous and mimic
> did not play well together for some reason. Maybe it has to do with the scale
> of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster
> has scaled to this size.
> 
> Kevin
> 
> 
> On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
> > Nothing too crazy for non default settings. Some of those osd settings were
> > in place while I was testing recovery speeds and need to be brought back
> > closer to defaults. I was setting nodown before but it seems to mask the
> > problem. While its good to stop the osdmap changes, OSDs would come up, get
> > marked up, but at some point go down again (but the process is still
> > running) and still stay up in the map. Then when I'd unset nodown the
> > cluster would immediately mark 250+ osd down again and i'd be back where I
> > started.
> > 
> > This morning I went ahead and finished the osd upgrades to mimic to remove
> > that variable. I've looked for networking problems but haven't found any. 2
> > of the mons are on the same switch. I've also tried combinations of shutting
> > down a mon to see if a single one was the problem, but they keep electing no
> > matter the mix of them that are up. Part of it feels like a networking
> > problem but I haven't been able to find a culprit yet as everything was
> > working normally before starting the upgrade. Other than the constant mon
> > elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it
> > doesn't last long since at some point the OSDs start trying to fail each
> > other through their heartbeats.
> > 2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
> > prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908
> > is reporting failure:1
> > 2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] :
> > osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
> > 2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
> > prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372
> > 10.1.9.13:6801/275806 is reporting failure:1
> > 
> > I'm working on getting things mostly good again with everything on mimic and
> > will see if it behaves better.
> > 
> > Thanks for your input on this David.
> > 
> > 
> > [global]
> > mon_initial_members = sephmon1, sephmon2, sephmon3
> > mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = 10.1.0.0/16
> > osd backfill full ratio = 0.92
> > osd failsafe nearfull ratio = 0.90
> > osd max object size = 21474836480
> > mon max pg per osd = 350
> > 
> > [mon]
> > mon warn on legacy crush tunables = false
> > mon pg warn max per osd = 300
> > mon osd down out subtree limit = host
> > mon osd nearfull ratio = 0.90
> > mon osd full ratio = 0.97
> > mon health preluminous compat warning = false
> > osd heartbeat grace = 60
> > rocksdb cache size = 1342177280
> > 
> > [mds]
> > mds log max segments = 100
> > mds log max expiring = 40
> > mds bal fragment size max = 20
> > mds cache memory limit = 4294967296
> > 
> > [osd]
> > osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
> > osd recovery delay start = 30
> > osd recovery max active = 5
> > osd max backfills = 3
> > osd recovery threads = 2
> > osd crush initial weight = 0
> > osd heartbeat interval = 30
> > osd heartbeat grace = 60
> > 
> > 
> > On 09/08/2018 11:24 PM, David Turner wrote:
> > > What osd/mon/etc config settings do you have that are not default? It
> > > might be worth utilizing nodown to stop osds from marking 

[ceph-users] upgrade jewel to luminous with ec + cache pool

2018-09-10 Thread Markus Hickel
Dear all,

I am running a cephfs cluster (jewel 10.2.10) with an EC + cache pool. There is 
a thread in the ML that states skipping 10.2.11 and going to 11.2.8 is 
possible; does this work with an EC + cache pool as well?

I also wanted to ask if there is a recommended migration path from cephfs with 
an EC + cache pool to cephfs with an EC pool only? Creating a second cephfs and 
moving the files comes to mind, but maybe there is a smarter way?
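
For what it's worth, the brute-force version of that would be something along
these lines (mount points are assumptions, and it needs enough free capacity
plus a window where clients are quiesced):

rsync -aHAX --progress /mnt/cephfs-old/ /mnt/cephfs-new/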

Cheers,
Markus  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tcmu-runner could not find handler

2018-09-10 Thread 展荣臻
hi!everyone:


I want to export a ceph rbd image via iSCSI.
The ceph version is 10.2.11, on CentOS 7.5 with kernel 3.10.0-862.el7.x86_64,
and I also installed
tcmu-runner, targetcli-fb, python-rtslib, ceph-iscsi-config and ceph-iscsi-cli.
But when I launch "create pool=rbd image=disk_1 size=10G" with gwcli it
says "Failed : 500 INTERNAL SERVER ERROR".
Below is the content of /var/log/tcmu-runner.log:
2018-09-10 09:29:38.856 14279 [INFO] dyn_config_start:425: event->mask: 0x800
2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x4
2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x400
2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x8000
2018-09-10 09:29:38.857 14279 [WARN] tcmu_conf_set_options:156: The logdir option is not supported by dynamic reloading for now!
2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x20
2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x1
2018-09-10 09:29:38.857 14279 [INFO] dyn_config_start:425: event->mask: 0x10
2018-09-10 10:22:38.449 14279 [DEBUG] handle_netlink:207: cmd 1. Got header version 2. Supported 2.
2018-09-10 10:22:38.450 14279 [ERROR] add_device:485: could not find handler for uio0
2018-09-10 18:05:23.720 14279 [DEBUG] handle_netlink:207: cmd 1. Got header version 2. Supported 2.
2018-09-10 18:05:23.721 14279 [ERROR] add_device:485: could not find handler for uio0
2018-09-10 18:18:24.393 14279 [DEBUG] handle_netlink:207: cmd 1. Got header version 2. Supported 2.
2018-09-10 18:18:24.393 14279 [ERROR] add_device:485: could not find handler for uio0


In http://docs.ceph.com/docs/master/rbd/iscsi-overview/ it says that Ceph 
Luminous or newer is required.
Can someone tell me how to get LIO to support jewel?


I am from China, my English is not very good, I hope you can understand.
Thanks for any help anyone can provide!!!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore DB size and onode count

2018-09-10 Thread Nick Fisk
If anybody has 5 minutes could they just clarify a couple of things for me

1. onode count, should this be equal to the number of objects stored on the OSD?
Through reading several posts, there seems to be a general indication that this 
is the case, but looking at my OSDs the maths doesn't work.

Eg.
ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS
 0   hdd 2.73679  1.0 2802G  1347G  1454G 48.09 0.69 115

So a 3TB OSD, roughly half full. This is a pure RBD workload (no snapshots or 
anything clever), so let's assume a worst-case scenario of
4MB objects (compression is on, however, which would only mean more objects for 
a given size).
1347000/4=~336750 expected objects

sudo ceph daemon osd.0 perf dump | grep blue
"bluefs": {
"bluestore": {
"bluestore_allocated": 1437813964800,
"bluestore_stored": 2326118994003,
"bluestore_compressed": 445228558486,
"bluestore_compressed_allocated": 547649159168,
"bluestore_compressed_original": 1437773843456,
"bluestore_onodes": 99022,
"bluestore_onode_hits": 18151499,
"bluestore_onode_misses": 4539604,
"bluestore_onode_shard_hits": 10596780,
"bluestore_onode_shard_misses": 4632238,
"bluestore_extents": 896365,
"bluestore_blobs": 861495,

99022 onodes, anyone care to enlighten me?

2. block.db Size
sudo ceph daemon osd.0 perf dump | grep db
"db_total_bytes": 8587829248,
"db_used_bytes": 2375024640,

2.3GB = 0.17% of the data size. This seems a lot lower than the 1% recommendation 
(10GB for every 1TB) or the 4% given in the official docs. I
know that different workloads will have differing overheads and potentially 
smaller objects, but am I understanding these figures
correctly? They seem dramatically lower.
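
For reference, the arithmetic behind that figure (a sketch; counting GB as 10^9
rather than 2^30 bytes moves it between roughly 0.16% and 0.18%):

echo 'scale=6; 2375024640 / (1347*1024^3) * 100' | bc
# prints ~0.164, i.e. the ~0.17% above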

Regards,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Force unmap of RBD image

2018-09-10 Thread Martin Palma
Thanks for the suggestions; I will in future check for LVM volumes,
etc. The kernel version is 3.10.0-327.4.4.el7.x86_64
and the OS is CentOS 7.2.1511 (Core).


Best,
Martin
On Mon, Sep 10, 2018 at 12:23 PM Ilya Dryomov  wrote:
>
> On Mon, Sep 10, 2018 at 10:46 AM Martin Palma  wrote:
> >
> > We are trying to unmap an rbd image from a host for deletion and
> > hitting the following error:
> >
> > rbd: sysfs write failed
> > rbd: unmap failed: (16) Device or resource busy
> >
> > We used commands like "lsof" and "fuser" but nothing is reported to
> > use the device. Also checked for watcher with "rados -p pool
> > listwatchers image.rbd" but there aren't any listed.
>
> The device is still open by someone.  Check for LVM volumes, multipath,
> loop devices etc.  None of those typically show up in lsof.
>
> >
> > By investigating `/sys/kernel/debug/ceph//osdc` we get:
> >
> > 160460241osd15019.b2af34image.rbd
> > 231954'1271503593144320watch
>
> Which kernel is that?
>
> >
> > Our goal is to unmap the image for deletion so if the unmap process
> > should destroy the image is for us OK.
> >
> > Any help/suggestions?
>
> On newer kernels you could do "rbd unmap -o force ", but it
> looks like you are running an older kernel.
>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Force unmap of RBD image

2018-09-10 Thread Ilya Dryomov
On Mon, Sep 10, 2018 at 10:46 AM Martin Palma  wrote:
>
> We are trying to unmap an rbd image from a host for deletion and
> hitting the following error:
>
> rbd: sysfs write failed
> rbd: unmap failed: (16) Device or resource busy
>
> We used commands like "lsof" and "fuser" but nothing is reported to
> use the device. Also checked for watcher with "rados -p pool
> listwatchers image.rbd" but there aren't any listed.

The device is still open by someone.  Check for LVM volumes, multipath,
loop devices etc.  None of those typically show up in lsof.
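
A few things worth checking against the mapped device (device name assumed to
be /dev/rbd0 here):

lsblk /dev/rbd0              # partitions or device-mapper volumes stacked on top
ls /sys/block/rbd0/holders/  # anything still holding the block device shows up here
losetup -a                   # loop devices backed by files on the mapped image
ls -l /dev/rbd0              # note the major:minor, then check "dmsetup deps"
                             # for a dm device that still references it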

>
> By investigating `/sys/kernel/debug/ceph//osdc` we get:
>
> 160460241osd15019.b2af34image.rbd
> 231954'1271503593144320watch

Which kernel is that?

>
> Our goal is to unmap the image for deletion so if the unmap process
> should destroy the image is for us OK.
>
> Any help/suggestions?

On newer kernels you could do "rbd unmap -o force ", but it
looks like you are running an older kernel.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Tiering stats are blank on Bluestore OSD's

2018-09-10 Thread Nick Fisk
After upgrading a number of OSDs to Bluestore I have noticed that the cache 
tier OSDs which have so far been upgraded are no
longer reporting tier_* stats
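(both dumps below come from something along the lines of
"ceph daemon osd.<id> perf dump | grep tier"; the osd id is an assumption):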

"tier_promote": 0,
"tier_flush": 0,
"tier_flush_fail": 0,
"tier_try_flush": 0,
"tier_try_flush_fail": 0,
"tier_evict": 0,
"tier_whiteout": 0,
"tier_dirty": 0,
"tier_clean": 0,
"tier_delay": 0,
"tier_proxy_read": 0,
"tier_proxy_write": 0,
"osd_tier_flush_lat": {
"osd_tier_promote_lat": {
"osd_tier_r_lat": {

Example from Filestore OSD (both are running 12.2.8)
"tier_promote": 265140,
"tier_flush": 0,
"tier_flush_fail": 0,
"tier_try_flush": 88942,
"tier_try_flush_fail": 0,
"tier_evict": 264773,
"tier_whiteout": 35,
"tier_dirty": 89314,
"tier_clean": 89207,
"tier_delay": 0,
"tier_proxy_read": 1446068,
"tier_proxy_write": 10957517,
"osd_tier_flush_lat": {
"osd_tier_promote_lat": {
"osd_tier_r_lat": {

"New Issue" button on tracker seems to cause a 500 error btw

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados performance inconsistencies, lower than expected performance

2018-09-10 Thread Menno Zonneveld


-Original message-
> From:Alwin Antreich 
> Sent: Thursday 6th September 2018 18:36
> To: ceph-users 
> Cc: Menno Zonneveld ; Marc Roos 
> Subject: Re: [ceph-users] Rados performance inconsistencies, lower than 
> expected performance
> 
> On Thu, Sep 06, 2018 at 05:15:26PM +0200, Marc Roos wrote:
> > 
> > It is idle, testing still, running backups at night on it.
> > How do you fill up the cluster so you can test between empty and full? 
> > Do you have a "ceph df" from empty and full? 
> > 
> > I have done another test disabling new scrubs on the rbd.ssd pool (but 
> > still 3 on hdd) with:
> > ceph tell osd.* injectargs --osd_max_backfills=0
> > Again getting slower towards the end.
> > Bandwidth (MB/sec): 395.749
> > Average Latency(s): 0.161713
> In the results you both had, the latency is twice as high as in our
> tests [1]. That can already make quite some difference. Depending on the
> actual hardware used, there may or may not be the possibility for good
> optimisation.
> 
> As a start, you could test the disks with fio, as shown in our benchmark
> paper, to get some results for comparison. The forum thread [1] has
> some benchmarks from other users for comparison.
> 
> [1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

Thanks for the suggestion. I redid the fio test and one server seems to be 
causing trouble.

When I initially tested our SSDs according to the benchmark paper, our Intel 
SSDs performed more or less on par with the Samsung SSDs used there.

from fio.log

fio: (groupid=0, jobs=1): err= 0: pid=3606315: Mon Sep 10 11:12:36 2018
  write: io=4005.9MB, bw=68366KB/s, iops=17091, runt= 60001msec
slat (usec): min=5, max=252, avg= 5.76, stdev= 0.66
clat (usec): min=6, max=949, avg=51.72, stdev= 9.54
 lat (usec): min=54, max=955, avg=57.48, stdev= 9.56

However, one of the other machines (with identical SSDs) now performs poorly 
compared to the others, with these results:

fio: (groupid=0, jobs=1): err= 0: pid=3893600: Mon Sep 10 11:15:17 2018
  write: io=1258.8MB, bw=51801KB/s, iops=12950, runt= 24883msec
slat (usec): min=5, max=259, avg= 6.17, stdev= 0.78
clat (usec): min=53, max=857, avg=69.77, stdev=13.11
 lat (usec): min=70, max=863, avg=75.93, stdev=13.17

I'll first resolve the slower machine before doing more testing, as it surely 
isn't helping overall performance.
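
For anyone wanting to reproduce the comparison: this is presumably a 4k
sync-write fio run close to the one in the benchmark paper, i.e. roughly the
following (the device name is an assumption, and the test is destructive to
whatever is on that device):

fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=ssdtest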


> --
> Cheers,
> Alwin

Thanks!,
Menno
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados performance inconsistencies, lower than expected performance

2018-09-10 Thread Menno Zonneveld
I filled up the cluster by accident by not supplying --no-cleanup to the write 
benchmark; I'm sure there must be a better way to do that though.
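
(Leftover rados bench objects can usually be removed afterwards with something
like "rados -p rbdbench cleanup"; the pool name is assumed from the runs below.)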

I've run the tests again, first when the cluster is 'empty' (I have a few test 
VMs stored on Ceph), and then after letting it fill up again.

Performance goes up from 276.812 to 433.859 MB/sec and average latency goes 
down from 0.231178 s to 0.147433 s.

I do have to mention I found a problem with the cluster thanks to Alwin's 
suggestion to (re)do the fio benchmarks: one server with identical SSDs is 
performing poorly compared to the others. I'll resolve this first before 
continuing with other benchmarks.

When empty:

# ceph df

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    3784G     2488G        1295G         34.24
POOLS:
    NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
    ssd          1      431G     37.33          723G      110984
    rbdbench     76        0         0          723G           0

# rados bench -p rbdbench 180 write -b 4M -t 16 --no-cleanup

Total time run: 180.223580
Total writes made:  12472
Write size: 4194304
Object size:    4194304
Bandwidth (MB/sec): 276.812
Stddev Bandwidth:   66.2295
Max bandwidth (MB/sec): 524
Min bandwidth (MB/sec): 112
Average IOPS:   69
Stddev IOPS:    16
Max IOPS:   131
Min IOPS:   28
Average Latency(s): 0.231178
Stddev Latency(s):  0.19153
Max latency(s): 1.16432
Min latency(s): 0.022585

And after a few benchmarks, when I hit Ceph's near-full warning:

# ceph df

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    3784G      751G        3032G         80.13
POOLS:
    NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
    ssd          1      431G     82.93        90858M      110984
    rbdbench     76      579G     86.73        90858M      148467

# rados bench -p rbdbench 180 write -b 4M -t 16 --no-cleanup

Total time run: 180.233495
Total writes made:  19549
Write size: 4194304
Object size:    4194304
Bandwidth (MB/sec): 433.859
Stddev Bandwidth:   73.0601
Max bandwidth (MB/sec): 584
Min bandwidth (MB/sec): 220
Average IOPS:   108
Stddev IOPS:    18
Max IOPS:   146
Min IOPS:   55
Average Latency(s): 0.147433
Stddev Latency(s):  0.103518
Max latency(s): 1.08162
Min latency(s): 0.0218688


-Original message-
> From:Marc Roos 
> Sent: Thursday 6th September 2018 17:15
> To: ceph-users ; Menno Zonneveld 
> Subject: RE: [ceph-users] Rados performance inconsistencies, lower than 
> expected performance
> 
> 
> It is idle, testing still, running backups at night on it.
> How do you fill up the cluster so you can test between empty and full? 
> Do you have a "ceph df" from empty and full? 
> 
> I have done another test disabling new scrubs on the rbd.ssd pool (but 
> still 3 on hdd) with:
> ceph tell osd.* injectargs --osd_max_backfills=0
> Again getting slower towards the end.
> Bandwidth (MB/sec): 395.749
> Average Latency(s): 0.161713
> 
> 
> -Original Message-
> From: Menno Zonneveld [mailto:me...@1afa.com] 
> Sent: donderdag 6 september 2018 16:56
> To: Marc Roos; ceph-users
> Subject: RE: [ceph-users] Rados performance inconsistencies, lower than 
> expected performance
> 
> The benchmark does fluctuate quite a bit; that's why I run it for 180 
> seconds now, as then I do get consistent results.
> 
> Your performance seems on par with what I'm getting with 3 nodes and 9 
> OSD's, not sure what to make of that.
> 
> Are your machines actively used perhaps? Mine are mostly idle as it's 
> still a test setup.
> 
> -Original message-
> > From:Marc Roos 
> > Sent: Thursday 6th September 2018 16:23
> > To: ceph-users ; Menno Zonneveld 
> > 
> > Subject: RE: [ceph-users] Rados performance inconsistencies, lower 
> > than expected performance
> > 
> > 
> > 
> > I am on 4 nodes, mostly hdds, with 4x Samsung SM863 480GB, 2x E5-2660, 2x
> > LSI SAS2308, and 1x dual-port 10Gbit NIC (one port used, shared between
> > cluster/client vlans).
> > 
> > I have 5 PGs scrubbing, but I am not sure if any are on the ssd
> > pool. I am noticing a drop in performance at the end of the test.
> > Maybe some caching on the ssd?
> > 
> > rados bench -p rbd.ssd 60 write -b 4M -t 16
> > Bandwidth (MB/sec): 448.465
> > Average Latency(s): 0.142671
> > 
> > rados bench -p rbd.ssd 180 write -b 4M -t 16
> > Bandwidth (MB/sec): 381.998
> > Average Latency(s): 0.167524
> > 
> > 
> > -Original Message-
> > From: Menno Zonneveld [mailto:me...@1afa.com]
> > Sent: donderdag 6 september 2018 15:52
> > To: Marc Roos; ceph-users
> > Subject: RE: [ceph-users] Rados performance inconsistencies, lower 
> > than expected performance
> > 
> > ah yes, 3x replicated with minimal 2.
> > 
> > 
> > my ceph.conf is pretty bare, just in case it might be relevant
> > 
> > [global]
>

[ceph-users] Force unmap of RBD image

2018-09-10 Thread Martin Palma
We are trying to unmap an rbd image from a host for deletion and
hitting the following error:

rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

We used commands like "lsof" and "fuser" but nothing is reported to
use the device. Also checked for watcher with "rados -p pool
listwatchers image.rbd" but there aren't any listed.

By investigating `/sys/kernel/debug/ceph//osdc` we get:

160460241osd15019.b2af34image.rbd
231954'1271503593144320watch

Our goal is to unmap the image for deletion so if the unmap process
should destroy the image is for us OK.

Any help/suggestions?

Best,
Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Janne Johansson
On Mon, 10 Sep 2018 at 08:10, Kevin Hrpcek  wrote:

> Update for the list archive.
>
> I went ahead and finished the mimic upgrade with the osds in a fluctuating
> state of up and down. The cluster did start to normalize a lot easier after
> everything was on mimic since the random mass OSD heartbeat failures
> stopped and the constant mon election problem went away. I'm still battling
> with the cluster reacting poorly to host reboots or small map changes, but
> I feel like my current pg:osd ratio may be playing a factor in that since
> we are 2x normal pg count while migrating data to new EC pools.
>

We found a setting that helped us when we had constant re-elections, though
ours were far more frequent, and not related in the least to Mimic; bumping
the time between elections allowed our cluster to at least start. It voted,
decided on a master, the master started (re)playing transactions, got so
busy that the others called for a new election, the same mon won again,
restarted the job, and the cycle repeated. Bumping the setting to 30s instead
of the default (5?) allowed the mon to finish working through its to-do list
and start replying to heartbeats as expected, and then things went more
smoothly from there.

mon_lease = 30 for future reference.
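
In ceph.conf terms that is something like the fragment below; whether it can
be changed at runtime via injectargs without a mon restart is an assumption I
have not verified:

[mon]
mon lease = 30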


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com