Re: [ceph-users] What does the differences in osd benchmarks mean?

2019-06-27 Thread Lars Täuber
Hi Nathan,

Yes, the osd hosts are dual-socket machines. But does this make such a difference?

osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at  68 MiB/sec 17 
IOPS
osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec 36 
IOPS

Doubling the IOPS?
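
If NUMA is the suspect, a quick way to compare the fast and slow hosts is to
check where the NIC, the HBA and the osd processes actually sit (sysfs paths
assume PCIe devices; a value of -1 means no NUMA affinity is reported):

  lscpu | grep -i numa                          # how many nodes each box has
  cat /sys/class/net/<iface>/device/numa_node   # node of the cluster NIC
  lspci | grep -i -e sas -e raid                # find the HBA's PCI address, then
  cat /sys/bus/pci/devices/0000:<bus:dev.fn>/numa_node
  ps -eo pid,comm,psr | grep ceph-osd           # cpu each osd last ran on (map cpu to node via lscpu)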

Thanks,
Lars

Thu, 27 Jun 2019 11:16:31 -0400
Nathan Fish  ==> Ceph Users  :
> Are these dual-socket machines? Perhaps NUMA is involved?
> 
> On Thu., Jun. 27, 2019, 4:56 a.m. Lars Täuber,  wrote:
> 
> > Hi!
> >
> > In our cluster I ran some benchmarks.
> > The results are always similar but strange to me.
> > I don't know what the results mean.
> > The cluster consists of 7 (nearly) identical hosts for osds. Two of them
> > have one additional hdd.
> > The hdds are of identical type. The ssds for the journal and wal are of
> > identical type. The configuration is identical (ssd-db-lv-size) for each
> > osd.
> > The hosts are connected the same way to the same switches.
> > This nautilus cluster was set up with ceph-ansible 4.0 on debian buster.
> >
> > These are the results of
> > # ceph --format plain tell osd.* bench
> >
> > osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec
> > 17 IOPS
> > osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec
> > 36 IOPS
> > osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec
> > 37 IOPS
> > osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec
> > 21 IOPS
> > osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec
> > 30 IOPS
> > osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec
> > 38 IOPS
> > osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec
> > 17 IOPS
> > osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec
> > 19 IOPS
> > osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec
> > 16 IOPS
> > osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec
> > 27 IOPS
> > osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec
> > 18 IOPS
> > osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec
> > 18 IOPS
> > osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec
> > 19 IOPS
> > osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec
> > 24 IOPS
> > osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec
> > 20 IOPS
> > osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec
> > 14 IOPS
> > osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec
> > 21 IOPS
> > osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec
> > 19 IOPS
> > osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec
> > 12 IOPS
> > osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107
> > MiB/sec 26 IOPS
> > osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec
> > 16 IOPS
> > osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec
> > 14 IOPS
> > osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec
> > 15 IOPS
> > osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec
> > 16 IOPS
> > osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100
> > MiB/sec 24 IOPS
> > osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec
> > 18 IOPS
> > osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137
> > MiB/sec 34 IOPS
> > osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec
> > 22 IOPS
> > osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec
> > 15 IOPS
> > osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101
> > MiB/sec 25 IOPS
> >
> >
> > The different runs differ by ±1 IOPS.
> > Why are the osds 1,2,4,5,9,19,26 faster than the others?
> >
> > Restarting an osd did change the result.
> >
> > Could someone give me a hint where to look further to find the reason?
> >
> > Thanks
> > Lars
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >  


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] details about cloning objects using librados

2019-06-27 Thread Brad Hubbard
On Thu, Jun 27, 2019 at 8:58 PM nokia ceph  wrote:
>
> Hi Team,
>
> We have a requirement to create multiple copies of an object and currently we 
> are handling it in client side to write as separate objects and this causes 
> huge network traffic between client and cluster.
> Is there possibility of cloning an object to multiple copies using librados 
> api?
> Please share the document details if it is feasible.

It may be possible to use an object class to accomplish what you want
to achieve but the more we understand what you are trying to do, the
better the advice we can offer (at the moment your description sounds
like replication which is already part of RADOS as you know).

More on object classes from Cephalocon Barcelona in May this year:
https://www.youtube.com/watch?v=EVrP9MXiiuU
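
For a plain one-to-one copy that stays inside the cluster, librados also
exposes the OSD's copy-from operation, where the OSD holding the destination
pulls the data from the source object directly, so the payload never crosses
the client link. A rough C++ sketch (the exact copy_from overloads differ
between releases -- check include/rados/librados.hpp for yours; pool and
object names are placeholders):

  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init2("client.admin", "ceph", 0);
    cluster.conf_read_file("/etc/ceph/ceph.conf");
    cluster.connect();

    librados::IoCtx io;
    cluster.ioctx_create("mypool", io);

    librados::ObjectWriteOperation op;
    // ask the OSD to copy "src" into "dst" server-side;
    // src_version 0 is used here to mean "whatever is current" (assumption)
    op.copy_from("src", io, 0);
    int r = io.operate("dst", &op);

    io.close();
    cluster.shutdown();
    return r;
  }

You would still need one such op per copy, but only a small op message goes
over the wire for each copy instead of the full object payload.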

>
> Thanks,
> Muthu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS getattr op stuck in snapshot

2019-06-27 Thread Hector Martin
On 12/06/2019 22.33, Yan, Zheng wrote:
> I have tracked down the bug. thank you for reporting this.  'echo 2 >
> /proc/sys/vm/drop_caches' should fix the hang.  If you can compile ceph
> from source, please try following patch.

I managed to get the packages built for Xenial properly, tested them, and
everything seems fine. I deployed the build to production, got rid of the
drop_caches hack, and I've seen no stuck ops for two days so far.

If there is a bug or PR open for this, can you point me to it so I can
track when it goes into a release?

Thanks!

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How does monitor know OSD is dead?

2019-06-27 Thread Bryan Henderson
What does it take for a monitor to consider an OSD down which has been dead as
a doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD was up, when it was
quite dead.  Recently, a couple of OSDs were on machines that failed to boot
up after a power failure.  The rest of the Ceph cluster came up, though, and
reported all OSDs up and in.  I/Os stalled, probably because they were waiting
for the dead OSDs to come back.

I waited 15 minutes, because the manual says if the monitor doesn't hear a
heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
it marks it down.  But it didn't.  I did "osd down" commands for the dead OSDs
and the status changed to down and I/O started working.

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
wait that long before falling back to a redundant OSD?
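
For reference, the settings involved are roughly these (option names and
defaults as of recent releases -- verify with "ceph daemon mon.<id> config show"
on your version):

  mon osd report timeout     = 900   # seconds an OSD may go without reporting to the
                                     # mon before the mon marks it down (the 15 minutes)
  osd heartbeat grace        = 20    # seconds of missed peer heartbeats before an OSD
                                     # reports a neighbour as failed
  mon osd min down reporters = 2     # distinct OSDs that must report a peer down
                                     # before the mon acts on it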

-- 
Bryan Henderson   San Jose, California
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-27 Thread David Turner
I'm still going at 452M incomplete uploads. There are guides online for
manually deleting buckets kinda at the RADOS level that tend to leave data
stranded. That doesn't work for what I'm trying to do so I'll keep going
with this and wait for that PR to come through and hopefully help with
bucket deletion.
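
For anyone landing here later: the usual radosgw-admin invocation for this
kind of cleanup is along these lines (flags as documented for Luminous and
later -- check radosgw-admin --help on your build; the bucket name is a
placeholder):

  radosgw-admin bucket rm --bucket=<bucket-name> --purge-objects --bypass-gc

--purge-objects deletes the objects along with the bucket, and --bypass-gc
skips the garbage collector so space is reclaimed directly, at the cost of a
longer-running command.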

On Thu, Jun 27, 2019 at 2:58 PM Sergei Genchev  wrote:

> @David Turner
> Did your bucket delete ever finish? I am up to 35M incomplete uploads,
> and I doubt that I actually had that many upload attempts. I could be
> wrong though.
> Is there a way to force bucket deletion, even at the cost of not
> cleaning up space?
>
> On Tue, Jun 25, 2019 at 12:29 PM J. Eric Ivancich 
> wrote:
> >
> > On 6/24/19 1:49 PM, David Turner wrote:
> > > It's aborting incomplete multipart uploads that were left around. First
> > > it will clean up the cruft like that and then it should start actually
> > > deleting the objects visible in stats. That's my understanding of it
> > > anyway. I'm in the middle of cleaning up some buckets right now doing
> > > this same thing. I'm up to `WARNING : aborted 108393000 incomplete
> > > multipart uploads`. This bucket had a client uploading to it constantly
> > > with a very bad network connection.
> >
> > There's a PR to better deal with this situation:
> >
> > https://github.com/ceph/ceph/pull/28724
> >
> > Eric
> >
> > --
> > J. Eric Ivancich
> > he/him/his
> > Red Hat Storage
> > Ann Arbor, Michigan, USA
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-27 Thread Sergei Genchev
@David Turner
Did your bucket delete ever finish? I am up to 35M incomplete uploads,
and I doubt that I actually had that many upload attempts. I could be
wrong though.
Is there a way to force bucket deletion, even at the cost of not
cleaning up space?

On Tue, Jun 25, 2019 at 12:29 PM J. Eric Ivancich  wrote:
>
> On 6/24/19 1:49 PM, David Turner wrote:
> > It's aborting incomplete multipart uploads that were left around. First
> > it will clean up the cruft like that and then it should start actually
> > deleting the objects visible in stats. That's my understanding of it
> > anyway. I'm in the middle of cleaning up some buckets right now doing
> > this same thing. I'm up to `WARNING : aborted 108393000 incomplete
> > multipart uploads`. This bucket had a client uploading to it constantly
> > with a very bad network connection.
>
> There's a PR to better deal with this situation:
>
> https://github.com/ceph/ceph/pull/28724
>
> Eric
>
> --
> J. Eric Ivancich
> he/him/his
> Red Hat Storage
> Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MGR Logs after Failure Testing

2019-06-27 Thread DHilsbos
Eugen;

All services are running, yes, though they didn't all start automatically when I 
brought the host up (they were configured not to start, because the last thing I 
had done was physically relocate the entire cluster).

All services are running, and happy.

# ceph status
  cluster:
id: 1a8a1693-fa54-4cb3-89d2-7951d4cee6a3
health: HEALTH_OK

  services:
mon: 3 daemons, quorum S700028,S700029,S700030 (age 20h)
mgr: S700028(active, since 17h), standbys: S700029, S700030
mds: cifs:1 {0=S700029=up:active} 2 up:standby
osd: 6 osds: 6 up (since 21h), 6 in (since 21h)

  data:
pools:   16 pools, 192 pgs
objects: 449 objects, 761 MiB
usage:   724 GiB used, 65 TiB / 66 TiB avail
pgs: 192 active+clean

# ceph osd tree
ID CLASS WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
-1       66.17697 root default
-5       22.05899     host S700029
 2   hdd 11.02950         osd.2         up      1.0     1.0
 3   hdd 11.02950         osd.3         up      1.0     1.0
-7       22.05899     host S700030
 4   hdd 11.02950         osd.4         up      1.0     1.0
 5   hdd 11.02950         osd.5         up      1.0     1.0
-3       22.05899     host s700028
 0   hdd 11.02950         osd.0         up      1.0     1.0
 1   hdd 11.02950         osd.1         up      1.0     1.0

The question about configuring the MDS for failover struck me as a potential 
cause, since I don't remember doing that; however, it looks like S700029 
(10.0.200.111) took over from S700028 (10.0.200.110) as the active MDS.
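
As a hedged next step for the "failed to return metadata" messages, it may be
worth checking what the mon actually has on record for each MDS (daemon names
taken from this cluster):

  ceph mds metadata S700029
  ceph mds metadata S700030

If one of them comes back empty or with ENOENT, restarting that MDS (or the
active mgr) so it re-registers is a plausible, though unverified, fix:

  systemctl restart ceph-mds@S700030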

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen 
Block
Sent: Thursday, June 27, 2019 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] MGR Logs after Failure Testing

Hi,

some more information about the cluster status would be helpful, such as

ceph -s
ceph osd tree

service status of all MONs, MDSs, MGRs.
Are all services up? Did you configure the spare MDS as standby for  
rank 0 so that a failover can happen?

Regards,
Eugen


Zitat von dhils...@performair.com:

> All;
>
> I built a demonstration and testing cluster, just 3 hosts  
> (10.0.200.110, 111, 112).  Each host runs mon, mgr, osd, mds.
>
> During the demonstration yesterday, I pulled the power on one of the hosts.
>
> After bringing the host back up, I'm getting several error messages  
> every second or so:
> 2019-06-26 16:01:56.424 7fcbe0af9700  0 ms_deliver_dispatch:  
> unhandled message 0x55e80a728f00 mgrreport(mds.S700030 +0-0 packed  
> 6) v7 from mds.? v2:10.0.200.112:6808/980053124
> 2019-06-26 16:01:56.425 7fcbf4cd1700  1 mgr finish mon failed to  
> return metadata for mds.S700030: (2) No such file or directory
> 2019-06-26 16:01:56.429 7fcbe0af9700  0 ms_deliver_dispatch:  
> unhandled message 0x55e809f8e600 mgrreport(mds.S700029 +110-0 packed  
> 1366) v7 from mds.0 v2:10.0.200.111:6808/2726495738
> 2019-06-26 16:01:56.430 7fcbf4cd1700  1 mgr finish mon failed to  
> return metadata for mds.S700029: (2) No such file or directory
>
> Thoughts?
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs incomplete

2019-06-27 Thread Alfredo Deza
On Thu, Jun 27, 2019 at 10:36 AM ☣Adam  wrote:

> Well that caused some excitement (either that or the small power
> disruption did)!  One of my OSDs is now down because it keeps crashing
> due to a failed assert (stacktraces attached, also I'm apparently
> running mimic, not luminous).
>
> In the past a failed assert on an OSD has meant removing the disk,
> wiping it, re-adding it as a new one, and then have ceph rebuild it from
> other copies of the data.
>
> I did this all manually in the past, but I'm trying to get more familiar
> with ceph's commands.  Will the following commands do the same?
>
> ceph-volume lvm zap --destroy --osd-id 11
> # Presumably that has to be run from the node with OSD 11, not just
> # any ceph node?
> # Source: http://docs.ceph.com/docs/mimic/ceph-volume/lvm/zap


That looks correct, and yes, you would need to run on the node with OSD 11.


>
> Do I need to remove the OSD (ceph osd out 11; wait for stabilization;
> ceph osd purge 11) before I do this and run and "ceph-deploy osd create"
> afterwards?
>

I think that what you need is essentially the same as the guide for
migrating from filestore to bluestore:

http://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/
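
For completeness, a rough sketch of the usual drive-replacement sequence
(double-check against the docs above before running; the device name and
hostname in the last step are placeholders):

  ceph osd out 11                             # drain it and let the cluster rebalance
  systemctl stop ceph-osd@11                  # on the node hosting osd.11
  ceph osd purge 11 --yes-i-really-mean-it    # removes it from crush, the osd map and auth
  ceph-volume lvm zap --destroy --osd-id 11   # wipe the backing device/LV
  ceph-deploy osd create --data /dev/sdX <hostname>   # recreate it from the admin node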


> Thanks,
> Adam
>
>
> On 6/26/19 6:35 AM, Paul Emmerich wrote:
> > Have you tried: ceph osd force-create-pg ?
> >
> > If that doesn't work: use objectstore-tool on the OSD (while it's not
> > running) and use it to force mark the PG as complete. (Don't know the
> > exact command off the top of my head)
> >
> > Caution: these are obviously really dangerous commands
> >
> >
> >
> > Paul
> >
> >
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io 
> > Tel: +49 89 1896585 90
> >
> >
> > On Wed, Jun 26, 2019 at 1:56 AM ☣Adam  > > wrote:
> >
> > How can I tell ceph to give up on "incomplete" PGs?
> >
> > I have 12 pgs which are "inactive, incomplete" that won't recover.  I
> > think this is because in the past I have carelessly pulled disks too
> > quickly without letting the system recover.  I suspect the disks that
> > have the data for these are long gone.
> >
> > Whatever the reason, I want to fix it so I have a clean cluser even
> if
> > that means losing data.
> >
> > I went through the "troubleshooting pgs" guide[1] which is excellent,
> > but didn't get me to a fix.
> >
> > The output of `ceph pg 2.0 query` includes this:
> > "recovery_state": [
> > {
> > "name": "Started/Primary/Peering/Incomplete",
> > "enter_time": "2019-06-25 18:35:20.306634",
> > "comment": "not enough complete instances of this PG"
> > },
> >
> > I've already restated all OSDs in various orders, and I changed
> min_size
> > to 1 to see if that would allow them to get fixed, but no such luck.
> > These pools are not erasure coded and I'm using the Luminous release.
> >
> > How can I tell ceph to give up on these PGs?  There's nothing
> identified
> > as unfound, so mark_unfound_lost doesn't help.  I feel like `ceph osd
> > lost` might be it, but at this point the OSD numbers have been reused
> > for new disks, so I'd really like to limit the damage to the 12 PGs
> > which are incomplete if possible.
> >
> > Thanks,
> > Adam
> >
> > [1]
> >
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MGR Logs after Failure Testing

2019-06-27 Thread Eugen Block

Hi,

some more information about the cluster status would be helpful, such as

ceph -s
ceph osd tree

service status of all MONs, MDSs, MGRs.
Are all services up? Did you configure the spare MDS as standby for  
rank 0 so that a failover can happen?


Regards,
Eugen


Zitat von dhils...@performair.com:


All;

I built a demonstration and testing cluster, just 3 hosts  
(10.0.200.110, 111, 112).  Each host runs mon, mgr, osd, mds.


During the demonstration yesterday, I pulled the power on one of the hosts.

After bringing the host back up, I'm getting several error messages  
every second or so:
2019-06-26 16:01:56.424 7fcbe0af9700  0 ms_deliver_dispatch:  
unhandled message 0x55e80a728f00 mgrreport(mds.S700030 +0-0 packed  
6) v7 from mds.? v2:10.0.200.112:6808/980053124
2019-06-26 16:01:56.425 7fcbf4cd1700  1 mgr finish mon failed to  
return metadata for mds.S700030: (2) No such file or directory
2019-06-26 16:01:56.429 7fcbe0af9700  0 ms_deliver_dispatch:  
unhandled message 0x55e809f8e600 mgrreport(mds.S700029 +110-0 packed  
1366) v7 from mds.0 v2:10.0.200.111:6808/2726495738
2019-06-26 16:01:56.430 7fcbf4cd1700  1 mgr finish mon failed to  
return metadata for mds.S700029: (2) No such file or directory


Thoughts?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What does the differences in osd benchmarks mean?

2019-06-27 Thread Nathan Fish
Are these dual-socket machines? Perhaps NUMA is involved?

On Thu., Jun. 27, 2019, 4:56 a.m. Lars Täuber,  wrote:

> Hi!
>
> In our cluster I ran some benchmarks.
> The results are always similar but strange to me.
> I don't know what the results mean.
> The cluster consists of 7 (nearly) identical hosts for osds. Two of them
> have one additional hdd.
> The hdds are of identical type. The ssds for the journal and wal are of
> identical type. The configuration is identical (ssd-db-lv-size) for each
> osd.
> The hosts are connected the same way to the same switches.
> This nautilus cluster was set up with ceph-ansible 4.0 on debian buster.
>
> These are the results of
> # ceph --format plain tell osd.* bench
>
> osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec
> 17 IOPS
> osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec
> 36 IOPS
> osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec
> 37 IOPS
> osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec
> 21 IOPS
> osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec
> 30 IOPS
> osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec
> 38 IOPS
> osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec
> 17 IOPS
> osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec
> 19 IOPS
> osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec
> 16 IOPS
> osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec
> 27 IOPS
> osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec
> 18 IOPS
> osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec
> 18 IOPS
> osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec
> 19 IOPS
> osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec
> 24 IOPS
> osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec
> 20 IOPS
> osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec
> 14 IOPS
> osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec
> 21 IOPS
> osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec
> 19 IOPS
> osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec
> 12 IOPS
> osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107
> MiB/sec 26 IOPS
> osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec
> 16 IOPS
> osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec
> 14 IOPS
> osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec
> 15 IOPS
> osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec
> 16 IOPS
> osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100
> MiB/sec 24 IOPS
> osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec
> 18 IOPS
> osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137
> MiB/sec 34 IOPS
> osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec
> 22 IOPS
> osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec
> 15 IOPS
> osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101
> MiB/sec 25 IOPS
>
>
> The different runs differ by ±1 IOPS.
> Why are the osds 1,2,4,5,9,19,26 faster than the others?
>
> Restarting an osd did change the result.
>
> Could someone give me a hint where to look further to find the reason?
>
> Thanks
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MGR Logs after Failure Testing

2019-06-27 Thread DHilsbos
All;

I built a demonstration and testing cluster, just 3 hosts (10.0.200.110, 111, 
112).  Each host runs mon, mgr, osd, mds.

During the demonstration yesterday, I pulled the power on one of the hosts.

After bringing the host back up, I'm getting several error messages every 
second or so:
2019-06-26 16:01:56.424 7fcbe0af9700  0 ms_deliver_dispatch: unhandled message 
0x55e80a728f00 mgrreport(mds.S700030 +0-0 packed 6) v7 from mds.? 
v2:10.0.200.112:6808/980053124
2019-06-26 16:01:56.425 7fcbf4cd1700  1 mgr finish mon failed to return 
metadata for mds.S700030: (2) No such file or directory
2019-06-26 16:01:56.429 7fcbe0af9700  0 ms_deliver_dispatch: unhandled message 
0x55e809f8e600 mgrreport(mds.S700029 +110-0 packed 1366) v7 from mds.0 
v2:10.0.200.111:6808/2726495738
2019-06-26 16:01:56.430 7fcbf4cd1700  1 mgr finish mon failed to return 
metadata for mds.S700029: (2) No such file or directory

Thoughts?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs incomplete

2019-06-27 Thread ☣Adam
Well that caused some excitement (either that or the small power
disruption did)!  One of my OSDs is now down because it keeps crashing
due to a failed assert (stacktraces attached, also I'm apparently
running mimic, not luminous).

In the past a failed assert on an OSD has meant removing the disk,
wiping it, re-adding it as a new one, and then have ceph rebuild it from
other copies of the data.

I did this all manually in the past, but I'm trying to get more familiar
with ceph's commands.  Will the following commands do the same?

ceph-volume lvm zap --destroy --osd-id 11
# Presumably that has to be run from the node with OSD 11, not just
# any ceph node?
# Source: http://docs.ceph.com/docs/mimic/ceph-volume/lvm/zap

Do I need to remove the OSD (ceph osd out 11; wait for stabilization;
ceph osd purge 11) before I do this and run and "ceph-deploy osd create"
afterwards?

Thanks,
Adam
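
For reference, the objectstore-tool invocation Paul alludes to below is
roughly the following (hedged -- verify the op name with
ceph-objectstore-tool --help on your release, run it only with the OSD
stopped, and only against the OSD holding the most complete copy of the PG):

  systemctl stop ceph-osd@<id>
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid 2.0 --op mark-complete
  systemctl start ceph-osd@<id>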


On 6/26/19 6:35 AM, Paul Emmerich wrote:
> Have you tried: ceph osd force-create-pg ?
> 
> If that doesn't work: use objectstore-tool on the OSD (while it's not
> running) and use it to force mark the PG as complete. (Don't know the
> exact command off the top of my head)
> 
> Caution: these are obviously really dangerous commands
> 
> 
> 
> Paul
> 
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90
> 
> 
> On Wed, Jun 26, 2019 at 1:56 AM ☣Adam  > wrote:
> 
> How can I tell ceph to give up on "incomplete" PGs?
> 
> I have 12 pgs which are "inactive, incomplete" that won't recover.  I
> think this is because in the past I have carelessly pulled disks too
> quickly without letting the system recover.  I suspect the disks that
> have the data for these are long gone.
> 
> Whatever the reason, I want to fix it so I have a clean cluser even if
> that means losing data.
> 
> I went through the "troubleshooting pgs" guide[1] which is excellent,
> but didn't get me to a fix.
> 
> The output of `ceph pg 2.0 query` includes this:
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2019-06-25 18:35:20.306634",
>             "comment": "not enough complete instances of this PG"
>         },
> 
> I've already restated all OSDs in various orders, and I changed min_size
> to 1 to see if that would allow them to get fixed, but no such luck.
> These pools are not erasure coded and I'm using the Luminous release.
> 
> How can I tell ceph to give up on these PGs?  There's nothing identified
> as unfound, so mark_unfound_lost doesn't help.  I feel like `ceph osd
> lost` might be it, but at this point the OSD numbers have been reused
> for new disks, so I'd really like to limit the damage to the 12 PGs
> which are incomplete if possible.
> 
> Thanks,
> Adam
> 
> [1]
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14e) [0x7f7372987b5e]
 2: (()+0x2c4cb7) [0x7f7372987cb7]
 3: (PG::check_past_interval_bounds() const+0xae5) [0x564b8db12f05]
 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1bb) [0x564b8db43f5b]
 5: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x200) [0x564b8db92430]
 6: (boost::statechart::state_machine, 
boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
 const&)+0x4b) [0x564b8db65a4b]
 7: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector >&, int, 
std::vector >&, int, PG::RecoveryCtx*)+0x213) 
[0x564b8db27ca3]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*)+0x2b4) [0x564b8da92fa4]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr, 
ThreadPool::TPHandle&)+0xb4) [0x564b8da93704]
 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x52) [0x564b8dcee862]
 11: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x926) [0x564b8daa0c26]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d6) 
[0x7f737298c666]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f737298dce0]
 14: (()+0x76db) [0x7f73710296db]
 15: (clone()+0x3f) [0x7f736fff288f]

 -1143> 2019-06-26 08:56:54.398 7f73529f5700 -1 *** Caught signal (Aborted) **
 in thread 7f73529f5700 thread_name:tp_osd_tp

 ceph version 13.2.6 (7b695f835b

[ceph-users] osd-mon failed with "failed to write to db"

2019-06-27 Thread Anton Aleksandrov

Hello community,

we have deployed a cluster on the latest mimic release. We are on quite old 
hardware, but using CentOS 7. The monitor and manager run on the same host. 
The cluster had been running for some weeks without an actual workload. There 
might have been some sort of power failure (not proven), but at some point the 
monitor node died and won't start anymore. Below is a log from 
/var/log/messages. What can be done here? Can this be recovered somehow, 
or did we lose everything? All the OSDs seem to be running fine, it is just 
that the cluster is not working.


The log is not complete, but I think these lines are the critical ones:

Jun 27 17:14:06 mds1 ceph-mon: -311> 2019-06-27 17:14:06.169 
7f086aa22700 -1 *rocksdb: submit_common error: Corruption: block 
checksum mismatch*: expected 3317957558, got 2609532897  in 
/var/lib/ceph/mon/ceph-mds1/store.db/022334.sst offset 12775887 size 
21652 code = 2 Rocksdb transaction:
Jun 27 17:14:06 mds1 ceph-mon: Put( Prefix = p key = 
'xos'0x006c6173't_committed' Value size = 8)
Jun 27 17:14:06 mds1 ceph-mon: Put( Prefix = m key = 
'nitor_store'0x006c6173't_metadata' Value size = 612)
Jun 27 17:14:06 mds1 ceph-mon: Put( Prefix = l key = 
'gm'0x0066756c'l_155850' Value size = 31307)
Jun 27 17:14:06 mds1 ceph-mon: Put( Prefix = l key = 
'gm'0x0066756c'l_latest' Value size = 8)
Jun 27 17:14:06 mds1 ceph-mon: Put( Prefix = l key = 'gm'0x00313535'851' 
Value size = 672)
Jun 27 17:14:06 mds1 ceph-mon: Put( Prefix = l key = 
'gm'0x006c6173't_committed' Value size = 8)
Jun 27 17:14:06 mds1 ceph-mon: -311> 2019-06-27 17:14:06.172 
7f086aa22700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE
_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mon/MonitorDBStore.h: 
In function
 'int 
MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' 
thread 7f086aa22700 time 2019-06-27 17:14:06.171474
Jun 27 17:14:06 mds1 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/cento
s7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mon/MonitorDBStore.h: 
311: FAILED assert(0 ==*"failed to write to db"*)
Jun 27 17:14:06 mds1 ceph-mon: ceph version 13.2.6 
(7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
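
If this is the only monitor, the documented last-resort path is to rebuild the
mon store from the OSDs; very roughly, and with the "recovery using OSDs"
section of the mimic troubleshooting-mon docs as the authoritative reference
(paths are placeholders, and each OSD must be stopped while the tool runs):

  # run for every OSD on every host, accumulating into one store
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --op update-mon-db --mon-store-path /tmp/mon-store
  # then rebuild the monitor database from the collected maps
  ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring
  # back up the broken store.db under /var/lib/ceph/mon/ceph-mds1/ and replace
  # it with the rebuilt one before starting the mon again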


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-volume ignores cluster name from ceph.conf

2019-06-27 Thread Alfredo Deza
Although ceph-volume makes a best effort to support custom cluster
names, the Ceph project no longer supports custom cluster names,
even though you can still see settings/options that will allow you to
set one.

For reference see: https://bugzilla.redhat.com/show_bug.cgi?id=1459861

On Thu, Jun 27, 2019 at 7:59 AM Stolte, Felix  wrote:
>
> Hi folks,
>
> I have a nautilus 14.2.1 cluster with a non-default cluster name (ceph_stag 
> instead of ceph). I set “cluster = ceph_stag” in /etc/ceph/ceph_stag.conf.
>
> ceph-volume is using the correct config file but does not use the specified 
> clustername. Did I hit a bug or do I need to define the clustername elsewere?
>
> Regards
> Felix
> IT-Services
> Telefon 02461 61-9243
> E-Mail: f.sto...@fz-juelich.de
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> -
> -
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-volume ignores cluster name from ceph.conf

2019-06-27 Thread Stolte, Felix
Hi folks,

I have a nautilus 14.2.1 cluster with a non-default cluster name (ceph_stag 
instead of ceph). I set “cluster = ceph_stag” in /etc/ceph/ceph_stag.conf.

ceph-volume is using the correct config file but does not use the specified 
clustername. Did I hit a bug or do I need to define the clustername elsewere?

Regards
Felix
IT-Services
Telefon 02461 61-9243
E-Mail: f.sto...@fz-juelich.de
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-
-
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] details about cloning objects using librados

2019-06-27 Thread nokia ceph
Hi Team,

We have a requirement to create multiple copies of an object and currently
we are handling it in client side to write as separate objects and this
causes huge network traffic between client and cluster.
Is there possibility of cloning an object to multiple copies using librados
api?
Please share the document details if it is feasible.

Thanks,
Muthu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph zabbix monitoring

2019-06-27 Thread Nathan Harper
Have you configured any encryption on your Zabbix infrastructure?   We took
a brief look at ceph+Zabbix a while ago, and the exporter didn't have the
capability to use encryption.   I don't know if it's changed in the
meantime though.

On Thu, 27 Jun 2019 at 09:43, Majid Varzideh  wrote:

> Hi friends
> i have installed ceph mimic with zabbix 3.0. i configured everything to
> monitor my cluster with zabbix and i could get data from zabbix frontend.
> but in ceph -s command it says Failed to send data to Zabbix.
> why this happen?
> my ceph version :ceph version 13.2.6
> (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
> zabbix 3.0.14
> thanks,
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
*Nathan Harper* // IT Systems Lead

*e: *nathan.har...@cfms.org.uk   *t*: 0117 906 1104  *m*:  0787 551 0891
*w: *www.cfms.org.uk
CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent // Emersons
Green // Bristol // BS16 7FR

CFMS Services Ltd is registered in England and Wales No 05742022 - a
subsidiary of CFMS Ltd
CFMS Services Ltd registered office // 43 Queens Square // Bristol // BS1
4QP
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph zabbix monitoring

2019-06-27 Thread Majid Varzideh
Hi friends,
I have installed Ceph mimic with Zabbix 3.0. I configured everything to
monitor my cluster with Zabbix and I could get data from the Zabbix frontend,
but the ceph -s command says "Failed to send data to Zabbix".
Why does this happen?
My Ceph version: ceph version 13.2.6
(7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
Zabbix: 3.0.14
Thanks,
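
A couple of hedged checks for the "Failed to send data to Zabbix" message
(these assume the mgr zabbix module commands shipped with mimic and a working
zabbix_sender binary on the active mgr host):

  ceph zabbix config-show     # verify zabbix_host, zabbix_port and the identifier
  ceph zabbix send            # trigger a send by hand and watch the mgr log
  # test the same path outside of ceph (the key is arbitrary; -vv shows whether
  # the server accepts the connection at all):
  zabbix_sender -vv -z <zabbix-server> -p 10051 -s <identifier> -k test.key -o 1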
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What does the differences in osd benchmarks mean?

2019-06-27 Thread Lars Täuber
Hi!

In our cluster I ran some benchmarks.
The results are always similar but strange to me.
I don't know what the results mean.
The cluster consists of 7 (nearly) identical hosts for osds. Two of them have 
one additional hdd.
The hdds are of identical type. The ssds for the journal and wal are of 
identical type. The configuration is identical (ssd-db-lv-size) for each osd.
The hosts are connected the same way to the same switches.
This nautilus cluster was set up with ceph-ansible 4.0 on debian buster.

These are the results of
# ceph --format plain tell osd.* bench

osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec 17 
IOPS
osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec 36 
IOPS
osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec 37 
IOPS
osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec 21 
IOPS
osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec 30 
IOPS
osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec 38 
IOPS
osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec 17 
IOPS
osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec 19 
IOPS
osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec 16 
IOPS
osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec 27 
IOPS
osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec 18 
IOPS
osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec 18 
IOPS
osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec 19 
IOPS
osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec 24 
IOPS
osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec 20 
IOPS
osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec 14 
IOPS
osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec 21 
IOPS
osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec 19 
IOPS
osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec 12 
IOPS
osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107 MiB/sec 26 
IOPS
osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec 16 
IOPS
osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec 14 
IOPS
osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec 15 IOPS
osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec 16 
IOPS
osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100 MiB/sec 24 
IOPS
osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec 18 
IOPS
osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137 MiB/sec 34 
IOPS
osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec 22 
IOPS
osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec 15 
IOPS
osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101 MiB/sec 25 
IOPS


The different runs differ by ±1 IOPS.
Why are the osds 1,2,4,5,9,19,26 faster than the others?

Restarting an osd did change the result.

Could someone give me a hint where to look further to find the reason?

Thanks
Lars
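
For anyone reproducing this: the bench command also accepts explicit sizes
(total bytes, then block size), which makes it easy to repeat the test with
different block sizes -- syntax as per "ceph tell osd.N bench" in
mimic/nautilus:

  ceph tell osd.0 bench 1073741824 4194304   # 1 GiB in 4 MiB blocks (the defaults above)
  ceph tell osd.0 bench 1073741824 65536     # same total volume in 64 KiB blocks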
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph balancer - Some osds belong to multiple subtrees

2019-06-27 Thread Wolfgang Lendl

Thx Paul - I suspect these shadow trees are causing this misbehaviour.
I have a second luminous cluster where these balancer settings work as expected 
- that working one has hdd+ssd osds.

I cannot use the upmap balancer because of some jewel krbd clients - at least 
they are being reported as jewel clients:

"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 1
},
"group": {
"features": "0x27018fb86aa42ada",
"release": "jewel",
"num": 3
},

is there a good way to decode the "features" value?
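
One practical probe (available since luminous): the mon refuses to raise the
client requirement if connected clients do not advertise the needed feature
bits, and it lists the offenders, so asking it is a safe way to see whether
those krbd sessions actually block upmap:

  ceph osd set-require-min-compat-client luminous

It can be forced with --yes-i-really-mean-it, but kernel clients tend to
advertise an older release than they strictly need for upmap, so only force it
after checking the kernel versions out of band.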

wolfgang


Am 26.06.2019 um 13:29 schrieb Paul Emmerich:
Device classes are implemented with magic invisible crush trees; 
you've got two completely independent trees internally: one for crush 
rules mapping to HDDs, one for legacy crush rules not specifying a 
device class.


The balancer *should* be aware of this and ignore it, but I'm not sure 
about the state of the balancer on Luminous. There were quite a few 
problems in older versions, lots of them have been fixed in backports.


The upmap balancer is much better than the crush-compat balancer, but 
it requires all clients to run Luminous or later.



Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io 
Tel: +49 89 1896585 90


On Wed, Jun 26, 2019 at 10:21 AM Wolfgang Lendl 
> wrote:


Hi,

tried to enable the ceph balancer on a 12.2.12 cluster and got this:

mgr[balancer] Some osds belong to multiple subtrees: [0, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 
87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 
121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 
137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 
153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 
185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 
201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 
217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 
233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 
249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 
265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 
281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 
297, 298, 299, 300, 301, 302, 303, 304, 305]

I'm not aware of any additional subtree - maybe someone can enlighten me:

ceph balancer status
{
 "active": true,
 "plans": [],
 "mode": "crush-compat"
}

ceph osd crush tree
ID  CLASS WEIGHT     (compat)   TYPE NAME
 -1       3176.04785             root default
 -7        316.52490  316.52490      host node0
  0   hdd    9.09560    9.09560          osd.0
  4   hdd    9.09560    9.09560          osd.4
  8   hdd    9.09560    9.09560          osd.8
 10   hdd    9.09560    9.09560          osd.10
 12   hdd    9.09560    9.09560          osd.12
 16   hdd    9.09560    9.09560          osd.16
 20   hdd    9.09560    9.09560          osd.20
 21   hdd    9.09560    9.09560          osd.21
 26   hdd    9.09560    9.09560          osd.26
 29   hdd    9.09560    9.09560          osd.29
 31   hdd    9.09560    9.09560          osd.31
 35   hdd    9.09560    9.09560          osd.35
 37   hdd    9.09560    9.09560          osd.37
 44   hdd    9.09560    9.09560          osd.44
 47   hdd    9.09560    9.09560          osd.47
 56   hdd    9.09560    9.09560          osd.56
 59   hdd    9.09560    9.09560          osd.59
 65   hdd    9.09560    9.09560          osd.65
 71   hdd    9.09560    9.09560          osd.71
 77   hdd    9.09560    9.09560          osd.77
 80   hdd    9.09560    9.09560          osd.80
 83   hdd    9.09569    9.09569          osd.83
 86   hdd    9.09560    9.09560          osd.86
 88   hdd    9.09560    9.09560          osd.88
 94   hdd   10.91409   10.91409          osd.94
 95   hdd   10.91409   10.91409          osd.95
 98   hdd   10.91409   10.91409          osd.98
 99   hdd   10.91409   10.91409          osd.99
238   hdd    9.09569    9.09569          osd.238
239   hdd    9.09569    9.