Like this?

# ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version
{"version":"0.82"}
# ceph --admin-daemon /var/run/ceph/ceph-mon.jack.asok version
{"version":"0.82"}
# ceph --admin-daemon /var/run/ceph/ceph-mon.joe.asok version
{"version":"0.82"}

Pierre

On 03/07/2014 01:17, Samuel Just wrote:
Can you confirm from the admin socket that all monitors are running
the same version?
-Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU
<[email protected]> wrote:
On 03/07/2014 00:55, Samuel Just wrote:

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the upgrade to
firefly the cluster was in the state "HEALTH_WARN crush map has legacy
tunables" and I saw "feature set mismatch" in the logs.

So, if I remember correctly, I ran "ceph osd crush tunables optimal" to fix
the "crush map" warning, and I upgraded my client and server kernels to
3.16rc.

Could that be it?
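For reference, a sketch of inspecting which tunables are currently in effect (hedged: "ceph osd crush show-tunables" exists in firefly-era releases, but its output fields vary by version):

```shell
# Summarise the CRUSH tunables the cluster is currently using
ceph osd crush show-tunables

# Or decompile the live CRUSH map and look at the "tunable" lines,
# as was done with the per-OSD maps above
ceph osd getcrushmap -o /tmp/crush.bin
crushtool -d /tmp/crush.bin | grep '^tunable'
```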

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <[email protected]> wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
<[email protected]> wrote:

The files are attached.

When I upgraded:
   ceph-deploy install --stable firefly servers...
   on each server: service ceph restart mon
   on each server: service ceph restart osd
   on each server: service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace, etc., I
still had some PGs stuck in the peering state.

I thought, why not try version 0.82? It might solve my problem (that was my
mistake). So I upgraded from firefly to 0.82 with:
   ceph-deploy install --testing servers...
   ..

Now, all daemons are running version 0.82.
I have 3 mons, 36 OSDs and 3 MDSes.

Pierre

PS: I also found "inc\uosdmap.13258__0_469271DE__none" in each meta
directory.

On 03/07/2014 00:10, Samuel Just wrote:

Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <[email protected]>
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).
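A minimal sketch of that find, assuming the default OSD data path /var/lib/ceph/osd/ceph-$id (adjust for your deployment):

```shell
# Locate the epoch-13258 osdmap files under current/meta on each OSD;
# they are hashed into subdirectories, hence the recursive search
for id in 20 23; do
    find /var/lib/ceph/osd/ceph-$id/current/meta -name 'osdmap.13258*'
done
```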

What version of ceph is running on your mons?  How many mons do you
have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
<[email protected]> wrote:


Hi,

I did; the log files are available here:
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD log files are really big, +/- 80M each.

After starting osd.20, some other OSDs crashed: I went from 31 OSDs up to
16. I noticed that afterwards the number of down+peering PGs decreased from
367 to 248. Is that "normal"? Maybe it's temporary, while the cluster
verifies all the PGs?

Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:

You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.
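The resulting fragment would look like this (the [osd] section header is as described above; the three values are as given):

```
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```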

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
<[email protected]> wrote:



Yes, but how do I do that?

With a command like this?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

Or by modifying /etc/ceph/ceph.conf? That file is very minimal in my setup
because I use udev detection.

Once I have made these changes, do you want all three log files, or only
osd.20's?
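If injectargs is the route taken, a sketch of applying the same settings to several running OSDs at once (the OSD IDs below are examples):

```shell
# Inject the debug settings into a list of running OSDs without a restart
for id in 20 23; do
    ceph tell osd.$id injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
done
```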

Thank you so much for the help

Regards
Pierre

On 01/07/2014 23:51, Samuel Just wrote:

Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
<[email protected]> wrote:




Hi,

I attach:
      - osd.20 is one of the OSDs that I identified as making other OSDs
crash.
      - osd.23 is one of the OSDs that crashes when I start osd.20.
      - mds is one of my MDSes.

I truncated the log files because they are too big. Everything is here:
https://blondeau.users.greyc.fr/cephlog/

Regards

On 30/06/2014 17:35, Gregory Farnum wrote:

What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to
upgrade
to unnamed versions like 0.82 (but it's probably too late to go
back
now).




I will remember that next time ;)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
<[email protected]> wrote:



Hi,

After the upgrade to firefly, I had some PGs stuck in the peering state.
I saw the release of 0.82, so I tried upgrading to solve my problem.

My three MDSes crash, and some OSDs trigger a chain reaction that kills
other OSDs. I think my MDSes will not start because their metadata are on
the OSDs.

I have 36 OSDs across three servers, and I identified 5 OSDs that make the
others crash. If I do not start those, the cluster goes into a recovery
state with 31 OSDs, but I have 378 PGs in down+peering state.

What can I do? Would you like more information (OS, crash logs, etc.)?

Regards




--
----------------------------------------------
Pierre BLONDEAU
Systems & Network Administrator
Université de Caen
Laboratoire GREYC, Département d'informatique

tel     : 02 31 56 75 42
office  : Campus 2, Science 3, 406
----------------------------------------------











_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
