Dear List,
We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
composed of 12 nodes; each node has 10 OSDs with journals on disk.
We have one RBD pool and a RadosGW with two data pools, one replicated and one
EC (8+2).
A few details on our cluster are included below.
Currently, our cluster is not usable at all due to severe OSD instability:
OSD daemons die randomly with "hit suicide timeout". Yesterday, every one
of the 120 OSDs died at least 12 times (max 74 times), around 40 times on
average.
Here are the logs from a ceph mon and from one OSD:
http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
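For what it's worth, the per-OSD death counts above can be tallied by grepping
each OSD log for the suicide-timeout message; a rough sketch (the log path is
the Ubuntu default and may differ on your setup):

```shell
# Count "hit suicide timeout" occurrences in each OSD log.
# /var/log/ceph/ceph-osd.*.log is an assumption (default Debian/Ubuntu path).
for f in /var/log/ceph/ceph-osd.*.log; do
  printf '%s: %s\n' "$f" "$(grep -c 'hit suicide timeout' "$f")"
done
```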
We stopped all client I/O to see if the cluster would stabilize, without
success. To avoid endless rebalancing caused by the OSD flapping, we had to
set the "noout" flag on the cluster. For now we have no idea what's going on.
Can anyone help us understand what's happening?
Thanks for your help.
--
Yoann Moulin
EPFL IC-IT
$ ceph --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
$ uname -a
Linux icadmin004 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4927 flags hashpspool stripe_width 0
removed_snaps [1~3]
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 258 flags hashpspool stripe_width 0
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 259 flags hashpspool stripe_width 0
pool 5 'default.rgw.data.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 260 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 6 'default.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 261 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 262 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 8 'erasure.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 271 flags hashpspool stripe_width 0
pool 9 'erasure.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 272 flags hashpspool stripe_width 0
pool 11 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 276 flags hashpspool stripe_width 0
pool 12 'default.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 277 flags hashpspool stripe_width 0
pool 14 'default.rgw.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 311 flags hashpspool stripe_width 0
pool 15 'default.rgw.users.keys' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 313 flags hashpspool stripe_width 0
pool 16 'default.rgw.meta' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 315 flags hashpspool stripe_width 0
pool 17 'default.rgw.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 320 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 18 'default.rgw.users.email' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 322 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 19 'default.rgw.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 353 flags hashpspool stripe_width 0
pool 20 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4918 flags hashpspool stripe_width 0
pool 26 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3549 flags hashpspool stripe_width 0
pool 27 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3551 flags hashpspool stripe_width 0
pool 28 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3552 flags hashpspool stripe_width 0
pool 29 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3553 flags hashpspool stripe_width 0
pool 30 'test' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 4910 flags hashpspool stripe_width 0
pool 31 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 4912 flags hashpspool stripe_width 0
pool 34 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 26021 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 35 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 26019 flags hashpspool stripe_width 0
pool 37 'erasure.rgw.buckets' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 31463 flags hashpspool stripe_width 0
pool 38 'default.rgw.buckets' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 31466 flags hashpspool stripe_width 0
pool 39 'erasure.rgw.buckets.data' erasure size 10 min_size 8 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 31469 flags hashpspool stripe_width 4096
$ ceph -s
cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
health HEALTH_ERR
95 pgs are stuck inactive for more than 300 seconds
1577 pgs degraded
15 pgs down
15 pgs peering
95 pgs stuck inactive
1497 pgs stuck unclean
1577 pgs undersized
1 requests are blocked > 32 sec
recovery 14191345/255016286 objects degraded (5.565%)
recovery 1595762/255016286 objects misplaced (0.626%)
7/120 in osds are down
noout,sortbitwise flag(s) set
monmap e1: 3 mons at {node002.cluster.localdomain=10.90.37.3:6789/0,node010.cluster.localdomain=10.90.37.11:6789/0,node018.cluster.localdomain=10.90.37.19:6789/0}
election epoch 64, quorum 0,1,2 node002.cluster.localdomain,node010.cluster.localdomain,node018.cluster.localdomain
fsmap e131: 1/1/1 up {0=node022.cluster.localdomain=up:active}, 2 up:standby
osdmap e72842: 144 osds: 137 up, 120 in; 16 remapped pgs
flags noout,sortbitwise
pgmap v4819062: 9408 pgs, 28 pools, 153 TB data, 75849 kobjects
449 TB used, 203 TB / 653 TB avail
14191345/255016286 objects degraded (5.565%)
1595762/255016286 objects misplaced (0.626%)
7810 active+clean
1497 active+undersized+degraded
80 undersized+degraded+peered
15 down+remapped+peering
4 active+clean+scrubbing
2 active+clean+scrubbing+deep
client io 0 B/s wr, 0 op/s rd, 23 op/s wr
$ ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
653T 203T 449T 68.83
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 50122G 21.18 38608G 12912190
.rgw.root 3 2856 0 38608G 15
default.rgw.control 4 0 0 38608G 12
default.rgw.data.root 5 19800 0 38608G 64
default.rgw.gc 6 0 0 38608G 34
default.rgw.log 7 0 0 38608G 285
erasure.rgw.buckets.index 8 0 0 38608G 6
erasure.rgw.buckets.extra 9 0 0 38608G 119
default.rgw.buckets.index 11 0 0 38608G 49
default.rgw.buckets.extra 12 0 0 38608G 115
default.rgw.users.uid 14 3817 0 38608G 12
default.rgw.users.keys 15 206 0 38608G 17
default.rgw.meta 16 40330 0 38608G 127
default.rgw.users.swift 17 21 0 38608G 2
default.rgw.users.email 18 79 0 38608G 6
default.rgw.usage 19 0 0 38608G 6
default.rgw.buckets.data 20 99929G 42.28 38608G 61525581
.rgw.control 26 0 0 38608G 8
.rgw 27 0 0 38608G 0
.rgw.gc 28 0 0 38608G 0
.log 29 0 0 38608G 0
test 30 0 0 38608G 0
data 31 5478M 0 38608G 87663
cephfs_data 34 0 0 38608G 0
cephfs_metadata 35 2068 0 38608G 20
erasure.rgw.buckets 37 0 0 38608G 0
default.rgw.buckets 38 0 0 38608G 0
erasure.rgw.buckets.data 39 7604G 1.35 92661G 3143729
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com