Hi,
my Ceph version is 0.72.2, running on Scientific Linux with kernel
2.6.32-431.1.2.el6.x86_64.
After a network problem on all my nodes, the OSDs flap between up and down
periodically. I had to set the nodown flag to stabilize the cluster. I have a
public_network and a cluster_network configured.
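For reference, I set the flag with the standard CLI command (to be cleared again once heartbeats recover):

```shell
# keep the monitors from marking the flapping OSDs down
ceph osd set nodown

# once the cluster is stable again, clear the flag:
# ceph osd unset nodown
```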
I see this message on most of the OSDs:
2014-06-23 08:08:59.750879 7f6bd3661700 -1 osd.y 53377 heartbeat_check: no
reply from osd.xxx ever on either front or back, first ping sent
2014-06-22 20:06:10.055264 (cutoff 2014-06-23 08:08:24.750744)
The cluster status:

  cluster b71fecc6-0323-4f08-8b49-e8ed1ff2d4ce
   health HEALTH_WARN 1 pgs backfill; 73 pgs down; 196 pgs peering; 196 pgs
          stuck inactive; 197 pgs stuck unclean; recovery 592/2459924 objects
          degraded (0.024%); nodown flag(s) set
   monmap e5: 3 mons at
          {bb-e19-x4=10.257.53.236:6789/0,cephfrontux1-r=10.257.53.241:6789/0,cephfrontux2-r=10.257.53.242:6789/0},
          election epoch 202, quorum 0,1,2 bb-e19-x4,cephtux1-r,cephtux2-r
   osdmap e53377: 34 osds: 33 up, 33 in
          flags nodown
    pgmap v5928500: 5596 pgs, 5 pools, 4755 GB data, 1212 kobjects
          9466 GB used, 17248 GB / 26715 GB avail
          592/2459924 objects degraded (0.024%)
              5398 active+clean
                 1 active+remapped+wait_backfill
               123 peering
                73 down+peering
                 1 active+clean+scrubbing
Grepping the OSD logs:

grep heartbeat_check ceph-osd.*.log | awk '{print $5, $7, "problem", $11}' | sort -u
osd.10 heartbeat_check: problem osd.0
osd.10 heartbeat_check: problem osd.11
osd.10 heartbeat_check: problem osd.19
.....
It is the same for most of the OSD logs.
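The parsing can be checked on its own by feeding awk the example log line quoted above (osd.y and osd.xxx are my placeholders, not real OSD names):

```shell
# extract reporting OSD, check name, and unreachable OSD from a heartbeat line
echo '2014-06-23 08:08:59.750879 7f6bd3661700 -1 osd.y 53377 heartbeat_check: no reply from osd.xxx ever on either front or back, first ping sent 2014-06-22 20:06:10.055264 (cutoff 2014-06-23 08:08:24.750744)' \
  | awk '{print $5, $7, "problem", $11}' | sort -u
# -> osd.y heartbeat_check: problem osd.xxx
```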
I added some options to ceph.conf, but nothing changed:
[osd]
osd_heartbeat_grace = 35
osd_min_down_reports = 4
osd_heartbeat_addr = 10.157.53.224
mon_osd_down_out_interval = 3000
osd_heartbeat_interval = 12
osd_mkfs_options_xfs = "-f"
mon_osd_min_down_reporters = 3
osd_mkfs_type = xfs
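As far as I know, the [osd] options above only take effect after a daemon restart; the heartbeat values can also be injected into the running OSDs (a sketch, using the same values as in my ceph.conf):

```shell
# push the heartbeat settings into all running OSDs without restarting them
ceph tell osd.* injectargs '--osd_heartbeat_grace 35 --osd_heartbeat_interval 12'
```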
Do you have any idea how to fix this?
--
Eric Mourgaya,
Respectons la planète!
Luttons contre la médiocrité!
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com