That's pretty strange, especially since the monitor is getting the
failure reports. What version are you running? Can you bump up the
monitor debugging and provide its output from around that time?
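For example, "ceph -v" will show which packages you have installed, and
something along these lines should raise the monitor's log level
("storage1" is just one of your mon names, repeat for the others):

  ceph tell mon.storage1 injectargs '--debug-mon 10/10 --debug-ms 1/1'

or set "debug mon = 10/10" under [mon] in ceph.conf and restart the
monitors. Then power the OSD host off again and grab the mon log from
around that time.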
-Greg

On Fri, Feb 20, 2015 at 3:26 AM, Sudarshan Pathak <sushan....@gmail.com> wrote:
> Hello everyone,
>
> I have a cluster running with OpenStack. It has 6 OSDs (3 in each of 2
> different locations). Each pool has a replication size of 3, with 2 copies
> in the primary location and 1 copy in the secondary location.
>
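> Roughly speaking, the placement is driven by a CRUSH rule of the shape
> sketched below ("primary-site" and "secondary-site" are placeholder bucket
> names, not the real ones):
>
>     rule two_site_replicated {
>             ruleset 1
>             type replicated
>             min_size 3
>             max_size 3
>             step take primary-site
>             step chooseleaf firstn 2 type host
>             step emit
>             step take secondary-site
>             step chooseleaf firstn 1 type host
>             step emit
>     }
>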
> Everything is running as expected, but the OSDs are not marked down when I
> power off an OSD server. It has been around an hour now.
> I tried changing the heartbeat settings too.
>
> Can someone point me in the right direction?
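>
> (The heartbeat-related values I ended up with are in the ceph.conf below.
> Whether such overrides actually reach the running daemons can be checked
> on an OSD node via the admin socket, with something like:
>
>     ceph daemon osd.0 config show | grep heartbeat
>
> where osd.0 is just an example.)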
>
> OSD 0 log
> =========
> 2015-02-20 16:20:14.009723 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
> reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
> 16:15:54.607854 (cutoff 2015-02-20 16:19:54.009720)
> 2015-02-20 16:20:15.009908 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
> reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
> 16:15:54.607854 (cutoff 2015-02-20 16:19:55.009907)
> 2015-02-20 16:20:16.010123 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no
> reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
> 16:15:54.607854 (cutoff 2015-02-20 16:19:56.010119)
> 2015-02-20 16:20:16.648167 7f3fc9a76700 -1 osd.0 451 heartbeat_check: no
> reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20
> 16:15:54.607854 (cutoff 2015-02-20 16:19:56.648165)
>
>
> Ceph monitor log
> ================
> 2015-02-20 16:49:16.831548 7f416e4aa700  1 mon.storage1@1(leader).osd e455
> prepare_failure osd.2 192.168.100.33:6800/24431 from osd.4
> 192.168.100.35:6800/1305 is reporting failure:1
> 2015-02-20 16:49:16.831593 7f416e4aa700  0 log_channel(cluster) log [DBG] :
> osd.2 192.168.100.33:6800/24431 reported failed by osd.4
> 192.168.100.35:6800/1305
> 2015-02-20 16:49:17.080314 7f416e4aa700  1 mon.storage1@1(leader).osd e455
> prepare_failure osd.2 192.168.100.33:6800/24431 from osd.3
> 192.168.100.34:6800/1358 is reporting failure:1
> 2015-02-20 16:49:17.080527 7f416e4aa700  0 log_channel(cluster) log [DBG] :
> osd.2 192.168.100.33:6800/24431 reported failed by osd.3
> 192.168.100.34:6800/1358
> 2015-02-20 16:49:17.420859 7f416e4aa700  1 mon.storage1@1(leader).osd e455
> prepare_failure osd.2 192.168.100.33:6800/24431 from osd.5
> 192.168.100.36:6800/1359 is reporting failure:1
>
>
> #ceph osd stat
>      osdmap e455: 6 osds: 6 up, 6 in
>
>
> #ceph -s
>     cluster c8a5975f-4c86-4cfe-a91b-fac9f3126afc
>      health HEALTH_WARN 528 pgs peering; 528 pgs stuck inactive; 528 pgs
> stuck unclean; 1 requests are blocked > 32 sec; 1 mons down, quorum 1,2,3,4
> storage1,storage2,compute3,compute4
>      monmap e1: 5 mons at
> {admin=192.168.100.39:6789/0,compute3=192.168.100.133:6789/0,compute4=192.168.100.134:6789/0,storage1=192.168.100.120:6789/0,storage2=192.168.100.121:6789/0},
> election epoch 132, quorum 1,2,3,4 storage1,storage2,compute3,compute4
>      osdmap e455: 6 osds: 6 up, 6 in
>       pgmap v48474: 3650 pgs, 19 pools, 27324 MB data, 4420 objects
>             82443 MB used, 2682 GB / 2763 GB avail
>                 3122 active+clean
>                  528 remapped+peering
>
>
>
> Ceph.conf file
>
> [global]
> fsid = c8a5975f-4c86-4cfe-a91b-fac9f3126afc
> mon_initial_members = admin, storage1, storage2, compute3, compute4
> mon_host =
> 192.168.100.39,192.168.100.120,192.168.100.121,192.168.100.133,192.168.100.134
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> osd pool default size = 3
> osd pool default min size = 3
>
> osd pool default pg num = 300
> osd pool default pgp num = 300
>
> public network = 192.168.100.0/24
>
> rgw print continue = false
> rgw enable ops log = false
>
> mon osd report timeout = 60
> mon osd down out interval = 30
> mon osd min down reports = 2
>
> osd heartbeat grace = 10
> osd mon heartbeat interval = 20
> osd mon report interval max = 60
> osd mon ack timeout = 15
>
> mon osd min down reports = 2
>
>
> Regards,
> Sudarshan Pathak
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
