Firstly, any chance of getting node4 and node5 back up?  You can move the
disks (monitor and OSD) to a new chassis and bring them back up.  As long
as each replacement has the same IP as the original node4 or node5, the
monitors should rejoin.
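If you go that route, here is a rough sketch of how to check that the
transplanted monitors come back, assuming the default data paths under
/var/lib/ceph/mon/ceph-<id> and /var/lib/ceph/osd/ceph-<N>, and mon ids
matching the hostnames (node4/node5):

    # on the rebuilt node, after mounting the old mon/osd disks in place:
    service ceph start mon.node4      # sysvinit
    # or:  start ceph-mon id=node4    # upstart, depending on your distro

    # from any node whose ceph CLI still works (e.g. node1):
    ceph mon stat                              # node4/node5 should be listed again
    ceph quorum_status --format json-pretty    # and appear in the quorum list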

How much is the clock skewed on node2?  I haven't had problems with small
skew (~100 ms), but I've seen posts to the mailing list about large skews
(minutes) causing quorum and authentication problems.
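A quick way to check, assuming ntpd (or similar) is what keeps the clocks
in sync on your nodes:

    # on node2: is NTP actually syncing, and what is the current offset?
    ntpq -p

    # from node1: the monitors report how large they think the skew is
    ceph health detail | grep -i skew

The monitors start warning once the skew exceeds mon_clock_drift_allowed
(0.05 s by default), but as above, it usually takes a much larger skew
before quorum or authentication actually break.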

When you say "on node3 every ceph * command hangs", do you by chance mean
node2 instead of node3?  If so, that supports clock skew being the problem,
since it can prevent both the commands and the OSDs from authenticating
with the monitors.

If you really did mean node3, then something else strange is going on.
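Either way, on a node where the ceph CLI hangs it is worth querying that
node's own monitor through its admin socket, which does not go through
cephx or the quorum.  A rough sketch, assuming the default socket path and
mon ids matching the hostnames:

    # on node3 (same idea on node2)
    ceph --admin-daemon /var/run/ceph/ceph-mon.node3.asok mon_status
    ceph --admin-daemon /var/run/ceph/ceph-mon.node3.asok quorum_status

If mon_status shows the monitor stuck in probing or electing rather than
sitting in quorum as leader/peon, the problem is that local monitor; if the
monitor looks healthy, the hang is more likely on the client or
authentication side.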



On Mon, Nov 17, 2014 at 7:07 AM, NEVEU Stephane <
[email protected]> wrote:

> Hi all :) ,
>
>
>
> I need some help, I'm in a sad situation: I've lost 2 Ceph server nodes
> physically (5 nodes initially / 5 monitors). So 3 nodes are left: node1,
> node2, node3.
>
> On the first remaining node, I've updated the crush map to remove every
> OSD running on those 2 lost servers:
>
> ceph osd crush remove osds && ceph auth del osds && ceph osd rm osds &&
> ceph osd remove my2Lostnodes
>
> So the crush map seems to be ok now on node1.
>
> ceph osd tree on node1 shows every OSD running on node2 as "down 1", and
> every OSD on node1 and node3 as "up 1". Nevertheless, on node3 every ceph
> * command hangs, so I'm not sure the crush map has been updated on node2
> and node3. I don't know how to bring the OSDs on node2 up again.
>
> My node2 says it cannot connect to the cluster !
>
>
>
> ceph -s on node1 gives me (so still 5 monitors):
>
>
>
>     cluster 45d9195b-365e-491a-8853-34b46553db94
>      health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean;
> recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down;
> noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew
> detected on mon.node2
>      monmap e1: 5 mons at {node1=
> 172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0},
> election epoch 488, quorum 0,1,2 node1,node2,node3
>      mdsmap e48: 1/1/1 up {0=node3=up:active}
>      osdmap e3852: 33 osds: 22 up, 33 in
>             flags noout
>       pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects
>             2122 GB used, 90051 GB / 92174 GB avail
>             181055/544038 objects degraded (33.280%)
>                10016 active+degraded
>   client io 0 B/s rd, 233 kB/s wr, 22 op/s
>
>
>
>
>
> Thx for your help !!
>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
