Hi,

I got a recommendation from Stephan to restart the OSDs one by one.
So I did. It helped a bit (some IOs completed), but in the end the state was
the same as before, and new IOs still hung.
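
For reference, the one-by-one restart can be sketched roughly as below. This is only an illustration, not what was actually typed: it assumes the sysvinit-style "service ceph restart osd.N" available on this 0.80.7 install, it only handles OSDs local to the host it runs on, and the function names and the SETTLE_TRIES / SETTLE_DELAY variables are hypothetical. The string matching on "ceph health" output is a heuristic, not an official interface.

```shell
# Sketch only: restart local OSDs one at a time, waiting between restarts.
# SETTLE_TRIES and SETTLE_DELAY are hypothetical knobs (poll count, seconds).

wait_for_settle() {
    # Poll "ceph health" until it no longer mentions PGs or blocked requests.
    tries=${SETTLE_TRIES:-60}
    delay=${SETTLE_DELAY:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        status=$(ceph health)
        case "$status" in
            HEALTH_OK*) return 0 ;;
            # Heuristic: keep waiting while PGs or blocked requests are reported.
            *pgs*|*blocked*|HEALTH_ERR*) ;;
            # Remaining warnings (e.g. only noscrub flags) are treated as settled.
            HEALTH_WARN*) return 0 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

restart_local_osds() {
    # Usage (on the host that owns the OSDs): restart_local_osds 0 4
    for id in "$@"; do
        echo "restarting osd.$id"
        service ceph restart "osd.$id" || return 1
        if ! wait_for_settle; then
            echo "cluster did not settle after osd.$id" >&2
            return 1
        fi
    done
}
```

The point of waiting between restarts is to avoid taking a second OSD down while PGs are still peering or recovering from the first one.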

Loïc, thanks for the advice on bringing osd.0 and osd.4 back into the game.
 
Actually this was done by simply restarting ceph on that node:
[root@qvitblhat12 ~]# date;service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
=== osd.4 ===
osd.4: running {"version":"0.80.7"}
[root@qvitblhat12 ~]# date;service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location
{host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location
{host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight    type name    up/down    reweight
-1    1.62    root default
-5    1.08        datacenter dc_XAT
-2    0.54            host qvitblhat10
1    0.27                osd.1    up    1    
5    0.27                osd.5    up    1    
-4    0.54            host qvitblhat12
0    0.27                osd.0    up    1    
4    0.27                osd.4    up    1    
-6    0.54        datacenter dc_QVI
-3    0.54            host qvitblhat11
2    0.27                osd.2    up    1    
3    0.27                osd.3    up    1    
[root@qvitblhat06 ~]#

This change made ceph rebalance the data, and then came the miracle: all PGs
ended up active+clean.

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set
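
The only remaining warning is about the scrub flags. Assuming they were set on purpose during troubleshooting and that scrubbing should resume once the cluster has stayed stable, a minimal sketch of clearing them follows. The function name is hypothetical; "ceph osd unset" with these two flag names is the standard way to clear them.

```shell
# Sketch only: clear the two scrub flags reported by "ceph health detail".
# Run from a node with an admin keyring.
clear_scrub_flags() {
    for flag in noscrub nodeep-scrub; do
        ceph osd unset "$flag" || return 1
    done
}
```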

Well, apart from being happy that the cluster is now healthy, I find it a little
bit scary to have to shake it in one direction and then another
and hope that it will eventually recover, while in the meantime my users' IOs are
stuck...

So is there a way to understand what happened?

Francois