Hi James...
I am assuming that you have properly removed the dead OSDs from the
crush map.
I've tested a scenario like the one you describe and found that the PGs
never leave the creating state until you restart all OSDs.
Have you done that?
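If not, the usual sequence to fully remove a dead OSD is roughly the
following (osd.12 is just a placeholder id, repeat for each dead OSD):

   ceph osd crush remove osd.12
   ceph auth del osd.12
   ceph osd rm 12

and then restart the OSD daemons on every node, e.g. on systemd hosts

   sudo systemctl restart ceph-osd.target

(or something like 'sudo service ceph restart osd' on sysvinit/upstart
installs).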
Cheers
Goncalo
On 10/15/2015 01:54 AM, James Green wrote:
Hello,
We recently had 2 nodes go down in our Ceph cluster; one was repaired,
and the other had all 12 OSDs destroyed when it went down. We brought
everything back online, and several PGs were showing as down+peering
as well as down. After marking the failed OSDs as lost and removing
them from the cluster, we now have around 90 PGs showing as
incomplete. At this point we just want to get the cluster back up and
into a healthy state. I tried recreating the PGs using
force_create_pg, and now they are all stuck in creating.
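For reference, the rough sequence I used was along these lines (the
OSD id and PG id here are just examples):

   ceph osd lost 12 --yes-i-really-mean-it
   ceph osd crush remove osd.12
   ceph osd rm 12
   ceph pg force_create_pg 2.182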
PG dump shows 90 PGs, all with the same output:
2.182 0 0 0 0 0 0 0 0 creating 2015-10-14 10:31:28.832527 0'0 0:0 [] -1 [] -1 0'0 0.000000 0'0 0.000000
When I ran pg query on one of the PGs, I noticed that one of the
failed OSDs was listed under "down_osds_we_would_probe". I have
already removed that OSD from the cluster, and trying to mark it as
lost says the OSD does not exist.
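(For what it's worth, I'm checking this roughly like so, with 2.182 as
one example PG:

   ceph pg 2.182 query | grep -A 5 down_osds_we_would_probe

and the id of the OSD I already removed still shows up in that list.)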
Here is my crush map: http://pastebin.com/raw.php?i=vyk9vMT1
Why are the PGs trying to query OSDs that have been lost and removed
from the cluster?
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com