Hi,

What is min_size for this pool? Maybe you need to decrease it for the cluster to 
start recovering.
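
Something along these lines ("rbd" is just a placeholder here, substitute your 
actual pool name, and only lower min_size temporarily while recovery runs):

$ ceph osd pool get rbd min_size
$ ceph osd pool set rbd min_size 1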

Arvydas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Scott 
Laird
Sent: Wednesday, February 10, 2016 7:22 AM
To: ceph-users@lists.ceph.com <ceph-users@lists.ceph.com>
Subject: [ceph-users] Can't fix down+incomplete PG

I lost a few OSDs recently.  Now my cluster is unhealthy and I can't figure out 
how to get it healthy again.

OSDs 3, 7, 10, and 40 died in a power outage.  Now I have 10 PGs that are 
down+incomplete, but all of them seem like they should have surviving replicas 
of all data.

I'm running 9.2.0.

$ ceph health detail | grep down
pg 18.c1 is down+incomplete, acting [11,18,9]
pg 18.47 is down+incomplete, acting [11,9,22]
pg 18.1d7 is down+incomplete, acting [5,31,24]
pg 18.1d6 is down+incomplete, acting [22,11,5]
pg 18.2af is down+incomplete, acting [19,24,18]
pg 18.2dd is down+incomplete, acting [15,11,22]
pg 18.2de is down+incomplete, acting [15,17,11]
pg 18.3e is down+incomplete, acting [25,8,18]
pg 18.3d6 is down+incomplete, acting [22,39,24]
pg 18.3e6 is down+incomplete, acting [9,23,8]

$ ceph pg 18.c1 query
{
    "state": "down+incomplete",
    "snap_trimq": "[]",
    "epoch": 960905,
    "up": [
        11,
        18,
        9
    ],
    "acting": [
        11,
        18,
        9
    ],
    "info": {
        "pgid": "18.c1",
        "last_update": "0'0",
        "last_complete": "0'0",
        "log_tail": "0'0",
        "last_user_version": 0,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": "[]",
        "history": {
            "epoch_created": 595523,
            "last_epoch_started": 954170,
            "last_epoch_clean": 954170,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 959988,
            "same_interval_since": 959988,
            "same_primary_since": 959988,
            "last_scrub": "613947'7736",
            "last_scrub_stamp": "2015-11-11 21:18:35.118057",
            "last_deep_scrub": "613947'7736",
            "last_deep_scrub_stamp": "2015-11-11 21:18:35.118057",
            "last_clean_scrub_stamp": "2015-11-11 21:18:35.118057"
        },
...
            "probing_osds": [
                "9",
                "11",
                "18",
                "23",
                "25"
            ],
            "down_osds_we_would_probe": [
                7,
                10
            ],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2016-02-09 20:35:57.627376"
        }
    ],
    "agent_state": {}
}

I tried replacing disks.  I created new OSDs 3 and 7, but neither will start up; 
the ceph-osd process starts but never actually makes it to 'up', and there's 
nothing obvious in the logs.  I can post logs if that helps.  Since the OSDs were 
removed a few days ago, 'ceph osd lost' doesn't seem to help.
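
(For reference, what I tried was along these lines, with 7 standing in for each 
of the dead OSD ids:)

$ ceph osd lost 7 --yes-i-really-mean-it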

Is there a way to fix these PGs and get my cluster healthy again?


Scott
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com