On 13.05.19 10:51 PM, Lionel Bouton wrote:
On 13/05/2019 at 16:20, Kevin Flöh wrote:
Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: one OSD daemon could not be started, so we decided to mark the OSD as lost and set it up from scratch. Ceph started recovering, and then we lost another OSD with the same symptoms. We handled it the same way as the first one.
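
For reference, the usual sequence for marking a dead OSD lost and removing it before recreating it looks roughly like this (osd.4 is only an example id; exact steps vary by deployment tool):

    ceph osd out 4
    ceph osd lost 4 --yes-i-really-mean-it
    # Luminous and later: purge removes the CRUSH entry, auth key and osd id in one step
    ceph osd purge 4 --yes-i-really-mean-it
    # afterwards the disk can be redeployed, e.g. with ceph-volume or ceph-deploy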

With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 OSDs; having 2 OSDs fail at the same time on 2 separate servers (assuming standard CRUSH rules) is a death sentence for the data on any pgs that use both of those OSDs (the ones not fully recovered before the second failure).
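
To double-check what the profile allows, the pool's erasure-code profile can be queried directly (pool and profile names below are placeholders):

    ceph osd pool get <poolname> erasure_code_profile
    ceph osd erasure-code-profile get <profilename>
    # with k=3 m=1 a pg can rebuild any single missing shard,
    # but two missing shards of the same pg mean lost data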

OK, so the 2 OSDs (4, 23) failed shortly one after the other, but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs were on both OSDs. We think that we still have enough shards left. For one of the pgs, the recovery state looks like this:

    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Incomplete",
            "enter_time": "2019-05-09 16:11:48.625966",
            "comment": "not enough complete instances of this PG"
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2019-05-09 16:11:48.611171",
            "past_intervals": [
                {
                    "first": "49767",
                    "last": "59313",
                    "all_participants": [
                        {
                            "osd": 2,
                            "shard": 0
                        },
                        {
                            "osd": 4,
                            "shard": 1
                        },
                        {
                            "osd": 23,
                            "shard": 2
                        },
                        {
                            "osd": 24,
                            "shard": 0
                        },
                        {
                            "osd": 72,
                            "shard": 1
                        },
                        {
                            "osd": 79,
                            "shard": 3
                        }
                    ],
                    "intervals": [
                        {
                            "first": "58860",
                            "last": "58861",
                            "acting": "4(1),24(0),79(3)"
                        },
                        {
                            "first": "58875",
                            "last": "58877",
                            "acting": "4(1),23(2),24(0)"
                        },
                        {
                            "first": "59002",
                            "last": "59009",
                            "acting": "4(1),23(2),79(3)"
                        },
                        {
                            "first": "59010",
                            "last": "59012",
                            "acting": "2(0),4(1),23(2),79(3)"
                        },
                        {
                            "first": "59197",
                            "last": "59233",
                            "acting": "23(2),24(0),79(3)"
                        },
                        {
                            "first": "59234",
                            "last": "59313",
                            "acting": "23(2),24(0),72(1),79(3)"
                        }
                    ]
                }
            ],
            "probing_osds": [
                "2(0)",
                "4(1)",
                "23(2)",
                "24(0)",
                "72(1)",
                "79(3)"
            ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound"
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2019-05-09 16:11:48.611121"
        }
    ],
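
The recovery_state above comes from ceph pg query; to pull out just that section, something like this works (<pgid> is the stuck pg, jq is optional):

    ceph pg <pgid> query | jq '.recovery_state'
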
Is there a chance to recover this pg from the shards on OSDs 2, 72, 79? ceph pg repair/deep-scrub/scrub did not work.
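
Regarding the "peering_blocked_by_history_les_bound" detail shown above: the workaround usually mentioned for this state is the osd_find_best_info_ignore_history_les option. It is dangerous (it tells the OSD to ignore the last_epoch_started history check during peering and can discard recent writes), so it should only be tried with guidance from the Ceph developers and after the surviving shards have been exported as a backup (see the export sketch further down). A rough sketch, with osd.2 standing in for each OSD in the acting set of the stuck pg:

    # in ceph.conf on the affected host, then restart that OSD;
    # remove the option again once the pg has peered
    [osd.2]
    osd_find_best_info_ignore_history_les = true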

We are also worried about the MDS being behind on trimming, or is this not too problematic?


MDS_TRIM 1 MDSs behind on trimming
    mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128) max_segments: 128, num_segments: 46178
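
The huge num_segments (46178 vs. a max of 128) means the MDS is not trimming its journal fast enough; if the incomplete pgs belong to a pool the MDS writes to, trimming is likely blocked by them and should catch up once the pgs are healthy again. The current setting and journal counters can be checked on the active MDS via the admin socket (daemon name is a placeholder):

    ceph daemon mds.<name> config get mds_log_max_segments
    ceph daemon mds.<name> perf dump mds_log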


Depending on the data stored (CephFS?), you can probably recover most of it, but some of it is irretrievably lost.

If you can recover the data from the failed OSDs as it was at the time they failed, you might be able to recover some of your lost data (with the help of the Ceph devs); if not, there is nothing to be done.
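
If the disks of the failed OSDs are still readable, the usual approach is to export the missing shards offline with ceph-objectstore-tool and import them into a (stopped) healthy OSD. A rough sketch, with paths, ids and the pgid as placeholders (for EC pools the pgid includes the shard suffix, e.g. s1; FileStore OSDs also need --journal-path):

    # on the host of the failed OSD, with the OSD stopped:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --pgid <pgid> --op export --file /tmp/<pgid>.export
    # copy the file to a host with a healthy OSD, stop that OSD, then:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-24 \
        --op import --file /tmp/<pgid>.export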

In the latter case, I'd add a new server, use at least 3+2 for a fresh pool instead of 3+1, and begin moving the data to it.
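
For reference, a fresh 3+2 pool could be set up roughly like this once enough servers are available (profile, pool and fs names as well as the pg count are only examples):

    ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
    ceph osd pool create cephfs_data_ec32 1024 1024 erasure ec-3-2
    # if the pool is used directly as a CephFS data pool:
    ceph osd pool set cephfs_data_ec32 allow_ec_overwrites true
    ceph fs add_data_pool <fsname> cephfs_data_ec32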

The 12.2 + 13.2 version mix is a potential problem in addition to the one above, but it's a separate issue.

Best regards,

Lionel

The idea for the future is to set up a new Ceph cluster with 3+2 erasure coding on 8 servers in total, and of course with consistent versions on all nodes.


Best regards,

Kevin
