Right, if you think about it, any objects written while 1,2,3 were down really do require 4 to recover. You can reduce the risk of this by setting min_size to something greater than 8, but the PG also won't go active (and hence can't recover) with fewer than min_size shards available. So if you set min_size to 9 and lose 1,2,3, you won't have lost data, but the PG won't recover until you reduce min_size. It's mainly there so that you won't accept writes during a brief outage that brings you down to 8. Note: I think you could have marked osd 8 lost and then marked the unrecoverable objects lost. -Sam
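For reference, the knobs Sam mentions correspond to commands along these lines (a sketch; "ecpool" and the pg id are placeholders, and mark_unfound_lost permanently gives up on the affected objects):

    # for an 8+3 pool, require 9 available shards before the PG accepts I/O
    ceph osd pool set ecpool min_size 9

    # declare osd.8 permanently lost so peering can proceed without it
    ceph osd lost 8 --yes-i-really-mean-it

    # then give up on any objects that can no longer be reconstructed
    ceph pg <pgid> mark_unfound_lost delete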
On Thu, Nov 13, 2014 at 11:20 AM, GuangYang <[email protected]> wrote:
> Thanks Sam for the quick response. Just want to make sure I understand it
> correctly:
>
> If we have [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] and all of 1,2,3 are down,
> the PG is active as we are using 8 + 3, and once 4 is down, even though we
> bring up 1,2,3, the PG could not become active unless we bring 4 up. Is my
> understanding correct here?
>
> Thanks,
> Guang
>
> ----------------------------------------
>> Date: Thu, 13 Nov 2014 09:06:27 -0800
>> Subject: Re: PG down
>> From: [email protected]
>> To: [email protected]
>> CC: [email protected]
>>
>> It looks like the acting set went down to the min allowable size and
>> went active with osd 8. At that point you needed every member of that
>> acting set to go active later on to avoid losing writes. You can
>> prevent this by setting a min_size above the number of data chunks.
>> -Sam
>>
>> On Thu, Nov 13, 2014 at 4:15 AM, GuangYang <[email protected]> wrote:
>>> Hi Sam,
>>> Yesterday there was one PG down in our cluster and I am confused by the
>>> PG state. I am not sure if it is a bug (or an issue that has already
>>> been fixed, as I see a couple of related fixes in giant); it would be
>>> nice if you could help take a look.
>>>
>>> Here is what happened:
>>>
>>> We are using an EC pool with 8 data chunks and 3 coding chunks. Say the
>>> PG has up/acting set [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. One OSD in
>>> the set went down and came back up, which triggered PG recovery.
>>> However, during recovery the primary OSD crashed due to a corrupted
>>> file chunk, then another OSD became primary, started recovery and
>>> crashed, and so on, until there were 4 OSDs down in the set and the PG
>>> was marked down.
>>>
>>> After that, we left the OSD with the corrupted data down and started
>>> all the other crashed OSDs. We expected the PG to become active;
>>> however, it is still down with the following query information:
>>>
>>> { "state": "down+remapped+inconsistent+peering",
>>>   "epoch": 4469,
>>>   "up": [
>>>       377,
>>>       107,
>>>       328,
>>>       263,
>>>       395,
>>>       467,
>>>       352,
>>>       475,
>>>       333,
>>>       37,
>>>       380],
>>>   "acting": [
>>>       2147483647,
>>>       107,
>>>       328,
>>>       263,
>>>       395,
>>>       2147483647,
>>>       352,
>>>       475,
>>>       333,
>>>       37,
>>>       380],
>>>   ...
>>>       377]}],
>>>   "probing_osds": [
>>>       "37(9)",
>>>       "107(1)",
>>>       "263(3)",
>>>       "328(2)",
>>>       "333(8)",
>>>       "352(6)",
>>>       "377(0)",
>>>       "380(10)",
>>>       "395(4)",
>>>       "467(5)",
>>>       "475(7)"],
>>>   "blocked": "peering is blocked due to down osds",
>>>   "down_osds_we_would_probe": [
>>>       8],
>>>   "peering_blocked_by": [
>>>       { "osd": 8,
>>>         "current_lost_at": 0,
>>>         "comment": "starting or marking this osd lost may let us proceed"}]},
>>>   { "name": "Started",
>>>     "enter_time": "2014-11-12 10:12:23.067369"}],
>>> }
>>>
>>> Here osd.8 is the one with the corrupted data.
>>>
>>> The way we worked around the issue was to set norecover and start
>>> osd.8, get the PG active, remove the object (via rados), then unset
>>> norecover, and things became clean again. But the most confusing part
>>> is that even when only osd.8 was left down, the PG couldn't become
>>> active.
>>>
>>> We are using firefly v0.80.4.
>>>
>>> Thanks,
>>> Guang
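The workaround Guang describes at the end of the quoted thread maps roughly to the following (a sketch; the pool and object names are placeholders):

    ceph osd set norecover        # pause recovery so the bad chunk on osd.8 isn't read
    # start osd.8 and wait for the PG to go active, then:
    rados -p ecpool rm <corrupted-object>
    ceph osd unset norecover      # resume normal recovery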
