Actually, the pg was mapped to osd 1 the whole time; osd 1 was just filtered out because it was down. More generally, even if you mark osd 1 out and then back in, the pg will end up on osd 1 again, because the osd tree will be back in the same configuration and CRUSH will therefore give the same answer.
-Sam
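
To make the determinism point concrete, here is a minimal sketch in Python (a deterministic stand-in, not the real CRUSH algorithm; pseudo_crush and acting are made-up names). Placement is a pure function of the map and the pg id, and a down-but-in osd is only filtered out of the result, so as soon as the inputs are the same again the answer is the same.

    # Minimal sketch only -- a stand-in for CRUSH, not the real algorithm.
    import hashlib

    def pseudo_crush(osd_ids, pg_id, size=3):
        """Deterministic pseudo-placement: same map + same pg -> same mapping."""
        def score(osd):
            return int(hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest(), 16)
        return sorted(osd_ids, key=score)[:size]

    def acting(osd_ids, pg_id, down=frozenset(), size=3):
        """Down osds are filtered out of the result; the mapping itself is unchanged."""
        return [o for o in pseudo_crush(osd_ids, pg_id, size) if o not in down]

    osds = [1, 2, 3, 4, 5]
    base = pseudo_crush(osds, "3.5")
    print(base)                                 # some deterministic triple, e.g. [1, 2, 3]
    print(acting(osds, "3.5", down={base[0]}))  # first osd filtered out, the rest unchanged
    print(acting(osds, "3.5"))                  # identical to base: same inputs, same answer

Marking an osd out does change the inputs (its weight effectively drops to zero, so CRUSH picks a replacement); marking it back in restores the original inputs, which is why the pg returns to the same osd.
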
----- Original Message -----
From: "Somnath Roy" <[email protected]>
To: "Sage Weil" <[email protected]>
Cc: "Samuel Just ([email protected])" <[email protected]>, "Ceph Development" <[email protected]>
Sent: Monday, February 23, 2015 2:00:04 PM
Subject: RE: Recovery question

Sage,
I don't understand how osd.1 will be mapped to the same pg when it comes back. Will the peering process get the old pg id from osd.1's pg log and map it accordingly? Is this an optimization, so that if an osd is mapped to the same pg we can start from the last_backfill position? Will the mapping go through CRUSH only when a completely new osd joins the cluster? Sorry for bombarding you with so many questions :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:[email protected]]
Sent: Monday, February 23, 2015 1:43 PM
To: Somnath Roy
Cc: Samuel Just ([email protected]); Ceph Development
Subject: RE: Recovery question

On Mon, 23 Feb 2015, Somnath Roy wrote:
> Got it, thanks !
>
> << We'll serve reads and writes with just [2,3] and the pg will show up as
> 'degraded'
> So, the moment osd.1 is down in the map [2,3], 2 will be designated as
> primary? My understanding is that reads/writes won't be served from a
> replica OSD, right?

The moment the mapping becomes [2,3], 2 is now the primary, and it can serve IO. The slow part is here:

> If we had say
>
> 4: [2,3,4]
>
> then we'll get
>
> 5: [1,2,3]
>
> osd will realize osd.1 cannot log recovery and will install a pg_temp
> so that

We can't serve IO with [1,2,3] because 1 is out of date, so there is a lag until it installs the pg_temp record. There is a pull request that will preemptively calculate new mappings and set up pg_temp records, which should mitigate this issue, but it needs some testing, and I think there is still room for improvement (for example, serving IO with a usable but non-optimal pg_temp record while we are waiting for it to be removed). See https://github.com/ceph/ceph/pull/3429

There's a bunch of similar stuff we can do to improve peering latency that we'll want to spend some time on for infernalis.

sage
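
To make the pg_temp mechanism described above concrete, here is a rough sketch in Python (acting_for and the dict layout are made-up names, not the actual OSDMap code): a pg_temp entry simply overrides the CRUSH-computed mapping for one pg until it is removed, which is what lets IO be served by up-to-date osds while an out-of-date osd is backfilled.

    # Rough sketch only; illustrative names, not Ceph's real data structures.
    def acting_for(pg_id, crush_up, pg_temp):
        """acting set = pg_temp override if present, else the CRUSH-derived up set."""
        return pg_temp.get(pg_id, crush_up)

    pg_temp = {}
    up = [1, 2, 3]                          # CRUSH's answer once osd.1 rejoins

    # osd.1 is too far behind for log recovery, so a pg_temp mapping is requested;
    # the lag described above is the window before this entry lands in the osdmap.
    pg_temp["3.5"] = [2, 3, 4]
    print(acting_for("3.5", up, pg_temp))   # [2, 3, 4] -> serves IO while osd.1 backfills

    # Backfill finishes and the pg_temp entry is removed.
    del pg_temp["3.5"]
    print(acting_for("3.5", up, pg_temp))   # [1, 2, 3] -> back to the plain CRUSH mapping
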

> 6: [2,3,4]
>
> and PG will go active (serve IO). When backfill completes, it will
> remove the pg_temp and
>
> 7: [1,2,3]
>
> > 3. Will the flow be similar if one of the replica OSDs goes down
> > instead of the primary in the step '2' I mentioned earlier? Say, osd.2
> > went down instead of osd.1?
>
> Yeah, basically the same. Who is primary doesn't really matter.
>
> sage
>
> > Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: Monday, February 23, 2015 1:03 PM
> > To: Somnath Roy
> > Cc: Samuel Just ([email protected]); Ceph Development
> > Subject: Re: Recovery question
> >
> > On Mon, 23 Feb 2015, Somnath Roy wrote:
> > > Hi,
> > > Can anyone help me understand what will happen in the following scenarios?
> > >
> > > 1. Current PG map : 3.5 -> OSD[1,2,3]
> > >
> > > 2. 1 is down and new map : 3.5 -> OSD[2,3,4]
> >
> > More likely it's:
> >
> > 1: 3.5 -> [1,2,3]
> > 2: 3.5 -> [2,3]   (osd.1 is down)
> > 3: 3.5 -> [2,3,4] (osd.1 is marked out)
> >
> > > 3. Need to do backfill recovery for 4 and it started
> >
> > If log recovery will work, we'll do that and it's nice and quick.
> > If backfill is needed, we will do
> >
> > 4: 3.5 -> [2,3] (up=[2,3,4])  (pg_temp record added to map to log-recoverable OSDs)
> >
> > > 4. Meanwhile OSD 1 came up; it was down for a short amount of time
> >
> > 5: 3.5 -> [1,2,3]  (osd.1 is back up and in)
> >
> > > 5. Will pg 3.5 mapping change considering OSD 1 recovery could be
> > > log based?
> >
> > It will change immediately when osd.1 is back up, regardless of what
> > data is where. If it's log recoverable, then no mapping changes
> > will be needed. If it's not, then
> >
> > 6: 3.5 -> [2,3,4] (up=[1,2,3])  (add pg_temp mapping while we backfill osd.1)
> > 7: 3.5 -> [1,2,3]  (pg_temp entry removed when backfill completes)
> >
> > > 6. Also, if OSD 4 recovery could be log based, will there be any
> > > change in pg map if OSD 1 is up during the recovery?
> >
> > See above.
> >
> > Hope that helps!
> > sage
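
The epoch walkthrough above hinges on whether the returning osd can catch up from the pg log or has to be backfilled. A simplified sketch of that choice (last_update and log_tail are illustrative field names; the real peering logic considers much more state, e.g. last_backfill and missing sets):

    # Simplified sketch of the log-recovery-vs-backfill choice, not the real peering code.
    from dataclasses import dataclass

    @dataclass
    class PgPeerInfo:
        last_update: int   # newest pg log entry this osd has applied
        log_tail: int      # oldest entry still present in the primary's pg log

    def recovery_kind(primary: PgPeerInfo, peer_last_update: int) -> str:
        """Log recovery if the peer diverged within the primary's log window;
        otherwise backfill (and a pg_temp mapping keeps IO on up-to-date osds)."""
        if peer_last_update >= primary.log_tail:
            return "log recovery"   # replay only the missing log entries: quick
        return "backfill"           # copy objects wholesale from last_backfill onward

    primary = PgPeerInfo(last_update=1500, log_tail=1000)
    print(recovery_kind(primary, peer_last_update=1200))  # log recovery (short outage)
    print(recovery_kind(primary, peer_last_update=400))   # backfill (osd was down too long)
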
