On Mon, 23 Feb 2015, Somnath Roy wrote:
> Got it, thanks!
>
> << We'll serve reads and writes with just [2,3] and the pg will show up as
> 'degraded'
> So, the moment osd.1 is down and the map becomes [2,3], 2 will be
> designated as primary? My understanding is that reads/writes won't be
> served from a replica OSD, right?
The moment the mapping becomes [2,3], 2 is now the primary, and it can
serve IO.
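To make that concrete, here is a minimal Python sketch (illustrative only,
not Ceph code; min_size is the usual pool setting and isn't from this
thread): the primary is simply the first OSD in the acting set, and a
degraded set keeps serving IO as long as it meets min_size.

    # Illustrative sketch, not Ceph source: primary = first OSD in acting set.
    def primary_of(acting):
        return acting[0] if acting else None

    acting = [1, 2, 3]               # healthy: osd.1 is primary
    assert primary_of(acting) == 1

    acting = [2, 3]                  # osd.1 down: mapping becomes [2,3]
    assert primary_of(acting) == 2   # osd.2 is now primary and serves IO

    min_size = 2                     # pool min_size
    can_serve_io = len(acting) >= min_size   # True: degraded but still active
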
The slow part is here:
> If we had say
>
> 4: [2,3,4]
>
> then we'll get
>
> 5: [1,2,3]
>
> osd will realize osd.1 cannot do log recovery and will install a pg_temp so that
We can't serve IO with [1,2,3] because osd.1 is out of date, so there is
a lag until it installs the pg_temp record. There is a pull request that
will preemptively calculate new mappings and set up pg_temp records,
which should mitigate this issue, but it needs some testing, and I think
there is still room for improvement (for example, by serving IO with a
usable but non-optimal pg_temp record while we are waiting for it to be
removed).
See
https://github.com/ceph/ceph/pull/3429
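Conceptually (a rough Python sketch of the idea, not the code in that PR),
pg_temp is just a per-PG override of the CRUSH-calculated mapping, and
precomputing it means the override is already there when the new map
arrives instead of after a round trip:

    # Rough sketch, not the PR: pg_temp overrides the CRUSH ("up") mapping
    # for a PG until the entry is removed.
    crush_up = {"3.5": [1, 2, 3]}    # what CRUSH computes once osd.1 is back
    pg_temp = {}                     # per-PG overrides carried in the OSDMap

    def acting_set(pgid):
        # acting set = pg_temp override if present, else the CRUSH mapping
        return pg_temp.get(pgid, crush_up[pgid])

    # Without an override, acting = [1,2,3], but osd.1 is out of date, so IO
    # stalls until the OSDs ask the monitor for a pg_temp entry:
    print(acting_set("3.5"))         # [1, 2, 3]  (not usable yet)

    # Precomputing the entry removes that lag:
    pg_temp["3.5"] = [2, 3, 4]       # OSDs that actually have current data
    print(acting_set("3.5"))         # [2, 3, 4]  (can serve IO immediately)
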
There's a bunch of similar things we can do to improve peering
latency that we'll want to spend some time on for infernalis.
sage
>
> 6: [2,3,4]
>
> and the PG will go active (serve IO). When backfill completes, it will remove
> the pg_temp and
>
> 7: [1,2,3]
>
> > 3. Will the flow be similar if one of the replica OSDs goes down
> > instead of the primary in the step '2' I mentioned earlier? Say, osd.2
> > went down instead of osd.1?
>
> Yeah, basically the same. Who is primary doesn't really matter.
>
> sage
>
> >
> > Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: Monday, February 23, 2015 1:03 PM
> > To: Somnath Roy
> > Cc: Samuel Just ([email protected]); Ceph Development
> > Subject: Re: Recovery question
> >
> > On Mon, 23 Feb 2015, Somnath Roy wrote:
> > > Hi,
> > > Can anyone help me understand what will happen in the following
> > > scenarios?
> > >
> > > 1. Current PG map : 3.5 -> OSD[1,2,3]
> > >
> > > 2. 1 is down and new map : 3.5 -> OSD[2,3,4]
> >
> > More likely it's:
> >
> > 1: 3.5 -> [1,2,3]
> > 2: 3.5 -> [2,3] (osd.1 is down)
> > 3: 3.5 -> [2,3,4] (osd.1 is marked out)
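Restating those three epochs as data (an illustrative Python sketch using
the values above, not actual Ceph output): "down" only shrinks the acting
set, while "out" makes CRUSH re-map and pull in a replacement.

    # Illustrative only; epoch numbers and sets taken from the steps above.
    epochs = {
        1: [1, 2, 3],   # healthy
        2: [2, 3],      # osd.1 down: degraded, but still serving IO
        3: [2, 3, 4],   # osd.1 marked out: CRUSH re-maps, osd.4 pulled in
    }
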
> >
> > > 3. Need to do backfill recovery for 4, and it started
> >
> > If log recovery works, we'll do that and it's nice and quick. If
> > backfill is needed, we will do
> >
> > 4: 3.5 -> [2,3] (up=[2,3,4]) (pg_temp record added to the map,
> > pointing the PG at the log-recoverable OSDs)
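In other words (a small illustrative sketch, not Ceph's actual data
structures), "up" is what CRUSH computed and "acting" is the pg_temp
override that keeps IO on the OSDs that already have the data:

    # Epoch 4 from above: osd.4 has no data yet, so it backfills while
    # osd.2 and osd.3 keep serving IO.
    epoch4 = {
        "up":     [2, 3, 4],   # CRUSH wants osd.4 in the set
        "acting": [2, 3],      # pg_temp keeps IO on the log-recoverable OSDs
    }
    backfilling = set(epoch4["up"]) - set(epoch4["acting"])   # {4}
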
> >
> > > 4. Meanwhile OSD 1 came up; it was down for a short amount of time
> >
> > 5: 3.5 -> [1,2,3] (osd.1 is back up and in)
> >
> > > 5. Will the pg 3.5 mapping change, considering OSD 1's recovery could
> > > be log based?
> >
> > It will change immediately when osd.1 is back up, regardless of what
> > data is where. If it's log recoverable, then no mapping changes will
> > be needed. If it's not, then
> >
> > 6: 3.5 -> [2,3,4] (up=[1,2,3]) (add pg_temp mapping while we
> > backfill osd.1)
> > 7: 3.5 -> [1,2,3] (pg_temp entry removed when backfill completes)
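As a rough timeline (illustrative sketch only; epochs and sets are the ones
listed above), acting only diverges from up while the pg_temp entry exists:

    # (up set, acting set) per epoch, for the backfill case above.
    timeline = {
        5: ([1, 2, 3], [1, 2, 3]),   # osd.1 back up; done if log recovery works
        6: ([1, 2, 3], [2, 3, 4]),   # pg_temp keeps IO on 2,3,4 while osd.1 backfills
        7: ([1, 2, 3], [1, 2, 3]),   # backfill complete, pg_temp removed
    }
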
> >
> > > 6. Also, if OSD 4's recovery could be log based, will there be any
> > > change in the pg map if OSD 1 is up during the recovery?
> >
> > See above
> >
> > Hope that helps!
> > sage
> >