The real impact of changing min_size to 1 is not whether you lose data, but
how much. In both cases you lose some data; the only question is how much.
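For reference, min_size is a per-pool setting; the change (and the revert
once recovery completes) would look like this, assuming a placeholder pool
name "mypool":

```shell
# Allow degraded PGs to go active with a single replica
# ("mypool" is a placeholder pool name).
ceph osd pool set mypool min_size 1

# ... once the failed OSDs are back and backfill has finished ...
ceph osd pool set mypool min_size 2
```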
Let PG X -> (OSD A, B, C), with min_size = 2 and size = 3.
In your description:
T1: OSD A goes down for the upgrade. The PG is now degraded with (B, C),
but note that it is still active, so new data is written only to B and C.
T2: B goes down due to disk failure. C is now the only OSD holding the
data written during [T1, T2].
The failure probability of C in this situation is independent of whether we
continue writing to C.
If C fails at T3:
without changing min_size, you lose the data written in [T1, T2], and the
data is unavailable during [T2, T3];
with min_size = 1, you lose the data written in [T1, T3].
But agreed, it is a tradeoff; it depends on how confident you are that you
won't have two drive failures in a row within a 15-minute upgrade window...
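The timeline above can be sketched as a toy model (the timestamps below are
hypothetical minutes since the upgrade started, not from the original report):

```python
# Toy model of the two policies: A fails at T1, B at T2, C at T3.
T1, T2, T3 = 0, 5, 12  # hypothetical minutes

def outcome(min_size_dropped_to_1: bool):
    """Return (minutes of writes lost, minutes of unavailability)
    for the worst case where C also fails at T3."""
    if min_size_dropped_to_1:
        # PG stays active on C alone; every write in [T1, T3]
        # lives only on C and is lost when C dies.
        return (T3 - T1, 0)
    # PG goes inactive at T2: writes from [T1, T2] exist only on C
    # (lost), and the PG accepts no I/O during [T2, T3]
    # (unavailable, but not lost).
    return (T2 - T1, T3 - T2)

print(outcome(True))   # min_size dropped to 1
print(outcome(False))  # min_size left at 2
```

Either way the window [T1, T2] is gone once C dies; the policies only differ
in whether [T2, T3] shows up as lost writes or as downtime.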
Wido den Hollander <[email protected]> wrote on Thu, Jul 25, 2019 at 3:39 PM:
>
>
> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> > We hit this case in production, but my solution was to change
> > min_size = 1 immediately so that the PG went active again right away.
> >
> > It somewhat trades off reliability (durability) against availability
> > during that 15-minute window, but if you are certain one of the two
> > "failures" is due to a recoverable issue, it is worth doing.
> >
>
> That's actually dangerous imho.
>
> Because while min_size=1 is set, you will be mutating data on that
> single disk/OSD.
>
> If the other two OSDs come back, recovery will start. If that single
> disk/OSD then dies while recovery is in progress, you have lost data.
>
> The PG (or PGs) becomes inactive and you either need to perform data
> recovery on the failed disk or revert to the last state.
>
> I can't take that risk in this situation.
>
> Wido
>
> > My 0.02
> >
> > Wido den Hollander <[email protected] <mailto:[email protected]>> wrote on Thu,
> > Jul 25, 2019 at 3:48 AM:
> >
> >
> >
> > On 7/24/19 9:35 PM, Mark Schouten wrote:
> > > I’d say the cure is worse than the issue you’re trying to fix, but
> > that’s my two cents.
> > >
> >
> > I'm not completely happy with it either. Yes, the price goes up and
> > latency increases as well.
> >
> >     Right now I'm just trying to find a clever solution to this. It's a 2k
> >     OSD cluster, and the likelihood of a host or OSD crashing is reasonable
> >     while you are performing maintenance on a different host.
> >
> >     All kinds of things have crossed my mind, and using size=4 is one
> >     of them.
> >
> > Wido
> >
> > > Mark Schouten
> > >
> >     >> On Jul 24, 2019 at 21:22, Wido den Hollander <[email protected]
> >     <mailto:[email protected]>> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> > >>
> >     >> The reason I'm asking is that a customer of mine asked me for a
> >     >> solution to prevent a situation which occurred:
> > >>
> >     >> A cluster running with size=3 and replication over different
> >     >> racks was being upgraded from 13.2.5 to 13.2.6.
> > >>
> >     >> During the upgrade, which involved patching the OS as well, they
> >     >> rebooted one of the nodes. During that reboot, a node in a
> >     >> different rack suddenly rebooted. It was unclear why this happened,
> >     >> but the node was gone.
> > >>
> >     >> While the upgraded node was rebooting and the other node was down,
> >     >> about 120 PGs were inactive due to min_size=2.
> > >>
> >     >> Waiting for the nodes to come back and recovery to finish, it took
> >     >> about 15 minutes before all VMs running inside OpenStack were back
> >     >> again.
> > >>
> >     >> While you are upgrading or performing any maintenance with size=3,
> >     >> you can't tolerate the failure of a node, as that will cause PGs to
> >     >> go inactive.
> > >>
> >     >> This made me think about using size=4 and min_size=2 to prevent
> >     >> this situation.
> > >>
> >     >> This obviously has implications for write latency and cost, but it
> >     >> would prevent such a situation.
> > >>
> >     >> Is anybody here running a Ceph cluster with size=4 and min_size=2
> >     >> for this reason?
> > >>
> > >> Thank you,
> > >>
> > >> Wido
> > >> _______________________________________________
> > >> ceph-users mailing list
> > >> [email protected] <mailto:[email protected]>
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>