The real impact of changing min_size to 1 is not whether you might lose
data, but how much. In both cases you will lose some data; the question is
only how much.

Let PG X -> (OSD A, B, C), with size = 3 and min_size = 2.
In your description:

T1: OSD A goes down for the upgrade. The PG is now degraded on (B, C) but
still active, so new data is being written only to B and C.

T2: B goes down due to a disk failure. C is now the only OSD holding the
data written between [T1, T2]. The probability that C fails at this point
is independent of whether we keep writing to it.

If C fails at T3:
    without changing min_size, you lose the data from [T1, T2] and the
cluster is unavailable for writes during [T2, T3];
    with min_size = 1, you lose the data from [T1, T3].
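
To make the comparison concrete, here is a tiny Python sketch with made-up
timestamps (the numbers are hypothetical, not taken from your cluster):

# Hypothetical timeline, in minutes: A down at T1, B down at T2, C dies at T3.
T1, T2, T3 = 0, 10, 12

# Only the writes that existed solely on C are lost when C dies.
lost_with_min_size_2 = T2 - T1      # PG goes inactive at T2, so exposure stops there
blocked_with_min_size_2 = T3 - T2   # but client I/O is blocked from T2 until T3
lost_with_min_size_1 = T3 - T1      # PG stays active on C alone, exposure grows to T3

print(f"min_size=2: ~{lost_with_min_size_2} min of writes lost, "
      f"{blocked_with_min_size_2} min of blocked I/O")
print(f"min_size=1: ~{lost_with_min_size_1} min of writes lost, no blocked I/O")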

But agreed, it is a trade-off; it depends on how confident you are that you
won't have two drive failures in a row within a 15-minute upgrade window...
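
For completeness, this is roughly how I would apply and revert the change
during such a window (a sketch only; the pool name is hypothetical and should
be replaced with the pool(s) backing the inactive PGs):

import subprocess

POOL = "volumes"  # hypothetical pool name, substitute the affected pool

def set_min_size(pool: str, value: int) -> None:
    # "ceph osd pool set <pool> min_size <value>" adjusts min_size at runtime
    subprocess.run(["ceph", "osd", "pool", "set", pool, "min_size", str(value)],
                   check=True)

set_min_size(POOL, 1)   # let the degraded PGs go active on a single replica
# ... finish the maintenance, let the rebooted OSDs rejoin and recovery complete ...
set_min_size(POOL, 2)   # restore the safer setting as soon as possible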

Wido den Hollander <w...@42on.com> wrote on Thu, Jul 25, 2019 at 3:39 PM:

>
>
> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> > We have hit this case in production, but my solution would be to change
> > min_size = 1 immediately so that the PG goes back to active right away.
> >
> > It somewhat trades off reliability (durability) against availability
> > during that 15-minute window, but if you are certain that one of the two
> > "failures" is due to a recoverable issue, it is worth doing.
> >
>
> That's actually dangerous imho.
>
> Because while min_size=1 is set you will be mutating data on that
> single disk/OSD.
>
> If the other two OSDs come back, recovery will start. Now, if that single
> disk/OSD dies while performing the recovery, you have lost data.
>
> The PG (or PGs) becomes inactive and you either need to perform data
> recovery on the failed disk or revert back to the last state.
>
> I can't take that risk in this situation.
>
> Wido
>
> > My 0.02
> >
> > Wido den Hollander <w...@42on.com> wrote on Thu, Jul 25, 2019 at 3:48 AM:
> >
> >
> >
> >     On 7/24/19 9:35 PM, Mark Schouten wrote:
> >     > I’d say the cure is worse than the issue you’re trying to fix, but
> >     > that’s my two cents.
> >     >
> >
> >     I'm not completely happy with it either. Yes, the price goes up and
> >     latency increases as well.
> >
> >     Right now I'm just trying to find a clever solution to this. It's a
> >     2k OSD cluster and the likelihood of a host or OSD crashing is
> >     considerable while you are performing maintenance on a different host.
> >
> >     All kinds of things have crossed my mind, and using size=4 is one
> >     of them.
> >
> >     Wido
> >
> >     > Mark Schouten
> >     >
> >     >> On 24 Jul 2019 at 21:22, Wido den Hollander <w...@42on.com> wrote:
> >     >>
> >     >> Hi,
> >     >>
> >     >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> >     >>
> >     >> The reason I'm asking is that a customer of mine asked me for a
> >     >> solution to prevent a situation which occurred:
> >     >>
> >     >> A cluster running with size=3 and replication over different
> >     >> racks was being upgraded from 13.2.5 to 13.2.6.
> >     >>
> >     >> During the upgrade, which involved patching the OS as well, they
> >     >> rebooted one of the nodes. During that reboot, a node in a
> >     >> different rack suddenly rebooted. It was unclear why this happened,
> >     >> but the node was gone.
> >     >>
> >     >> While the upgraded node was rebooting and the other node had
> >     >> crashed, about 120 PGs were inactive due to min_size=2.
> >     >>
> >     >> Waiting for the nodes to come back and for recovery to finish, it
> >     >> took about 15 minutes before all VMs running inside OpenStack were
> >     >> back again.
> >     >>
> >     >> While you are upgrading or performing any maintenance with size=3,
> >     >> you can't tolerate the failure of a node, as that will cause PGs to
> >     >> go inactive.
> >     >>
> >     >> This made me think about using size=4 and min_size=2 to prevent
> >     >> this situation.
> >     >>
> >     >> This obviously has implications for write latency and cost, but it
> >     >> would prevent such a situation.
> >     >>
> >     >> Is anybody here running a Ceph cluster with size=4 and min_size=2
> >     >> for this reason?
> >     >>
> >     >> Thank you,
> >     >>
> >     >> Wido
> >
>