The real impact of changing min_size to 1 is not about the possibility of losing data, but about how much data you will lose. In both cases you will lose some data; the only question is how much.
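To make that tradeoff concrete, here is a toy calculation of the two loss windows for the scenario discussed in this thread. The timestamps and the function are made up for illustration only; this is not a Ceph API, just arithmetic over the T1/T2/T3 timeline:

```python
# Toy model of the data-loss windows (illustrative only, not a Ceph API).
# T1: OSD A goes down for the upgrade; the PG keeps writing to B and C.
# T2: OSD B dies (disk failure); C is the last copy of data from [T1, T2].
# T3: OSD C also dies before recovery completes.
T1, T2, T3 = 0, 10, 15  # minutes, hypothetical

def loss_window(min_size):
    """Return (minutes of data lost, minutes writes were refused) if C dies at T3."""
    if min_size >= 2:
        # The PG goes inactive at T2: data written in [T1, T2] is lost with C,
        # and no writes are accepted during [T2, T3].
        return (T2 - T1, T3 - T2)
    # min_size = 1: the PG stays active on C alone, so everything written
    # in [T1, T3] is lost with C; nothing was refused.
    return (T3 - T1, 0)

print(loss_window(2))  # -> (10, 5): 10 min of data lost, 5 min unavailable
print(loss_window(1))  # -> (15, 0): 15 min of data lost, no downtime
```

Either way C's death costs data; min_size=1 simply converts the unavailability window into additional lost writes.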
Let PG X -> (osd A, B, C), min_size = 2, size = 3.

In your description: at T1, OSD A goes down due to the upgrade, and the PG is now in degraded mode with (B, C). Note that the PG is still active, so some data is written only to B and C. At T2, B goes down due to a disk failure. C is now the only replica holding the data written during [T1, T2].

The failure rate of C in this situation is independent of whether we continue writing to C. If C fails at T3:

- without changing min_size, you lose the data from [T1, T2], and data is also unavailable during [T2, T3];
- with min_size = 1, you lose the data from [T1, T3].

But agreed, it is a tradeoff, depending on how confident you are that you won't have two drive failures in a row within a 15-minute upgrade window...

Wido den Hollander <w...@42on.com> wrote on Thu, Jul 25, 2019 at 3:39 PM:
>
>
> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> > We had hit this case in production, but my solution would be to change
> > min_size = 1 immediately so that the PG goes back to active right away.
> >
> > It somewhat trades off reliability (durability) for availability during
> > that 15-minute window, but if you are certain that one of the two
> > "failures" is due to a recoverable issue, it is worth doing.
> >
>
> That's actually dangerous, IMHO.
>
> Because while you set min_size=1 you will be mutating data on that
> single disk/OSD.
>
> If the other two OSDs come back, recovery will start. Now, IF that single
> disk/OSD dies while performing the recovery, you have lost data.
>
> The PG (or PGs) becomes inactive and you either need to perform data
> recovery on the failed disk or revert back to the last state.
>
> I can't take that risk in this situation.
>
> Wido
>
> > My 0.02
> >
> > Wido den Hollander <w...@42on.com <mailto:w...@42on.com>> wrote on
> > Thu, Jul 25, 2019 at 3:48 AM:
> >
> >
> >     On 7/24/19 9:35 PM, Mark Schouten wrote:
> >     > I'd say the cure is worse than the issue you're trying to fix, but
> >     that's my two cents.
> >     >
> >
> >     I'm not completely happy with it either.
> >     Yes, the price goes up and
> >     latency increases as well.
> >
> >     Right now I'm just trying to find a clever solution to this. It's a 2k
> >     OSD cluster, and the likelihood of a host or OSD crashing is reasonable
> >     while you are performing maintenance on a different host.
> >
> >     All kinds of things have crossed my mind, where using size=4 is one
> >     of them.
> >
> >     Wido
> >
> >     > Mark Schouten
> >     >
> >     >> On 24 Jul 2019 at 21:22, Wido den Hollander <w...@42on.com
> >     <mailto:w...@42on.com>> wrote:
> >     >>
> >     >> Hi,
> >     >>
> >     >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> >     >>
> >     >> The reason I'm asking is that a customer of mine asked me for a
> >     solution
> >     >> to prevent a situation which occurred:
> >     >>
> >     >> A cluster running with size=3 and replication over different
> >     racks was
> >     >> being upgraded from 13.2.5 to 13.2.6.
> >     >>
> >     >> During the upgrade, which involved patching the OS as well, they
> >     >> rebooted one of the nodes. During that reboot, a node in a
> >     >> different rack suddenly rebooted as well. It was unclear why this
> >     happened, but the node
> >     >> was gone.
> >     >>
> >     >> While the upgraded node was rebooting and the other node crashed,
> >     about
> >     >> 120 PGs were inactive due to min_size=2.
> >     >>
> >     >> Waiting for the nodes to come back and recovery to finish, it took
> >     about 15
> >     >> minutes before all VMs running inside OpenStack were back again.
> >     >>
> >     >> While you are upgrading or performing any maintenance with size=3,
> >     you can't
> >     >> tolerate the failure of a node, as that will cause PGs to go
> >     inactive.
> >     >>
> >     >> This made me think about using size=4 and min_size=2 to prevent
> >     this
> >     >> situation.
> >     >>
> >     >> This obviously has implications on write latency and cost, but it
> >     would
> >     >> prevent such a situation.
> >     >>
> >     >> Is anybody here running a Ceph cluster with size=4 and min_size=2
> >     for
> >     >> this reason?
> >     >>
> >     >> Thank you,
> >     >>
> >     >> Wido
> >     >> _______________________________________________
> >     >> ceph-users mailing list
> >     >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> >     >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
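The size=4/min_size=2 proposal discussed above can be sketched with a toy activity predicate. This is a deliberate simplification of Ceph's actual peering logic (the function name and the idea of counting only up replicas are illustrative assumptions), but it shows why one extra replica restores the failure margin during maintenance:

```python
# Toy predicate (not Ceph's real peering logic): a PG serves I/O only
# while the number of up replicas is at least min_size.
def pg_active(size, min_size, replicas_down):
    return (size - replicas_down) >= min_size

# size=3, min_size=2: taking one node down for maintenance leaves zero margin.
print(pg_active(3, 2, replicas_down=1))  # True  (degraded but active)
print(pg_active(3, 2, replicas_down=2))  # False (PG inactive, I/O blocks)

# size=4, min_size=2: one unexpected failure during maintenance is survivable.
print(pg_active(4, 2, replicas_down=2))  # True
```

This is exactly the scenario from the thread: with size=3 the planned reboot plus one surprise crash crosses min_size, while size=4 buys that second failure at the cost of extra capacity and write latency.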