Re: [ceph-users] Anybody using 4x (size=4) replication?

Wido den Hollander Thu, 25 Jul 2019 02:01:03 -0700


On 7/25/19 9:56 AM, Xiaoxi Chen wrote:
> The real impact of changing min_size to 1 is not whether you can lose
> data, but how much data you will lose. In both cases you will lose
> some data; the only difference is how much.
> 
> Let PG X -> (osd A, B, C), min_size = 2, size = 3
> In your description,
> 
> T1: OSD A goes down due to the upgrade. The PG is now degraded with
> (B, C). Note that the PG is still active, so new data is written
> only to B and C.
> 
> T2: B goes down due to a disk failure. C is the only OSD holding the
> data written between [T1, T2].
> The failure rate of C in this situation is independent of whether we
> continue writing to C.
> 
> If C fails at T3:
>     without changing min_size, you lose the data from [T1, T2], and
> data is unavailable during [T2, T3]
>     with min_size = 1, you lose the data from [T1, T3]
> 
> But I agree, it is a tradeoff, depending on how confident you are that
> you won't have two drive failures in a row within a 15-minute upgrade
> window...
>


Yes. So with min_size=1 you would need a single disk failure (B in your
example) to lose data.

With min_size=2 you would need both B and C to fail within that window, or
while they are recovering A.

That is much less likely than a single disk failure.
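
To make the comparison concrete, here is a minimal Python sketch. It is not taken from Ceph itself: the function names are hypothetical, and it assumes that OSD failures are independent and share the same per-window failure probability p, which real clusters only approximate. The first function encodes the rule discussed in this thread, that a PG serves I/O only while at least min_size replicas are up. The second compares the chance of one failure with the chance of two failures in the same window.

```python
def pg_accepts_io(size: int, min_size: int, replicas_down: int) -> bool:
    """Return True if a PG with `size` replicas still serves I/O.

    Simplified rule from the thread: the PG stays active only while the
    number of replicas that are up is at least min_size.
    """
    replicas_up = size - replicas_down
    return replicas_up >= min_size


def p_all_fail(p_single: float, n_osds: int) -> float:
    """Probability that n_osds specific OSDs all fail in the same window.

    Assumption (not from the thread): failures are independent and each
    OSD fails with the same probability p_single during the window.
    """
    return p_single ** n_osds


# Hypothetical per-OSD failure probability during the maintenance window.
p = 0.001

# size=3, min_size=2: one node in maintenance plus one crash -> inactive PGs.
size3_two_down = pg_accepts_io(size=3, min_size=2, replicas_down=2)

# size=4, min_size=2: the same two nodes down, PGs stay active.
size4_two_down = pg_accepts_io(size=4, min_size=2, replicas_down=2)

# Following the reasoning above: one further failure is enough to lose data
# with min_size=1, while min_size=2 needs two failures in the same window.
p_loss_min_size_1 = p_all_fail(p, 1)
p_loss_min_size_2 = p_all_fail(p, 2)
```

Under these assumptions the two-failure case is smaller by a factor of p, which is the point of keeping min_size=2. The sketch ignores correlated failures (shared power, a bad firmware batch), which would narrow that gap.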

Wido

> Wido den Hollander <w...@42on.com> wrote on Thu, 25 Jul 2019 at 3:39 PM:
> 
> 
> 
>     On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
>     > We have hit this case in production, but my solution would be to change
>     > min_size to 1 immediately so that the PGs go back to active right away.
>     >
>     > It somewhat trades reliability (durability) for availability during
>     > that 15-minute window, but if you are certain that one of the two
>     > "failures" is due to a recoverable issue, it is worth doing.
>     >
> 
>     That's actually dangerous imho.
> 
>     Because while you set min_size=1 you will be mutating data on that
>     single disk/OSD.
> 
>     If the other two OSDs come back, recovery will start. If that single
>     disk/OSD then dies while recovery is running, you have lost data.
> 
>     The PG (or PGs) becomes inactive and you either need to perform data
>     recovery on the failed disk or revert to the last state.
> 
>     I can't take that risk in this situation.
> 
>     Wido
> 
>     > My 0.02
>     >
>     > Wido den Hollander <w...@42on.com> wrote on Thu, 25 Jul 2019 at 3:48 AM:
>     >
>     >
>     >
>     >     On 7/24/19 9:35 PM, Mark Schouten wrote:
>     >     > I’d say the cure is worse than the issue you’re trying to fix, but
>     >     > that’s my two cents.
>     >     >
>     >
>     >     I'm not completely happy with it either. Yes, the price goes up and
>     >     latency increases as well.
>     >
>     >     Right now I'm just trying to find a clever solution to this. It's a 2k
>     >     OSD cluster and the likelihood of a host or OSD crashing while you
>     >     are performing maintenance on a different host is reasonable.
>     >
>     >     All kinds of things have crossed my mind, and using size=4 is one
>     >     of them.
>     >
>     >     Wido
>     >
>     >     > Mark Schouten
>     >     >
>     >     >> On 24 Jul 2019, at 21:22, Wido den Hollander <w...@42on.com> wrote:
>     >     >>
>     >     >> Hi,
>     >     >>
>     >     >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>     >     >>
>     >     >> The reason I'm asking is that a customer of mine asked me for a
>     >     >> solution to prevent a situation which occurred:
>     >     >>
>     >     >> A cluster running with size=3 and replication over different
>     >     >> racks was being upgraded from 13.2.5 to 13.2.6.
>     >     >>
>     >     >> During the upgrade, which involved patching the OS as well, they
>     >     >> rebooted one of the nodes. During that reboot, a node in a
>     >     >> different rack suddenly rebooted. It was unclear why this
>     >     >> happened, but the node was gone.
>     >     >>
>     >     >> While the upgraded node was rebooting and the other node had
>     >     >> crashed, about 120 PGs were inactive due to min_size=2.
>     >     >>
>     >     >> Waiting for the nodes to come back and for recovery to finish, it
>     >     >> took about 15 minutes before all VMs running inside OpenStack were
>     >     >> back again.
>     >     >>
>     >     >> While you are upgrading or performing any maintenance with size=3,
>     >     >> you can't tolerate the failure of another node, as that will cause
>     >     >> PGs to go inactive.
>     >     >>
>     >     >> This made me think about using size=4 and min_size=2 to prevent
>     >     >> this situation.
>     >     >>
>     >     >> This obviously has implications for write latency and cost, but it
>     >     >> would prevent such a situation.
>     >     >>
>     >     >> Is anybody here running a Ceph cluster with size=4 and min_size=2
>     >     >> for this reason?
>     >     >>
>     >     >> Thank you,
>     >     >>
>     >     >> Wido
>     >     >> _______________________________________________
>     >     >> ceph-users mailing list
>     >     >> ceph-users@lists.ceph.com
>     >     >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
