I disagree that replica 2 will ever truly be sane if you care about your
data.  The biggest issue with replica 2 has nothing to do with drive
failures, restarting OSDs/nodes, power outages, etc.  The biggest issue
with replica 2 is min_size.  If you set min_size to 2, then I/O to a PG
blocks whenever either copy of the data is unavailable, so the cluster
stalls on any single failure.  That's fine, since you were probably going
to set min_size to 1 anyway... which you should never do, ever, unless
you don't care about your data.
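
For concreteness, these are the standard pool knobs in question (the pool
name here is just a placeholder):

    # keep 2 copies of every object; I/O to a PG blocks whenever fewer
    # than min_size copies are currently available
    ceph osd pool set mypool size 2
    ceph osd pool set mypool min_size 2   # safe, but stalls on any single failure
    ceph osd pool set mypool min_size 1   # the dangerous setting discussed above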

Too many pronouns, so let's say disk 1 and disk 2 are in charge of a PG
and are the only 2 disks with a copy of the data.

The problem with a min_size of 1 is this: for whatever reason, disk 1
becomes inaccessible, and a write is made to disk 2.  Then, before disk 1
is fully backfilled and caught up on all of the writes, disk 2 goes down.
Now your data is inaccessible, but that's not the real issue.  The issue
is when disk 1 comes back up first and the client tries to read the data
it wrote earlier to disk 2... except the data isn't there.  The client is
probably just logging an error somewhere and continuing.  Now it makes
some writes to disk 1 before disk 2 finishes coming back up.  What can
these 2 disks possibly do to ensure that your data is consistent once
both of them are back up?  Nothing: each now holds writes the other has
never seen.
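
For what it's worth, here is roughly what that scenario looks like from
the operator's side (a sketch; the pg and osd ids are made up):

    ceph health detail    # PGs stuck "incomplete"/"down", peering blocked
    ceph pg 3.1f query    # shows peering is blocked because the only
                          # authoritative copy of some writes is gone
    # the only way to make the PG go active again throws away whichever
    # writes existed solely on the lost copy:
    ceph osd lost 12 --yes-i-really-mean-it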

Now of course we reach THE QUESTION... how likely is this to ever happen,
and what sort of things could cause it, if not disk failures or
maintenance on your cluster?  The answer: it's more common than you'd
like to think.  Does your environment have enough RAM in your OSD nodes
to handle recovery without cycling into an OOM-killer scenario?  Will you
ever hit a bug in the code that causes an operation on a PG to segfault
an OSD?  Both of those have happened to multiple clusters I've managed,
and to others I've read about on the mailing list in the last year.  A
min_size of 1 would very likely lead to data loss in either situation,
regardless of power failures and disk failures.
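
If you do find yourself fighting recovery-induced memory pressure, the
usual first step is to throttle recovery and backfill (the values below
are illustrative, not recommendations):

    # fewer concurrent backfill/recovery ops per OSD means less memory
    # pressure during recovery, at the cost of a longer recovery window
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'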

Now let's return to disk failures.  While backfilling due to adding
storage, removing storage, or just rebalancing your cluster, you are much
more likely to lose drives.  During normal operation in a cluster of
2000+ OSDs, I would lose about 6 drives in a year.  During backfilling
(especially when adding multiple storage nodes), I would lose closer to
1-3 drives per major backfilling operation.
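
One partial mitigation (it doesn't eliminate the risk, just spreads it
out) is to bring new OSDs in gradually rather than at full weight; the
osd id and weights here are arbitrary:

    # step the crush weight toward its final value in stages,
    # letting backfill settle between steps
    ceph osd crush reweight osd.200 0.5
    ceph osd crush reweight osd.200 1.0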

People keep asking about 2 replicas.  People keep saying it's going to be
viable with BlueStore.  I care about my data too much to ever consider
it.  If I were running a cluster where data loss was acceptable, then I
would absolutely consider it.  If you're after 5 nines of uptime, then 2
replicas will achieve that.  If you're talking about 100% data integrity,
then 2 replicas are not, AND WILL NEVER BE, for you (no matter what the
release docs say about BlueStore).  If space is your concern, start
looking into erasure coding.  You can save more space and increase
redundancy for the cost of some performance.
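
For example, a k=4 m=2 profile stores 1.5x the raw data and survives two
simultaneous failures, whereas replica 2 stores 2x and survives one.
Something like this (profile and pool names are just examples; pg counts
depend on your cluster):

    # 4 data chunks + 2 coding chunks per object
    ceph osd erasure-code-profile set ec42 k=4 m=2
    ceph osd pool create ecpool 128 128 erasure ec42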

On Wed, Jun 21, 2017 at 10:56 AM <c...@jack.fr.eu.org> wrote:

> 2r on filestore == "I do not care about my data"
>
> This is not because of OSD's failure chance
>
> When you have a write error (ie data is badly written on the disk,
> without error reported), your data is just corrupted without hope of
> redemption
>
> Just as you expect your drives to die, expect your drives to "fail
> silently"
>
> With replica 3 and beyond, data CAN be repaired using quorum
>
> Replica 2 will become sane in the next release, with bluestore, which
> uses data checksums
>
> On 21/06/2017 16:51, Blair Bethwaite wrote:
> > Hi all,
> >
> > I'm doing some work to evaluate the risks involved in running 2r storage
> > pools. On the face of it my naive disk failure calculations give me 4-5
> > nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary
> > disk failure based purely on chance of any 1 of the remaining 99 OSDs
> > failing within recovery time). 5 nines is just fine for our purposes,
> > but of course multiple disk failures are only part of the story.
> >
> > The more problematic issue with 2r clusters is that any time you do
> > planned maintenance (our clusters spend much more time degraded because
> > of regular upkeep than because of real failures) you're suddenly
> > drastically increasing the risk of data-loss. So I find myself wondering
> > if there is a way to tell Ceph I want an extra replica created for a
> > particular PG or set thereof, e.g., something that would enable the
> > functional equivalent of: "this OSD/node is going to go offline so
> > please create a 3rd replica in every PG it is participating in before we
> > shutdown that/those OSD/s"...?