Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is simply that it means you are accepting
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, I've written
your data" if losing one more disk leads to data loss.

Yes, that is the default behavior of traditional RAID 5 and RAID 6 systems
during rebuild (i.e. while degraded by one or two disk failures,
respectively), but that doesn't mean it's a good idea.
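
To make the risk concrete, here is a toy sketch (plain Python, nothing
Ceph ships or uses) of how much margin an acknowledged write has left
once a PG is degraded down to min_size shards:

    # Toy model, not Ceph code: margin = how many *additional* disk losses
    # an already-acknowledged write can survive.
    def write_margin(k: int, m: int, min_size: int) -> int:
        # A write is acknowledged once min_size shards are stored; the data
        # stays readable as long as at least k shards survive.
        return min_size - k

    for k, m, min_size in [(8, 2, 8), (8, 2, 9), (4, 2, 5)]:
        print(f"k={k} m={m} min_size={min_size}: acked writes survive "
              f"{write_margin(k, m, min_size)} more failure(s)")

With min_size = k the margin is 0: one more failed disk during the
degraded window loses data the client was already told is safe. With
min_size = k+1 there is always at least one shard of slack.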


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder <fr...@dtu.dk> wrote:

> This is an issue that is coming up every now and then (for example:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and
> I would consider it a very serious one (I will give an example below). A
> statement like "min_size = k is unsafe and should never be set" deserves a
> bit more explanation, because ceph is the only storage system I know of,
> for which k+m redundancy does *not* mean "you can lose up to m disks and
> still have read-write access". If this is really true then, assuming the
> same redundancy level, losing service (client access) is significantly
> more likely with ceph than with other storage systems. And this has an
> impact on design and storage pricing.
>
> However, some help seems to be on the way, and a feature update that is,
> in my opinion, utterly important seems almost finished:
> https://github.com/ceph/ceph/pull/17619 . It will implement the following:
>
> - recovery I/O happens as long as k shards are available (this is new)
> - client I/O will happen as long as min_size shards are available
> - the recommended setting is min_size=k+1 (this might be wrong)
>
> This is pretty good and much better than the current behaviour (see
> below). This pull request also offers useful further information.
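>
> To see what that change means in practice, a toy model (a plain Python
> sketch of the rules listed above, not Ceph internals; the numbers are
> just an 8+2 pool with min_size=9) could look like this:
>
>     # Toy model: what a PG can do for a given number of surviving shards,
>     # before and after the change in that pull request.
>     def pg_can(available: int, k: int, min_size: int,
>                recovery_needs_min_size: bool):
>         recover = available >= (min_size if recovery_needs_min_size else k)
>         client_io = available >= min_size
>         return recover, client_io
>
>     k, min_size = 8, 9
>     for available in (10, 9, 8):
>         old = pg_can(available, k, min_size, recovery_needs_min_size=True)
>         new = pg_can(available, k, min_size, recovery_needs_min_size=False)
>         print(f"{available} shards: old recover={old[0]} io={old[1]}, "
>               f"new recover={new[0]} io={new[1]}")
>
> With only 8 of 10 shards left, the old behaviour can neither recover nor
> serve client I/O; the new behaviour can at least recover, while client I/O
> stays blocked until min_size shards are available again.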
>
> Apparently, there is some kind of rare issue with erasure coding in ceph
> that makes it problematic to use min_size=k. I couldn't find anything
> better than vague explanations. Quote from the thread above: "Recovery on
> EC pools requires min_size rather than k shards at this time. There were
> reasons; they weren't great."
>
> This is actually a situation I was in. I once lost 2 failure domains
> simultaneously on an 8+2 EC pool and was really surprised that recovery
> stopped after some time with the worst degraded PGs remaining unfixed. I
> discovered that min_size was 9 (instead of 8) and "ceph health detail"
> recommended reducing min_size. Before doing so, I searched the web (I
> mean, why the default of k+1? Come on, there must be a reason.) and found
> some vague hints about problems with min_size=k during rebuild. This is a
> really bad corner to be in. A lot of PGs were already critically degraded
> and the only way forward was to make a bad situation worse, because reducing
> min_size would immediately enable client I/O in addition to recovery I/O.
>
> It looks like the default of min_size=k+1 will stay, because min_size=k
> does have some rare issues and these seem not to disappear. (I hope I'm
> wrong though.) Hence, if min_size=k remains problematic, the
> recommendation should be "never use m=1" instead of "never use
> min_size=k". In other words, instead of using a 2+1 EC profile, one should
> use a 4+2 EC profile. If one wants to retain safe write access while up to
> n disks are lost, then m>=n+1 is needed.
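>
> As a quick sanity check of that rule (a plain-Python sketch, not a Ceph
> API): with min_size=k+1, writes keep being accepted after n disk losses
> only if the remaining shards still reach min_size, i.e. m >= n+1:
>
>     def writes_survive(k: int, m: int, n: int) -> bool:
>         min_size = k + 1                   # the recommended setting
>         return (k + m) - n >= min_size     # surviving shards must reach min_size
>
>     print(writes_survive(k=2, m=1, n=1))  # False: 2+1 blocks writes after 1 loss
>     print(writes_survive(k=4, m=2, n=1))  # True:  4+2 still accepts writes
>     print(writes_survive(k=8, m=2, n=2))  # False: 8+2 blocks writes after 2 losses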
>
> If this issue remains, in my opinion this should be taken up in the best
> practices section. In particular, the documentation should not use examples
> with m=1, as this gives the wrong impression. Either min_size=k is safe or
> it is not. If it is not, it should never be used anywhere in the
> documentation.
>
> I hope I marked my opinions and hypotheses clearly and that the links are
> helpful. If anyone could shed some light on why exactly min_size=k+1
> is important, I would be grateful.
>
> Best regards,
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
