Yeah, the current situation with recovery and min_size is... unfortunate :(
The reason why min_size = k is bad is simply that it means you are
accepting writes without guaranteeing durability while you are in a
degraded state. A durable storage system should never tell a client
"okay, I've written your data" if losing a single disk afterwards leads
to data loss. Yes, that is the default behavior of traditional RAID 5
and RAID 6 systems during rebuild (after 1 or 2 disk failures
respectively), but that doesn't mean it's a good idea. (A small sketch
of this arithmetic follows after the quoted message below.)

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> This is an issue that comes up every now and then (for example:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html)
> and I would consider it a very serious one (I will give an example
> below). A statement like "min_size = k is unsafe and should never be
> set" deserves a bit more explanation, because Ceph is the only storage
> system I know of for which k+m redundancy does *not* mean "you can lose
> up to m disks and still have read-write access". If this is really true
> then, assuming the same redundancy level, losing service (client
> access) is significantly more likely with Ceph than with other storage
> systems. And this has an impact on design and storage pricing.
>
> However, some help seems to be on the way and an, in my opinion,
> utterly important feature update seems almost finished:
> https://github.com/ceph/ceph/pull/17619 . It will implement the
> following:
>
> - recovery I/O happens as long as k shards are available (this is new)
> - client I/O happens as long as min_size shards are available
> - recommended is min_size=k+1 (this might be wrong)
>
> This is pretty good and much better than the current behaviour (see
> below). This pull request also offers useful further information.
>
> Apparently, there is some kind of rare issue with erasure coding in
> Ceph that makes it problematic to use min_size=k. I couldn't find
> anything better than vague explanations. Quote from the thread above:
> "Recovery on EC pools requires min_size rather than k shards at this
> time. There were reasons; they weren't great."
>
> This is actually a situation I was in. I once lost 2 failure domains
> simultaneously on an 8+2 EC pool and was really surprised that recovery
> stopped after some time with the worst-degraded PGs remaining unfixed.
> I discovered that min_size was 9 (instead of 8) and "ceph health
> detail" recommended reducing min_size. Before doing so, I searched the
> web (I mean, why the default k+1? Come on, there must be a reason.) and
> found some vague hints about problems with min_size=k during rebuild.
> This is a really bad corner to be in. A lot of PGs were already
> critically degraded and the only way forward was to make a bad
> situation worse, because reducing min_size would immediately enable
> client I/O in addition to recovery I/O.
>
> It looks like the default of min_size=k+1 will stay, because min_size=k
> does have some rare issues and these seem not to disappear. (I hope I'm
> wrong though.) Hence, if min_size=k remains problematic, the
> recommendation should be "never use m=1" instead of "never use
> min_size=k". In other words, instead of using a 2+1 EC profile, one
> should use a 4+2 EC profile. If one would like to have secure write
> access under n disk losses, then m>=n+1.
>
> If this issue remains, in my opinion it should be taken up in the best
> practices section. In particular, the documentation should not use
> examples with m=1; this gives the wrong impression. Either min_size=k
> is safe or it is not, and if it is not, it should never be used
> anywhere in the documentation.
>
> I hope I marked my opinions and hypotheses clearly and that the links
> are helpful. If anyone could shed some light on why exactly
> min_size=k+1 is important, I would be grateful.
>
> Best regards,
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
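
To make the k/m/min_size arithmetic from the discussion above concrete, here is
a minimal sketch in plain Python (not Ceph code; the pool geometries in the
examples are just the ones mentioned in this thread). It assumes the gating
rules described in the pull request: recovery needs at least k shards, client
I/O needs at least min_size shards, and an acknowledged write only survives one
further disk loss if more than k shards were available when it was written.

    # Minimal sketch (not Ceph code) of the per-PG gating rules discussed above,
    # assuming the behaviour described in https://github.com/ceph/ceph/pull/17619:
    #   - recovery I/O needs at least k shards available
    #   - client I/O needs at least min_size shards available
    #   - a write accepted now survives one further disk loss only if more
    #     than k shards are currently available.

    def pg_state(k: int, m: int, min_size: int, failed: int) -> dict:
        """Summarise what an EC PG with k+m shards can still do after
        `failed` shards have been lost."""
        available = k + m - failed
        return {
            "available_shards": available,
            "readable": available >= k,                  # data can be reconstructed
            "recovery_possible": available >= k,         # new behaviour from the PR
            "client_io_allowed": available >= min_size,  # gated by min_size
            "writes_survive_one_more_failure": available >= k + 1,
        }

    if __name__ == "__main__":
        # Paul's point: 2+1 with min_size=k=2 and one disk down accepts writes,
        # but a single further failure loses the newly written data.
        print("2+1, min_size=2, 1 failure:",
              pg_state(k=2, m=1, min_size=2, failed=1))

        # Frank's 8+2 incident with min_size=k+1=9: after losing 2 failure
        # domains, client I/O stops; the PR lets recovery continue anyway.
        print("8+2, min_size=9, 2 failures:",
              pg_state(k=8, m=2, min_size=9, failed=2))

        # The m>=n+1 rule: a 4+2 pool with min_size=k+1=5 still accepts writes
        # with one disk down, and those writes remain durable.
        print("4+2, min_size=5, 1 failure:",
              pg_state(k=4, m=2, min_size=5, failed=1))

Running it shows why min_size=k on a 2+1 pool accepts writes that a single
additional failure can destroy, while a 4+2 pool with min_size=k+1=5 keeps
serving durable writes with one disk down, which is exactly the m>=n+1 rule
above.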