Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
I still don't understand why I get any clean PGs in the erasure-coded pool, when with two OSDs down there is no more redundancy, and therefore all PGs should be undersized (or so I think). I repeated the experiment by bringing the two remaining OSDs back online, and then killing them again, and got results similar to the previous test. But this time I observed the process more closely.

Example showing state changes for one of the PGs [OSD assignment in brackets]:
When all OSDs were online: [3,2,4,0,1], state: active+clean
Initially after OSDs 3 and 4 were killed: [x,2,x,0,1], state: active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,2,0,0,1], state: active+clean
('x' means that some large number was listed; I assume this meant the original OSD was unavailable)

Another PG:
When all OSDs were online: [0,3,2,1,4], state: active+clean
Initially after OSDs 3 and 4 were killed: [0,x,2,1,x], state: active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,1,2,1,1], state: active+clean+remapped
Note: this PG became remapped, the previous one did not.

Does this mean that these PGs now have 5 chunks, of which 3 are stored on one OSD? Perhaps I am missing something, but could this arrangement be redundant? And how can a non-redundant state be considered clean?

By the way, I am using crush-failure-domain=host, and I have one OSD per host.

On the good side, I have no complaints about how the replicated metadata pool operates. Unfortunately, I will not be able to replicate data in my future production cluster.

One more thing: I figured out that "degraded" means "undersized and contains data".

Thanks

Maciej Puzio

On Wed, May 9, 2018 at 7:07 PM, Gregory Farnum wrote:
> On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio wrote:
>> My setup consists of two pools on 5 OSDs, and is intended for CephFS:
>> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally 4), number of PGs=128
>> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>>
>> When all OSDs were online, all PGs from both pools had status active+clean. After killing two of five OSDs (and changing min_size to 3), all metadata pool PGs remained active+clean, and of the 128 data pool PGs, 3 remained active+clean, 11 became active+clean+remapped, and the rest became active+undersized, active+undersized+remapped, active+undersized+degraded or active+undersized+degraded+remapped, seemingly at random.
>>
>> After some time one of the remaining three OSD nodes lost network connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup sure is becoming a bug motherlode!). The node was rebooted, the ceph cluster became accessible again (with 3 out of 5 OSDs online, as before), and the three active+clean data pool PGs now became active+clean+remapped, while the rest of the PGs seem to have kept their previous status.
>
> That collection makes sense if you have a replicated pool as well as an EC one, then. They represent different states for the PG; see http://docs.ceph.com/docs/jewel/rados/operations/pg-states/. They're not random; rather, the collection of which PG is in which set of states is determined by how CRUSH placement and the failures interact, and CRUSH is a pseudo-random algorithm, so... ;)
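For reference, the per-PG assignments and states quoted above can be pulled straight from the ceph CLI. A minimal sketch; the pool name (ecpool) and PG id (2.1a) are placeholders, not values from this cluster:

    # list every PG in the data pool with its state and its up/acting OSD sets
    ceph pg ls-by-pool ecpool

    # map one PG to its current up and acting sets
    ceph pg map 2.1a

    # full per-PG JSON, e.g. to check whether the acting set repeats an OSD id
    ceph pg 2.1a query | grep -E '"(state|up|acting)"'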
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio wrote:
> My setup consists of two pools on 5 OSDs, and is intended for CephFS:
> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally 4), number of PGs=128
> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>
> When all OSDs were online, all PGs from both pools had status active+clean. After killing two of five OSDs (and changing min_size to 3), all metadata pool PGs remained active+clean, and of the 128 data pool PGs, 3 remained active+clean, 11 became active+clean+remapped, and the rest became active+undersized, active+undersized+remapped, active+undersized+degraded or active+undersized+degraded+remapped, seemingly at random.
>
> After some time one of the remaining three OSD nodes lost network connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup sure is becoming a bug motherlode!). The node was rebooted, the ceph cluster became accessible again (with 3 out of 5 OSDs online, as before), and the three active+clean data pool PGs now became active+clean+remapped, while the rest of the PGs seem to have kept their previous status.

That collection makes sense if you have a replicated pool as well as an EC one, then. They represent different states for the PG; see http://docs.ceph.com/docs/jewel/rados/operations/pg-states/. They're not random; rather, the collection of which PG is in which set of states is determined by how CRUSH placement and the failures interact, and CRUSH is a pseudo-random algorithm, so... ;)
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
My setup consists of two pools on 5 OSDs, and is intended for CephFS:
1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally 4), number of PGs=128
2. replicated metadata pool: size=3, min_size=2, number of PGs=100

When all OSDs were online, all PGs from both pools had status active+clean. After killing two of five OSDs (and changing min_size to 3), all metadata pool PGs remained active+clean, and of the 128 data pool PGs, 3 remained active+clean, 11 became active+clean+remapped, and the rest became active+undersized, active+undersized+remapped, active+undersized+degraded or active+undersized+degraded+remapped, seemingly at random.

After some time one of the remaining three OSD nodes lost network connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup sure is becoming a bug motherlode!). The node was rebooted, the ceph cluster became accessible again (with 3 out of 5 OSDs online, as before), and the three active+clean data pool PGs now became active+clean+remapped, while the rest of the PGs seem to have kept their previous status.

Thanks

Maciej Puzio

On Wed, May 9, 2018 at 4:49 PM, Gregory Farnum wrote:
> active+clean does not make a lot of sense if every PG really was 3+2. But perhaps you had a 3x replicated pool or something hanging out as well from your deployment tool?
> The active+clean+remapped means that a PG was somehow lucky enough to have an existing "stray" copy on one of the OSDs that it has decided to use to bring it back up to the right number of copies, even though they certainly won't match the proper failure domains.
> The min_size in relation to the k+m values won't have any direct impact here, although they might indirectly affect it by changing how quickly stray PGs get deleted.
> -Greg
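A quick way to see how the 128 data pool PGs described above are spread across these states is to tally the state column of ceph pg dump. A sketch only; it assumes the pgs_brief output format, where the state is the second column:

    # count PGs per state
    ceph pg dump pgs_brief 2>/dev/null | grep -v '^PG_STAT' | awk '{print $2}' | sort | uniq -c | sort -rn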
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
On Tue, May 8, 2018 at 2:16 PM Maciej Puzio wrote:
> Thank you everyone for your replies. However, I feel that at least part of the discussion deviated from the topic of my original post. As I wrote before, I am dealing with a toy cluster, whose purpose is not to provide resilient storage, but to evaluate ceph and its behavior in the event of a failure, with particular attention paid to worst-case scenarios. This cluster is purposely minimal, and is built on VMs running on my workstation, with all OSDs storing data on a single SSD. That's definitely not a production system.
>
> I am not asking for advice on how to build resilient clusters, not at this point. I asked some questions about specific things that I noticed during my tests, and that I was not able to find explained in the ceph documentation.
>
> Dan van der Ster wrote:
>> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size defaults to k+1 on ec pools.
>
> That's a good point, but I am wondering why reads are also blocked when the number of OSDs falls to k. What if the total number of OSDs in a pool (n) is larger than k+m; should the min_size then be k(+1) or n-m(+1)? In any case, since min_size can be easily changed, I guess this is not an implementation issue, but rather a documentation issue.
>
> Which leaves these questions of mine still unanswered:
> After killing m OSDs and setting min_size=k, most PGs were now active+undersized, often with ...+degraded and/or remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?).
> Is this mishmash of PG states normal? If not, would I have avoided it if I had created the pool with min_size=k=3 from the start? In other words, does min_size influence the assignment of PGs to OSDs? Or is it only used to force I/O shutdown in the event of OSD failures?

active+clean does not make a lot of sense if every PG really was 3+2. But perhaps you had a 3x replicated pool or something hanging out as well from your deployment tool?

The active+clean+remapped means that a PG was somehow lucky enough to have an existing "stray" copy on one of the OSDs that it has decided to use to bring it back up to the right number of copies, even though they certainly won't match the proper failure domains.

The min_size in relation to the k+m values won't have any direct impact here, although they might indirectly affect it by changing how quickly stray PGs get deleted.
-Greg

> Thank you very much
>
> Maciej Puzio
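One way to spot the PGs that are being served from such stray copies is to compare each PG's "up" set (where CRUSH wants the chunks) with its "acting" set (where they are actually served from); a PG is remapped exactly when the two differ. A sketch, again with a placeholder pool name and PG id:

    # list only the remapped PGs of the data pool
    ceph pg ls-by-pool ecpool remapped

    # for a single PG, compare the up and acting sets printed by pg map
    ceph pg map 2.1a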
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
Thank you everyone for your replies. However, I feel that at least part of the discussion deviated from the topic of my original post. As I wrote before, I am dealing with a toy cluster, whose purpose is not to provide resilient storage, but to evaluate ceph and its behavior in the event of a failure, with particular attention paid to worst-case scenarios. This cluster is purposely minimal, and is built on VMs running on my workstation, with all OSDs storing data on a single SSD. That's definitely not a production system.

I am not asking for advice on how to build resilient clusters, not at this point. I asked some questions about specific things that I noticed during my tests, and that I was not able to find explained in the ceph documentation.

Dan van der Ster wrote:
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size defaults to k+1 on ec pools.

That's a good point, but I am wondering why reads are also blocked when the number of OSDs falls to k. What if the total number of OSDs in a pool (n) is larger than k+m; should the min_size then be k(+1) or n-m(+1)? In any case, since min_size can be easily changed, I guess this is not an implementation issue, but rather a documentation issue.

Which leaves these questions of mine still unanswered:
After killing m OSDs and setting min_size=k, most PGs were now active+undersized, often with ...+degraded and/or remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?).
Is this mishmash of PG states normal? If not, would I have avoided it if I had created the pool with min_size=k=3 from the start? In other words, does min_size influence the assignment of PGs to OSDs? Or is it only used to force I/O shutdown in the event of OSD failures?

Thank you very much

Maciej Puzio
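For completeness, the min_size being discussed is a single per-pool setting; a minimal sketch of checking and lowering it, assuming a data pool named ecpool with k=3:

    # show the current value (k+1 by default for EC pools)
    ceph osd pool get ecpool min_size

    # lower it to k, so the pool keeps serving I/O with only k shards left
    # (this trades away the safety margin the k+1 default is meant to provide)
    ceph osd pool set ecpool min_size 3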
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
On Tue, May 8, 2018 at 12:07 PM, Dan van der Ster wrote:
> On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni wrote:
>> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio wrote:
>>> I am an admin in a research lab looking for a cluster storage solution, and a newbie to ceph. I have set up a mini toy cluster on some VMs, to familiarize myself with ceph and to test failure scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.
>>>
>>> I then killed two OSD VMs with the expectation that the data pool would still be available, even if in a degraded state, but I found that this was not the case, and that the pool became inaccessible for reading and writing. I listed PGs (ceph pg ls) and found the majority of PGs in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.
>>>
>>> This is what I found in the ceph documentation in relation to min_size and the resiliency of erasure-coded pools:
>>>
>>> 1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size are for replicated pools only.
>>> 2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels, by default?
>>> 3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1, appears not to apply to erasure-coded pools. But in fact it does, and the default min_size != k causes surprising behavior.
>>> 4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one confused.
>>
>> You bring up a good inconsistency that needs to be addressed. AFAIK, only the m value is important for ec pools; I am not sure if the *replicated* metadata pool is somehow causing the min_size variance in your experiment to work. When we create a replicated pool it has an option for min size, and for an ec pool it is the m value.
>
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size defaults to k+1 on ec pools.

So this looks like it is happening by default per ec pool, unless the user changes the pool min_size. Probably this should be left unchanged and we could document it? It is a bit confusing with coding chunks.

> Cheers, Dan
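The k+1 default is easy to confirm on a freshly created EC pool; a sketch with a made-up pool name:

    # for a k=3, m=2 pool this prints size 5 and, on this ceph version, min_size 4
    ceph osd pool ls detail | grep ecpool
    ceph osd pool get ecpool min_size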
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni wrote:
> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio wrote:
>> I am an admin in a research lab looking for a cluster storage solution, and a newbie to ceph. I have set up a mini toy cluster on some VMs, to familiarize myself with ceph and to test failure scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.
>>
>> I then killed two OSD VMs with the expectation that the data pool would still be available, even if in a degraded state, but I found that this was not the case, and that the pool became inaccessible for reading and writing. I listed PGs (ceph pg ls) and found the majority of PGs in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.
>>
>> This is what I found in the ceph documentation in relation to min_size and the resiliency of erasure-coded pools:
>>
>> 1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size are for replicated pools only.
>> 2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels, by default?
>> 3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1, appears not to apply to erasure-coded pools. But in fact it does, and the default min_size != k causes surprising behavior.
>> 4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one confused.
>
> You bring up a good inconsistency that needs to be addressed. AFAIK, only the m value is important for ec pools; I am not sure if the *replicated* metadata pool is somehow causing the min_size variance in your experiment to work. When we create a replicated pool it has an option for min size, and for an ec pool it is the m value.

See https://github.com/ceph/ceph/pull/8008 for the reason why min_size defaults to k+1 on ec pools.

Cheers, Dan
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio wrote:
> I am an admin in a research lab looking for a cluster storage solution, and a newbie to ceph. I have set up a mini toy cluster on some VMs, to familiarize myself with ceph and to test failure scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.
>
> I then killed two OSD VMs with the expectation that the data pool would still be available, even if in a degraded state, but I found that this was not the case, and that the pool became inaccessible for reading and writing. I listed PGs (ceph pg ls) and found the majority of PGs in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.
>
> This is what I found in the ceph documentation in relation to min_size and the resiliency of erasure-coded pools:
>
> 1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size are for replicated pools only.
> 2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels, by default?
> 3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1, appears not to apply to erasure-coded pools. But in fact it does, and the default min_size != k causes surprising behavior.
> 4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one confused.

You bring up a good inconsistency that needs to be addressed. AFAIK, only the m value is important for ec pools; I am not sure if the *replicated* metadata pool is somehow causing the min_size variance in your experiment to work. When we create a replicated pool it has an option for min size, and for an ec pool it is the m value.

> I followed the advice in #4 and reduced min_size to 3. Lo and behold, the pool became accessible, and I could read the data previously stored, and write new data. This appears to contradict #1, but at least it works. A look at ceph pg ls revealed another mystery, though. Most of the PGs were now active+undersized, often with ...+degraded and/or remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?)
>
> I apologize if this behavior turns out to be expected and straightforward to experienced ceph users, or if I missed some documentation that explains this clearly. My goal is to put about 500 TB on ceph or another cluster storage system, and I find these issues confusing and worrisome. Helpful and competent replies will be much appreciated. Please note that my questions are about erasure-coded pools, and not about replicated pools.
> Thank you
>
> Maciej Puzio
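For anyone who wants to reproduce the setup described in the quoted post, it can be created along these lines. This is only a sketch: the profile, pool and filesystem names are made up, the PG counts follow the numbers given later in the thread, and everything else is left at defaults:

    # 3+2 erasure code profile; the thread later mentions crush-failure-domain=host
    ceph osd erasure-code-profile set ec32 k=3 m=2 crush-failure-domain=host

    # erasure-coded data pool and replicated metadata pool
    ceph osd pool create cephfs_data 128 128 erasure ec32
    ceph osd pool create cephfs_metadata 100 100 replicated

    # CephFS on an EC data pool needs overwrites enabled (bluestore OSDs)
    ceph osd pool set cephfs_data allow_ec_overwrites true

    # create the filesystem (some releases require --force when the default
    # data pool is erasure-coded)
    ceph fs new cephfs cephfs_metadata cephfs_data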
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
You talked about "using default settings wherever possible"... Well, Ceph's default, everywhere it applies, is to not allow writes unless you still have at least one more copy that you could lose without data loss. If your bosses require you to be able to lose 2 servers and still serve customers, then tell them that Ceph requires you to have 3 parity copies of the data.

Why do you want to change your one and only copy of the data while you already have a degraded system? And not just a degraded system, but a system where 2/5 of your servers are down... That sounds awful, terrible, and just plain bad.

To directly answer your question about min_size: min_size does not affect where data is placed. It only affects when a PG claims not to have enough copies online to be able to receive read or write requests.

On Tue, May 8, 2018 at 7:47 AM Janne Johansson wrote:
> 2018-05-08 1:46 GMT+02:00 Maciej Puzio:
>> Paul, many thanks for your reply.
>> Thinking about it, I can't decide if I'd prefer to operate the storage server without redundancy, or have it automatically force a downtime, subjecting me to the rage of my users and my boss.
>> But I think that the typical expectation is that the system serves the data while it is able to do so.
>
> If you want to prevent angry bosses, you would have made 10 OSD hosts or some other large number, so that ceph could place PGs over more places, so that 2 lost hosts would not impact so much, but also so it can recover each PG onto one of the 10 (minus the two broken ones, minus the three that already hold the data you want to spread out) other OSDs and get back into full service even with two lost hosts.
>
> It's fun to test assumptions and "how low can I go", but if you REALLY wanted a cluster with resilience to planned and unplanned maintenance, you would have redundancy, just like that RAID 6 disk box would presumably have a fair amount of hot and perhaps cold spares nearby to kick in if lots of disks started going missing.
>
> --
> May the most significant bit of your life be positive.
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
2018-05-08 1:46 GMT+02:00 Maciej Puzio:
> Paul, many thanks for your reply.
> Thinking about it, I can't decide if I'd prefer to operate the storage server without redundancy, or have it automatically force a downtime, subjecting me to the rage of my users and my boss.
> But I think that the typical expectation is that the system serves the data while it is able to do so.

If you want to prevent angry bosses, you would have made 10 OSD hosts or some other large number, so that ceph could place PGs over more places, so that 2 lost hosts would not impact so much, but also so it can recover each PG onto one of the 10 (minus the two broken ones, minus the three that already hold the data you want to spread out) other OSDs and get back into full service even with two lost hosts.

It's fun to test assumptions and "how low can I go", but if you REALLY wanted a cluster with resilience to planned and unplanned maintenance, you would have redundancy, just like that RAID 6 disk box would presumably have a fair amount of hot and perhaps cold spares nearby to kick in if lots of disks started going missing.

--
May the most significant bit of your life be positive.
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
It's a very bad idea to accept data if you can't guarantee that it will be stored in a way that tolerates a disk outage without data loss. Just don't.

Increase the number of coding chunks to 3 if you want to withstand two simultaneous disk failures without impacting availability.

Paul

2018-05-08 1:46 GMT+02:00 Maciej Puzio:
> Paul, many thanks for your reply.
> Thinking about it, I can't decide if I'd prefer to operate the storage server without redundancy, or have it automatically force a downtime, subjecting me to the rage of my users and my boss.
> But I think that the typical expectation is that the system serves the data while it is able to do so. Since ceph by default does otherwise, may I suggest that this is explained in the docs? As things are now, I needed a trial-and-error approach to figure out why ceph was not working in a setup that I think was hardly exotic, and in fact resembled an ordinary RAID 6.
>
> Which leaves us with a mishmash of PG states. Is it normal? If not, would I have avoided it if I had created the pool with min_size=k=3 from the start? In other words, does min_size influence the assignment of PGs to OSDs? Or is it only used to force I/O shutdown in the event of OSD failures?
>
> Thank you very much
>
> Maciej Puzio
>
> On Mon, May 7, 2018 at 5:00 PM, Paul Emmerich wrote:
>> The docs seem wrong here. min_size is available for erasure coded pools and works like you'd expect it to work.
>> Still, it's not a good idea to reduce it to the number of data chunks.
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
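For concreteness, the suggestion above means a profile with m=3 coding chunks. On the 5-host cluster from this thread with crush-failure-domain=host, that only fits if k is reduced as well (each of the k+m chunks needs its own host), for example k=2, m=3. A sketch with made-up names; note that an existing pool's erasure code profile cannot be changed, so the data would have to be migrated to a new pool:

    # 2+3 profile: five chunks, one per host, tolerates the loss of any 3 chunks
    ceph osd erasure-code-profile set ec23 k=2 m=3 crush-failure-domain=host

    # with the default min_size = k+1 = 3, this pool keeps serving I/O with two
    # hosts down (at the cost of 2.5x raw space per byte stored)
    ceph osd pool create ecpool_m3 128 128 erasure ec23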
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
Paul, many thanks for your reply.
Thinking about it, I can't decide if I'd prefer to operate the storage server without redundancy, or have it automatically force a downtime, subjecting me to the rage of my users and my boss.
But I think that the typical expectation is that the system serves the data while it is able to do so. Since ceph by default does otherwise, may I suggest that this is explained in the docs? As things are now, I needed a trial-and-error approach to figure out why ceph was not working in a setup that I think was hardly exotic, and in fact resembled an ordinary RAID 6.

Which leaves us with a mishmash of PG states. Is it normal? If not, would I have avoided it if I had created the pool with min_size=k=3 from the start? In other words, does min_size influence the assignment of PGs to OSDs? Or is it only used to force I/O shutdown in the event of OSD failures?

Thank you very much

Maciej Puzio

On Mon, May 7, 2018 at 5:00 PM, Paul Emmerich wrote:
> The docs seem wrong here. min_size is available for erasure coded pools and works like you'd expect it to work.
> Still, it's not a good idea to reduce it to the number of data chunks.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?
The docs seem wrong here. min_size is available for erasure coded pools and works like you'd expect it to work.
Still, it's not a good idea to reduce it to the number of data chunks.

Paul

2018-05-07 23:26 GMT+02:00 Maciej Puzio:
> I am an admin in a research lab looking for a cluster storage solution, and a newbie to ceph. I have set up a mini toy cluster on some VMs, to familiarize myself with ceph and to test failure scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.
>
> I then killed two OSD VMs with the expectation that the data pool would still be available, even if in a degraded state, but I found that this was not the case, and that the pool became inaccessible for reading and writing. I listed PGs (ceph pg ls) and found the majority of PGs in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.
>
> This is what I found in the ceph documentation in relation to min_size and the resiliency of erasure-coded pools:
>
> 1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size are for replicated pools only.
> 2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels, by default?
> 3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1, appears not to apply to erasure-coded pools. But in fact it does, and the default min_size != k causes surprising behavior.
> 4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one confused.
>
> I followed the advice in #4 and reduced min_size to 3. Lo and behold, the pool became accessible, and I could read the data previously stored, and write new data. This appears to contradict #1, but at least it works. A look at ceph pg ls revealed another mystery, though. Most of the PGs were now active+undersized, often with ...+degraded and/or remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?)
>
> I apologize if this behavior turns out to be expected and straightforward to experienced ceph users, or if I missed some documentation that explains this clearly. My goal is to put about 500 TB on ceph or another cluster storage system, and I find these issues confusing and worrisome. Helpful and competent replies will be much appreciated. Please note that my questions are about erasure-coded pools, and not about replicated pools.
> Thank you
>
> Maciej Puzio

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90