Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Maciej Puzio
I still don't understand why I get any clean PGs in the erasure-coded
pool when, with two OSDs down, there is no more redundancy, and
therefore all PGs should be undersized (or so I think).
I repeated the experiment by bringing the two killed OSDs back online
and then killing them again, and got results similar to the previous
test. But this time I observed the process more closely.

Example showing state changes for one of the PGs [OSD assignment in brackets]:
When all OSDs were online: [3,2,4,0,1], state: active+clean
Initially after OSDs 3 and 4 were killed: [x,2,x,0,1], state:
active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,2,0,0,1], state: active+clean
('x' means that some large number was listed; I assume this means the
original OSD is unavailable)

Another PG:
When all OSDs were online: [0,3,2,1,4], state: active+clean
Initially after OSDs 3 and 4 were killed: [0,x,2,1,x], state:
active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,1,2,1,1], state:
active+clean+remapped
Note: This PG became remapped, the previous one did not.

Does this mean that these PGs now have 5 chunks, of which 3 are stored
on one OSD?
Perhaps I am missing something, but can such an arrangement really be
redundant? And how can a non-redundant state be considered clean?
By the way, I am using crush-failure-domain=host, and I have one OSD per host.
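
For reference, this is roughly how the mappings above can be inspected
(the pool name cephfs_data and the PG id 2.3f are placeholders for my
setup):

    # list the data pool's PGs together with their UP and ACTING OSD sets
    ceph pg ls-by-pool cephfs_data
    # show where CRUSH wants one PG (up) vs. which OSDs currently serve it (acting)
    ceph pg map 2.3f
    # full detail for that PG, including the "up" and "acting" arrays
    ceph pg 2.3f query

Comparing the "up" and "acting" sets should show whether the
[0,2,0,0,1]-style listings above are where CRUSH wants the chunks, or
merely where they are being served from right now.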

On the bright side, I have no complaints about how the replicated
metadata pool operates. Unfortunately, I will not be able to use
replication for the data in my future production cluster.
One more thing: I figured out that "degraded" means "undersized and
contains data".

Thanks

Maciej Puzio


On Wed, May 9, 2018 at 7:07 PM, Gregory Farnum  wrote:
> On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio  wrote:
>> My setup consists of two pools on 5 OSDs, and is intended for cephfs:
>> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
>> 4), number of PGs=128
>> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>>
>> When all OSDs were online, all PGs from both pools had status
>> active+clean. After killing two of five OSDs (and changing min_size to
>> 3), all metadata pool PGs remained active+clean, and of 128 data pool
>> PGs, 3 remained active+clean, 11 became active+clean+remapped, and the
>> rest became active+undersized, active+undersized+remapped,
>> active+undersized+degraded or active+undersized+degraded+remapped,
>> seemingly at random.
>>
>> After some time one of the remaining three OSD nodes lost network
>> connectivity (due to ceph-unrelated bug in virtio_net; this toy setup
>> sure is becoming a bug motherlode!). The node was rebooted, the ceph
>> cluster became accessible again (with 3 out of 5 OSDs online, as
>> before), and three active+clean data pool PGs now became
>> active+clean+remapped, while the rest of the PGs seem to have kept their
>> previous status.
>
> That collection makes sense if you have a replicated pool as well as
> an EC one, then. They represent different states for the PG; see
> http://docs.ceph.com/docs/jewel/rados/operations/pg-states/.
> They're not random: which PG ends up in which set of states is
> determined by how CRUSH placement and the failures interact, and CRUSH
> is a pseudo-random algorithm, so... ;)


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Gregory Farnum
On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio  wrote:
> My setup consists of two pools on 5 OSDs, and is intended for cephfs:
> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
> 4), number of PGs=128
> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>
> When all OSDs were online, all PGs from both pools had status
> active+clean. After killing two of five OSDs (and changing min_size to
> 3), all metadata pool PGs remained active+clean, and of 128 data pool
> PGs, 3 remained active+clean, 11 became active+clean+remapped, and the
> rest became active+undersized, active+undersized+remapped,
> active+undersized+degraded or active+undersized+degraded+remapped,
> seemingly at random.
>
> After some time one of the remaining three OSD nodes lost network
> connectivity (due to ceph-unrelated bug in virtio_net; this toy setup
> sure is becoming a bug motherlode!). The node was rebooted, the ceph
> cluster became accessible again (with 3 out of 5 OSDs online, as
> before), and three active+clean data pool PGs now became
> active+clean+remapped, while the rest of the PGs seem to have kept their
> previous status.

That collection makes sense if you have a replicated pool as well as
an EC one, then. They represent different states for the PG; see
http://docs.ceph.com/docs/jewel/rados/operations/pg-states/.
They're not random: which PG ends up in which set of states is
determined by how CRUSH placement and the failures interact, and CRUSH
is a pseudo-random algorithm, so... ;)
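
If you want to see exactly what CRUSH decides for your map, you can
replay it offline with crushtool; a rough sketch (the pool name, rule
id, osd ids and x range are placeholders to adapt to your cluster):

    # grab the current CRUSH map and find the EC pool's rule id
    ceph osd getcrushmap -o crushmap.bin
    ceph osd pool get cephfs_data crush_rule
    ceph osd crush rule dump
    # replay placements for a range of inputs, with osd.3 and osd.4 weighted out
    crushtool -i crushmap.bin --test --show-mappings \
        --rule 1 --num-rep 5 --min-x 0 --max-x 9 --weight 3 0 --weight 4 0

That makes it easy to see which placements survive a given set of
failures, and hence why different PGs end up in different states.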


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Maciej Puzio
My setup consists of two pools on 5 OSDs, and is intended for cephfs:
1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
4), number of PGs=128
2. replicated metadata pool: size=3, min_size=2, number of PGs=100
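
The two pools above were created more or less like this (profile and
pool names are placeholders, quoted from memory):

    # EC profile and data pool, failure domain = host (one OSD per host)
    ceph osd erasure-code-profile set ec32 k=3 m=2 crush-failure-domain=host
    ceph osd pool create cephfs_data 128 128 erasure ec32
    ceph osd pool set cephfs_data allow_ec_overwrites true   # needed for cephfs on an EC pool (bluestore)
    # replicated metadata pool
    ceph osd pool create cephfs_metadata 100 100 replicated
    # cephfs was then created on top of these two pools with "ceph fs new"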

When all OSDs were online, all PGs from both pools had status
active+clean. After killing two of five OSDs (and changing min_size to
3), all metadata pool PGs remained active+clean, and of 128 data pool
PGs, 3 remained active+clean, 11 became active+clean+remapped, and the
rest became active+undersized, active+undersized+remapped,
active+undersized+degraded or active+undersized+degraded+remapped,
seemingly at random.

After some time one of the remaining three OSD nodes lost network
connectivity (due to ceph-unrelated bug in virtio_net; this toy setup
sure is becoming a bug motherlode!). The node was rebooted, the ceph
cluster became accessible again (with 3 out of 5 OSDs online, as
before), and three active+clean data pool PGs now became
active+clean+remapped, while the rest of the PGs seem to have kept their
previous status.

Thanks

Maciej Puzio


On Wed, May 9, 2018 at 4:49 PM, Gregory Farnum  wrote:
> active+clean does not make a lot of sense if every PG really was 3+2. But
> perhaps you had a 3x replicated pool or something hanging out as well from
> your deployment tool?
> The active+clean+remapped means that a PG was somehow lucky enough to have
> an existing "stray" copy on one of the OSDs that it has decided to use to
> bring it back up to the right number of copies, even though they certainly
> won't match the proper failure domains.
> The min_size in relation to the k+m values won't have any direct impact
> here, although it might have an indirect effect by changing how quickly
> stray PGs get deleted.
> -Greg


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Gregory Farnum
On Tue, May 8, 2018 at 2:16 PM Maciej Puzio  wrote:

> Thank you everyone for your replies. However, I feel that at least
> part of the discussion deviated from the topic of my original post. As
> I wrote before, I am dealing with a toy cluster, whose purpose is not
> to provide resilient storage, but to evaluate ceph and its behavior
> in the event of a failure, with particular attention paid to
> worst-case scenarios. This cluster is purposely minimal, and is built
> on VMs running on my workstation, all OSDs storing data on a single
> SSD. That's definitely not a production system.
>
> I am not asking for advice on how to build resilient clusters, not at
> this point. I asked some questions about specific things that I
> noticed during my tests, and that I was not able to find explained in
> ceph documentation. Dan van der Ster wrote:
> > See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
> > defaults to k+1 on ec pools.
> That's a good point, but I am wondering why reads are also blocked
> when the number of OSDs falls to k. And if the total number of OSDs in
> a pool (n) is larger than k+m, should min_size then be k(+1) or
> n-m(+1)?
> In any case, since min_size can be easily changed, I guess this is not
> an implementation issue, but rather a documentation issue.
>
> Which leaves my questions still unanswered:
> After killing m OSDs and setting min_size=k, most of the PGs were now
> active+undersized, often with ...+degraded and/or remapped, but a few
> were active+clean or active+clean+remapped. Why? I would expect all
> PGs to be in the same state (perhaps active+undersized+degraded?).
> Is this mishmash of PG states normal? If not, would I have avoided it
> if I created the pool with min_size=k=3 from the start? In other
> words, does min_size influence the assignment of PGs to OSDs? Or is it
> only used to force I/O shutdown in the event of OSD failures?
>

active+clean does not make a lot of sense if every PG really was 3+2. But
perhaps you had a 3x replicated pool or something hanging out as well from
your deployment tool?
The active+clean+remapped means that a PG was somehow lucky enough to have
an existing "stray" copy on one of the OSDs that it has decided to use to
bring it back up to the right number of copies, even though they certainly
won't match the proper failure domains.
The min_size in relation to the k+m values won't have any direct impact
here, although it might have an indirect effect by changing how quickly
stray PGs get deleted.
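
If you want to see which PGs are relying on such a stray copy, comparing
the "up" and "acting" sets shows it; something like (the PG id is just a
placeholder):

    # PGs whose acting set differs from what CRUSH currently wants (the up set)
    ceph pg ls remapped
    # then compare the "up" and "acting" arrays for one of them
    ceph pg 2.1a query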
-Greg


>
> Thank you very much
>
> Maciej Puzio


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Maciej Puzio
Thank you everyone for your replies. However, I feel that at least
part of the discussion deviated from the topic of my original post. As
I wrote before, I am dealing with a toy cluster, whose purpose is not
to provide resilient storage, but to evaluate ceph and its behavior
in the event of a failure, with particular attention paid to
worst-case scenarios. This cluster is purposely minimal, and is built
on VMs running on my workstation, all OSDs storing data on a single
SSD. That's definitely not a production system.

I am not asking for advice on how to build resilient clusters, not at
this point. I asked some questions about specific things that I
noticed during my tests, and that I was not able to find explained in
ceph documentation. Dan van der Ster wrote:
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size 
> defaults to k+1 on ec pools.
That's a good point, but I am wondering why reads are also blocked
when the number of OSDs falls to k. And if the total number of OSDs in
a pool (n) is larger than k+m, should min_size then be k(+1) or
n-m(+1)?
In any case, since min_size can be easily changed, I guess this is not
an implementation issue, but rather a documentation issue.
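
To spell out the arithmetic for my own setup (cephfs_data is a
placeholder name for the data pool):

    k=3, m=2  ->  size = k+m = 5, default min_size = k+1 = 4
    2 of 5 OSDs down  ->  3 OSDs left = k  ->  below min_size, so I/O is blocked
    ceph osd pool set cephfs_data min_size 3   # after this, I/O resumes,
                                               # but with no redundancy left
    # the current values can be checked with:
    ceph osd pool get cephfs_data size
    ceph osd pool get cephfs_data min_size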

Which leaves my questions still unanswered:
After killing m OSDs and setting min_size=k, most of the PGs were now
active+undersized, often with ...+degraded and/or remapped, but a few
were active+clean or active+clean+remapped. Why? I would expect all
PGs to be in the same state (perhaps active+undersized+degraded?).
Is this mishmash of PG states normal? If not, would I have avoided it
if I created the pool with min_size=k=3 from the start? In other
words, does min_size influence the assignment of PGs to OSDs? Or is it
only used to force I/O shutdown in the event of OSD failures?
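
For completeness, the mishmash of states I keep referring to is what I
see in output like this:

    # summary counts of PGs per state combination
    ceph -s
    # one line per PG with its state and its up/acting OSD sets
    ceph pg dump pgs_brief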

Thank you very much

Maciej Puzio


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Vasu Kulkarni
On Tue, May 8, 2018 at 12:07 PM, Dan van der Ster  wrote:
> On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni  wrote:
>> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio  wrote:
>>> I am an admin in a research lab looking for a cluster storage
>>> solution, and a newbie to ceph. I have set up a mini toy cluster on
>>> some VMs, to familiarize myself with ceph and to test failure
>>> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
>>> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
>>> replicated pool for metadata, and CephFS on top of them, using default
>>> settings wherever possible. I mounted the filesystem on another
>>> machine and verified that it worked.
>>>
>>> I then killed two OSD VMs with an expectation that the data pool will
>>> still be available, even if in a degraded state, but I found that this
>>> was not the case, and that the pool became inaccessible for reading
>>> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
>>> in an incomplete state. I then found that the pool had size=5 and
>>> min_size=4. Where did the value 4 come from, I do not know.
>>>
>>> This is what I found in the ceph documentation in relation to min_size
>>> and resiliency of erasure-coded pools:
>>>
>>> 1. According to
>>> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
>>> size and min_size are for replicated pools only.
>>> 2. According to the same document, for erasure-coded pools the number
>>> of OSDs that are allowed to fail without losing data equals the number
>>> of coding chunks (m=2 in my case). Of course data loss is not the same
>>> thing as lack of access, but why do these two things happen at
>>> different redundancy levels by default?
>>> 3. The same document states that no object in the data pool will
>>> receive I/O with fewer than min_size replicas. This refers to
>>> replicas, and taken together with #1, appears not to apply to
>>> erasure-coded pools. But in fact it does, and the default min_size !=
>>> k causes a surprising behavior.
>>> 4. According to
>>> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
>>> reducing min_size may allow recovery of an erasure-coded pool. This
>>> advice was deemed unhelpful and removed from documentation (commit
>>> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
>>> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
>>> one confused.
>>
>>
>> You bring up a good inconsistency that needs to be addressed. AFAIK,
>> only the m value is important for EC pools; I am not sure if the
>> *replicated* metadata pool is somehow why the min_size change in your
>> experiment worked. When we create a replicated pool it has an option
>> for min_size, and for an EC pool it is the m value.
>
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
> defaults to k+1 on ec pools.

So this looks like it's happening by default per EC pool, unless the
user changes the pool min_size.
Probably this should be left unchanged and we could document it? It is
a bit confusing with the coding chunks.

>
> Cheers, Dan


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Dan van der Ster
On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni  wrote:
> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio  wrote:
>> I am an admin in a research lab looking for a cluster storage
>> solution, and a newbie to ceph. I have set up a mini toy cluster on
>> some VMs, to familiarize myself with ceph and to test failure
>> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
>> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
>> replicated pool for metadata, and CephFS on top of them, using default
>> settings wherever possible. I mounted the filesystem on another
>> machine and verified that it worked.
>>
>> I then killed two OSD VMs with an expectation that the data pool will
>> still be available, even if in a degraded state, but I found that this
>> was not the case, and that the pool became inaccessible for reading
>> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
>> in an incomplete state. I then found that the pool had size=5 and
>> min_size=4. Where did the value 4 come from, I do not know.
>>
>> This is what I found in the ceph documentation in relation to min_size
>> and resiliency of erasure-coded pools:
>>
>> 1. According to
>> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
>> size and min_size are for replicated pools only.
>> 2. According to the same document, for erasure-coded pools the number
>> of OSDs that are allowed to fail without losing data equals the number
>> of coding chunks (m=2 in my case). Of course data loss is not the same
>> thing as lack of access, but why do these two things happen at
>> different redundancy levels by default?
>> 3. The same document states that no object in the data pool will
>> receive I/O with fewer than min_size replicas. This refers to
>> replicas, and taken together with #1, appears not to apply to
>> erasure-coded pools. But in fact it does, and the default min_size !=
>> k causes a surprising behavior.
>> 4. According to
>> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
>> reducing min_size may allow recovery of an erasure-coded pool. This
>> advice was deemed unhelpful and removed from documentation (commit
>> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
>> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
>> one confused.
>
>
> You bring up a good inconsistency that needs to be addressed. AFAIK,
> only the m value is important for EC pools; I am not sure if the
> *replicated* metadata pool is somehow why the min_size change in your
> experiment worked. When we create a replicated pool it has an option
> for min_size, and for an EC pool it is the m value.

See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
defaults to k+1 on ec pools.

Cheers, Dan


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Vasu Kulkarni
On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio  wrote:
> I am an admin in a research lab looking for a cluster storage
> solution, and a newbie to ceph. I have set up a mini toy cluster on
> some VMs, to familiarize myself with ceph and to test failure
> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
> replicated pool for metadata, and CephFS on top of them, using default
> settings wherever possible. I mounted the filesystem on another
> machine and verified that it worked.
>
> I then killed two OSD VMs with an expectation that the data pool will
> still be available, even if in a degraded state, but I found that this
> was not the case, and that the pool became inaccessible for reading
> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
> in an incomplete state. I then found that the pool had size=5 and
> min_size=4. Where did the value 4 come from, I do not know.
>
> This is what I found in the ceph documentation in relation to min_size
> and resiliency of erasure-coded pools:
>
> 1. According to
> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
> size and min_size are for replicated pools only.
> 2. According to the same document, for erasure-coded pools the number
> of OSDs that are allowed to fail without losing data equals the number
> of coding chunks (m=2 in my case). Of course data loss is not the same
> thing as lack of access, but why do these two things happen at
> different redundancy levels by default?
> 3. The same document states that no object in the data pool will
> receive I/O with fewer than min_size replicas. This refers to
> replicas, and taken together with #1, appears not to apply to
> erasure-coded pools. But in fact it does, and the default min_size !=
> k causes a surprising behavior.
> 4. According to
> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
> reducing min_size may allow recovery of an erasure-coded pool. This
> advice was deemed unhelpful and removed from documentation (commit
> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
> one confused.


You bring up a good inconsistency that needs to be addressed. AFAIK,
only the m value is important for EC pools; I am not sure if the
*replicated* metadata pool is somehow why the min_size change in your
experiment worked. When we create a replicated pool it has an option
for min_size, and for an EC pool it is the m value.
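
Either way, the actual values can be checked directly on both pools
(pool names below are just the usual cephfs defaults):

    ceph osd pool get cephfs_metadata min_size
    ceph osd pool get cephfs_data min_size
    # the EC pool's k and m come from its profile
    ceph osd pool get cephfs_data erasure_code_profile
    ceph osd erasure-code-profile get <profile name printed above>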

>
> I followed the advice #4 and reduced min_size to 3. Lo and behold, the
> pool became accessible, and I could read the data previously stored,
> and write new one. This appears to contradict #1, but at least it
> works. A look at ceph pg ls revealed another mystery, though. Most
> PGs were now active+undersized, often with ...+degraded and/or
> remapped, but a few were active+clean or active+clean+remapped. Why? I
> would expect all PGs to be in the same state (perhaps
> active+undersized+degraded?)
>
> I apologize if this behavior turns out to be expected and
> straightforward to experienced ceph users, or if I missed some
> documentation that explains this clearly. My goal is to put about 500
> TB on ceph or another cluster storage system, and I find these issues
> confusing and worrisome. Helpful and competent replies will be much
> appreciated. Please note that my questions are about erasure-coded
> pools, and not about replicated pools.
>
> Thank you
>
> Maciej Puzio


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread David Turner
You talked about "using default settings wherever possible"... Well, Ceph's
defaults, everywhere they exist, are to not allow you to write unless
you have at least 1 more copy that you can lose without data loss.
If your bosses require you to be able to lose 2 servers and still serve
customers, then tell them that Ceph requires you to have 3 parity
(coding) chunks for the data.

Why do you want to change your one and only copy of the data while you
already have a degraded system? And not just a degraded system, but a
system where 2/5 of your servers are down... That sounds awful, terrible,
and just plain bad.

To directly answer your question about min_size, min_size does not affect
where data is placed.  It only affects when a PG claims to not have enough
copies online to be able to receive read or write requests.

On Tue, May 8, 2018 at 7:47 AM Janne Johansson  wrote:

> 2018-05-08 1:46 GMT+02:00 Maciej Puzio :
>
>> Paul, many thanks for your reply.
>> Thinking about it, I can't decide if I'd prefer to operate the storage
>> server without redundancy, or have it automatically force a downtime,
>> subjecting me to a rage of my users and my boss.
>> But I think that the typical expectation is that system serves the
>> data while it is able to do so.
>
>
> If you want to prevent angry bosses, you would have made 10 OSD hosts,
> or some other large number, so that ceph could place PGs over more
> places and 2 lost hosts would not have as much impact, but also so it
> could recover each PG onto one of the 10 (minus the two broken ones,
> minus the three that already hold the data you want to spread out)
> other OSDs and get back into full service even with two lost hosts.
>
> It's fun to test assumptions and "how low can I go", but if you REALLY
> wanted a cluster with resilience to planned and unplanned maintenance,
> you would have redundancy, just like that RAID 6 disk box would
> presumably have a fair amount of hot and perhaps cold spares nearby to
> kick in if lots of disks started going missing.
>
> --
> May the most significant bit of your life be positive.


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Janne Johansson
2018-05-08 1:46 GMT+02:00 Maciej Puzio :

> Paul, many thanks for your reply.
> Thinking about it, I can't decide if I'd prefer to operate the storage
> server without redundancy, or have it automatically force a downtime,
> subjecting me to a rage of my users and my boss.
> But I think that the typical expectation is that system serves the
> data while it is able to do so.


If you want to prevent angry bosses, you would have made 10 OSD hosts,
or some other large number, so that ceph could place PGs over more
places and 2 lost hosts would not have as much impact, but also so it
could recover each PG onto one of the 10 (minus the two broken ones,
minus the three that already hold the data you want to spread out)
other OSDs and get back into full service even with two lost hosts.

It's fun to test assumptions and "how low can I go", but if you REALLY
wanted a cluster with resilience to planned and unplanned maintenance,
you would have redundancy, just like that RAID 6 disk box would
presumably have a fair amount of hot and perhaps cold spares nearby to
kick in if lots of disks started going missing.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Paul Emmerich
It's a very bad idea to accept data if you can't guarantee that it will
be stored in a way that tolerates a disk outage without data loss. Just
don't.

Increase the number of coding chunks to 3 if you want to withstand two
simultaneous disk
failures without impacting availability.
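
Since the profile of an existing pool cannot be changed, that means
creating a new pool with a new profile and migrating the data; roughly
(names are placeholders):

    # a 3+3 profile with host as the failure domain
    ceph osd erasure-code-profile set ec33 k=3 m=3 crush-failure-domain=host
    # new data pool using that profile (128 PGs just as an example)
    ceph osd pool create cephfs_data_ec33 128 128 erasure ec33

Note that k+m=6 chunks need 6 separate hosts with
crush-failure-domain=host, so on the 5-host test setup you would also
have to add a host or lower k.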


Paul


2018-05-08 1:46 GMT+02:00 Maciej Puzio :

> Paul, many thanks for your reply.
> Thinking about it, I can't decide if I'd prefer to operate the storage
> server without redundancy, or have it automatically force a downtime,
> subjecting me to a rage of my users and my boss.
> But I think that the typical expectation is that system serves the
> data while it is able to do so. Since ceph by default does otherwise,
> may I suggest that this is explained in the docs? As things are now, I
> needed a trial-and-error approach to figure out why ceph was not
> working in a setup that I think was hardly exotic, and in fact
> resembled an ordinary RAID 6.
>
> Which leaves us with a mishmash of PG states. Is it normal? If not,
> would I have avoided it if I created the pool with min_size=k=3 from
> the start? In other words, does min_size influence the assignment of
> PGs to OSDs? Or is it only used to force I/O shutdown in the event of
> OSD failures?
>
> Thank you very much
>
> Maciej Puzio
>
>
> On Mon, May 7, 2018 at 5:00 PM, Paul Emmerich 
> wrote:
> > The docs seem wrong here. min_size is available for erasure coded pools
> and
> > works like you'd expect it to work.
> > Still, it's not a good idea to reduce it to the number of data chunks.
> >
> >
> > Paul
> >
> > --
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90



-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-07 Thread Maciej Puzio
Paul, many thanks for your reply.
Thinking about it, I can't decide if I'd prefer to operate the storage
server without redundancy, or have it automatically force downtime,
subjecting me to the rage of my users and my boss.
But I think that the typical expectation is that the system serves the
data while it is able to do so. Since ceph by default does otherwise,
may I suggest that this is explained in the docs? As things are now, I
needed a trial-and-error approach to figure out why ceph was not
working in a setup that I think was hardly exotic, and in fact
resembled an ordinary RAID 6.

Which leaves us with a mishmash of PG states. Is it normal? If not,
would I have avoided it if I created the pool with min_size=k=3 from
the start? In other words, does min_size influence the assignment of
PGs to OSDs? Or is it only used to force I/O shutdown in the event of
OSD failures?

Thank you very much

Maciej Puzio


On Mon, May 7, 2018 at 5:00 PM, Paul Emmerich  wrote:
> The docs seem wrong here. min_size is available for erasure coded pools and
> works like you'd expect it to work.
> Still, it's not a good idea to reduce it to the number of data chunks.
>
>
> Paul
>
> --
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-07 Thread Paul Emmerich
The docs seem wrong here. min_size is available for erasure coded pools and
works like you'd expect it to work.
Still, it's not a good idea to reduce it to the number of data chunks.


Paul

2018-05-07 23:26 GMT+02:00 Maciej Puzio :

> I am an admin in a research lab looking for a cluster storage
> solution, and a newbie to ceph. I have set up a mini toy cluster on
> some VMs, to familiarize myself with ceph and to test failure
> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
> replicated pool for metadata, and CephFS on top of them, using default
> settings wherever possible. I mounted the filesystem on another
> machine and verified that it worked.
>
> I then killed two OSD VMs with an expectation that the data pool will
> still be available, even if in a degraded state, but I found that this
> was not the case, and that the pool became inaccessible for reading
> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
> in an incomplete state. I then found that the pool had size=5 and
> min_size=4. Where did the value 4 come from, I do not know.
>
> This is what I found in the ceph documentation in relation to min_size
> and resiliency of erasure-coded pools:
>
> 1. According to
> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
> size and min_size are for replicated pools only.
> 2. According to the same document, for erasure-coded pools the number
> of OSDs that are allowed to fail without losing data equals the number
> of coding chunks (m=2 in my case). Of course data loss is not the same
> thing as lack of access, but why do these two things happen at
> different redundancy levels by default?
> 3. The same document states that no object in the data pool will
> receive I/O with fewer than min_size replicas. This refers to
> replicas, and taken together with #1, appears not to apply to
> erasure-coded pools. But in fact it does, and the default min_size !=
> k causes a surprising behavior.
> 4. According to
> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
> reducing min_size may allow recovery of an erasure-coded pool. This
> advice was deemed unhelpful and removed from documentation (commit
> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
> one confused.
>
> I followed the advice #4 and reduced min_size to 3. Lo and behold, the
> pool became accessible, and I could read the data previously stored,
> and write new one. This appears to contradict #1, but at least it
> works. A look at ceph pg ls revealed another mystery, though. Most
> PGs were now active+undersized, often with ...+degraded and/or
> remapped, but a few were active+clean or active+clean+remapped. Why? I
> would expect all PGs to be in the same state (perhaps
> active+undersized+degraded?)
>
> I apologize if this behavior turns out to be expected and
> straightforward to experienced ceph users, or if I missed some
> documentation that explains this clearly. My goal is to put about 500
> TB on ceph or another cluster storage system, and I find these issues
> confusing and worrisome. Helpful and competent replies will be much
> appreciated. Please note that my questions are about erasure-coded
> pools, and not about replicated pools.
>
> Thank you
>
> Maciej Puzio



-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90