Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson


On 8/28/2014 4:17 PM, Craig Lewis wrote:

My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate
higher latencies.

I was running my cluster with noout and nodown set for weeks at a time.


I'm sure Craig will agree, but wanted to add this for other readers:

I find value in the noout flag for temporary intervention, but prefer to 
set "mon osd down out interval" to cover events that may occur in the 
future, giving an operator time to intervene.
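
For example (values are illustrative; 14400 seconds is the 4-hour window 
mentioned later in this thread):

---
# set the flag only for the duration of a planned intervention...
ceph osd set noout
# ...and clear it as soon as the work is done
ceph osd unset noout

# for unplanned events, give operators a window before Ceph marks down
# OSDs out; in ceph.conf on the monitors:
#   [mon]
#   mon osd down out interval = 14400
---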


The nodown flag is another beast altogether. The nodown flag tends to be 
*a bad thing* when attempting to provide reliable client io. For our use 
case, we want OSDs to be marked down quickly if they are in fact 
unavailable for any reason, so client io doesn't hang waiting for them.


If OSDs are flapping during recovery (i.e. the "wrongly marked me down" 
log messages), I've found far better results from tuning the recovery 
knobs than from permanently setting the nodown flag.
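
By "recovery knobs" I mean settings along these lines (example values only, 
not a recommendation -- start conservative and watch client latency):

---
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
---

The same settings can be made permanent under [osd] in ceph.conf.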


- Mike



  Recovery of a single OSD might cause other OSDs to crash. In the
primary cluster, I was always able to get it under control before it
cascaded too wide.  In my secondary cluster, it did spiral out to 40% of
the OSDs, with 2-5 OSDs down at any time.






I traced my problems to a combination of osd max backfills being too high
for my cluster and mkfs.xfs arguments that were causing memory starvation
issues.  I lowered osd max backfills, added SSD journals,
and reformatted every OSD with better mkfs.xfs arguments.  Now both
clusters are stable, and I don't want to break it.

I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up
will also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson <mike.daw...@cloudapt.com> wrote:


We use 3x replication and have drives that have relatively high
steady-state IOPS. Therefore, we tend to prioritize client-side IO
more than a reduction from 3 copies to 2 during the loss of one
disk. The disruption to client io is so great on our cluster, we
don't want our cluster to be in a recovery state without
operator-supervision.

Letting OSDs get marked out without operator intervention was a
disaster in the early going of our cluster. For example, an OSD
daemon crash would trigger automatic recovery where it was unneeded.
Ironically, the unneeded recovery would often trigger
additional daemons to crash, making a bad situation worse. During
the recovery, rbd client io would often times go to 0.

To deal with this issue, we set "mon osd down out interval = 14400",
so as operators we have 4 hours to intervene before Ceph attempts to
self-heal. When hardware is at fault, we remove the osd, replace the
drive, re-add the osd, then allow backfill to begin, thereby
completely skipping step B in your timeline above.

- Mike





Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Craig Lewis
My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate higher
latencies.

I was running my cluster with noout and nodown set for weeks at a time.
 Recovery of a single OSD might cause other OSDs to crash.  In the primary
cluster, I was always able to get it under control before it cascaded too
wide.  In my secondary cluster, it did spiral out to 40% of the OSDs, with
2-5 OSDs down at any time.

I traced my problems to a combination of osd max backfills being too high for
my cluster and mkfs.xfs arguments that were causing memory starvation
issues.  I lowered osd max backfills, added SSD journals, and reformatted
every OSD with better mkfs.xfs arguments.  Now both clusters are stable,
and I don't want to break it.
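
For anyone heading down the same path, the reformat is easier to keep
consistent if the mkfs/mount arguments live in ceph.conf so newly prepared
OSDs pick them up automatically. The values below are only an example (not
the exact arguments I used, and the device name is a placeholder); -i
size=2048 was the commonly documented inode size for Ceph OSDs at the time:

---
# reformat example; adapt the device name
mkfs.xfs -f -i size=2048 /dev/sdX1
# and, to keep future OSDs consistent, in ceph.conf:
#   [osd]
#   osd mkfs options xfs = -f -i size=2048
#   osd mount options xfs = rw,noatime,inode64
---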

I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up will
also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson 
wrote:

>
> We use 3x replication and have drives that have relatively high
> steady-state IOPS. Therefore, we tend to prioritize client-side IO more
> than a reduction from 3 copies to 2 during the loss of one disk. The
> disruption to client io is so great on our cluster, we don't want our
> cluster to be in a recovery state without operator-supervision.
>
> Letting OSDs get marked out without operator intervention was a disaster
> in the early going of our cluster. For example, an OSD daemon crash would
> trigger automatic recovery where it was unneeded. Ironically,
> the unneeded recovery would often trigger additional daemons to crash,
> making a bad situation worse. During the recovery, rbd client io would
> often times go to 0.
>
> To deal with this issue, we set "mon osd down out interval = 14400", so as
> operators we have 4 hours to intervene before Ceph attempts to self-heal.
> When hardware is at fault, we remove the osd, replace the drive, re-add the
> osd, then allow backfill to begin, thereby completely skipping step B in
> your timeline above.
>
> - Mike
>
>


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson

On 8/28/2014 11:17 AM, Loic Dachary wrote:



On 28/08/2014 16:29, Mike Dawson wrote:

On 8/28/2014 12:23 AM, Christian Balzer wrote:

On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:




On 27/08/2014 04:34, Christian Balzer wrote:


Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:


Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?



I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.

I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time, 4 minutes for
evacuating the OSD (having 7 write targets clearly helped) for measly
12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.


That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of a data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replica
stays the same at all times. There is not a sudden drop in the number of
replica  which is what I had in mind.


That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.


If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD although
nothing prevents the same OSD to be used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.

Am I being too optimistic ?

Vastly.


Do you see another blocking factor that
would significantly slow down recovery ?


As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.


Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD journals on 
SSDs, it is insufficient to calculate single-disk replacement backfill time 
based solely on network throughput. IOPS will likely be the limiting factor 
when backfilling a single failed spinner in a production cluster.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 
3:1), with dual 1GbE bonded NICs.

Using only the throughput math, backfill could have theoretically completed in 
a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times 
with similar results.

Why? Spindle contention on the replacement drive. Graph the '%util' metric from 
something like 'iostat -xt 2' during a single disk backfill to get a very clear 
view that spindle contention is the true limiting factor. It'll be pegged at or 
near 100% if spindle contention is the issue.

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Loic Dachary
Hi Blair,

On 28/08/2014 16:38, Blair Bethwaite wrote:
> Hi Loic,
> 
> Thanks for the reply and interesting discussion.

I'm learning a lot :-)

> On 26 August 2014 23:25, Loic Dachary  wrote:
>> Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
>> other disks are lost before recovery. Since the disk that failed initially 
>> participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG 
>> is lost.
> 
> Seems okay, so you're just taking the max PG spread as the worst case
> (noting as demonstrated with my numbers that the spread could be
> lower).
> 
> ...actually, I could be way off here, but if the chance of any one
> disk failing in that time is 0.0001%, then assuming the first failure
> has already happened I'd have thought it would be more like:
> (0.0001% / 2) * 99 * (0.0001% / 2) * 98
> ?
> As you're essentially calculating the probability of one more disk out
> of the remaining 99 failing, and then another out of the remaining 98
> (and so on), within the repair window (dividing by the number of
> remaining replicas for which the probability is being calculated, as
> otherwise you'd be counting their chance of failure in the recovery
> window multiple times). And of course this all assumes the recovery
> continues gracefully from the remaining replica/s when another failure
> occurs...?

That makes sense. I chose to arbitrarily ignore the probability of the first 
failure to happen because the event is not bounded in time. The second failure 
matters as long as it happens in the interval it takes for the cluster to 
create the missing copies and that seemed more important. 

> Taking your followup correcting the base chances of failure into
> account, then that looks like:
> 99(1/10^5 / 2) * 98(1/10^5 / 2)
> = 9.702e-7
> 1 in 1030715

If a disk participates in 100 PG with replica 3, it means there is a maximum of 
200 other disks involved (if the cluster is large enough and the odds of two 
disks being used together in more than one PG are very low). You are assuming 
that this total is 100 which seems a reasonable approximation. I guess it could 
be verified by tests on a crushmap. However, it also means that the second 
failing disk probably shares 2 PG with the first failing disk, in which case 
the 98 should rather be 2 (i.e. the number of PG that are down to one replica 
as a result of the double failure).   
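
To make the arithmetic easy to reproduce, here is the back-of-the-envelope
version using the rounded 1/100,000 hourly failure probability from my
correction (illustrative only -- it does not model actual CRUSH placement):

---
awk 'BEGIN {
  p = 1/100000                 # per-disk chance of failing within a given hour
  printf "two of ~100 peers failing during recovery: p^2 * 100 = %.0e\n", p*p*100
  printf "your 99*p * 98*p variant: %.4g (1 in %.0f)\n", 99*p*98*p, 1/(99*p*98*p)
}'
---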

> I'm also skeptical on the 1h recovery time - at the very least the
> issues regarding stalling client ops come into play here and may push
> the max_backfills down for operational reasons (after all, you can't
> have a general purpose volume storage service that periodically spikes
> latency due to normal operational tasks like recoveries).

If the cluster is overloaded (disks I/O, cluster network), re-creating the lost 
copies within less than 2h seems indeed unlikely.
 
>> Or the entire pool if it is used in a way that losing a PG means losing 
>> all data in the pool (as in your example, where it contains RBD volumes and 
>> each of the RBD volume uses all the available PG).
> 
> Well, there's actually another whole interesting conversation in here
> - assuming a decent filesystem is sitting on top of those RBDs it
> should be possible to get those filesystems back into working order
> and identify any lost inodes, and then, if you've got one you can
> recover from tape backup. BUT, if you have just one pool for these
> RBDs spread over the entire cluster then the amount of work to do that
> fsck-ing is quickly going to be problematic - you'd have to fsck every
> RBD! So I wonder if there is cause for partitioning large clusters
> into multiple pools, so that such a failure would (hopefully) have a
> more limited scope. Backups for DR purposes are only worth having (and
> paying for) if the DR plan is actually practical.
> 
>> If the pool is using at least two datacenters operated by two different 
>> organizations, this calculation makes sense to me. However, if the cluster 
>> is in a single datacenter, isn't it possible that some event independent of 
>> Ceph has a greater probability of permanently destroying the data ? A month 
>> ago I lost three machines in a Ceph cluster and realized on that occasion 
>> that the crushmap was not configured properly and that PG were lost as a 
>> result. Fortunately I was able to recover the disks and plug them in another 
>> machine to recover the lost PGs. I'm not a system administrator and the 
>> probability of me failing to do the right thing is higher than normal: this 
>> is just an example of a high probability event leading to data loss. In 
>> other words, I wonder if this 0.0001% chance of losing a PG within the hour 
>> following a disk failure matters or if it is dominated by other factors. 
>> What do you think ?
> 
> I wouldn't expect that number to be dominated by the chances of
> total-loss/godzilla events, but I'm no datacentre reliability guru (at
> least we don't have Godzilla here in Melbourne yet anyway).

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Christian Balzer
On Thu, 28 Aug 2014 10:29:20 -0400 Mike Dawson wrote:

> On 8/28/2014 12:23 AM, Christian Balzer wrote:
> > On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
> >
> >>
> >>
> >> On 27/08/2014 04:34, Christian Balzer wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> >>>
>  Hi Craig,
> 
>  I assume the reason for the 48 hours recovery time is to keep the
>  cost of the cluster low ? I wrote "1h recovery time" because it is
>  roughly the time it would take to move 4TB over a 10Gb/s link.
>  Could you upgrade your hardware to reduce the recovery time to less
>  than two hours ? Or are there factors other than cost that prevent
>  this ?
> 
> >>>
> >>> I doubt Craig is operating on a shoestring budget.
> >>> And even if his network were to be just GbE, that would still make it
> >>> only 10 hours according to your wishful thinking formula.
> >>>
> >>> He probably has set the max_backfills to 1 because that is the level
> >>> of I/O his OSDs can handle w/o degrading cluster performance too
> >>> much. The network is unlikely to be the limiting factor.
> >>>
> >>> The way I see it most Ceph clusters are in sort of steady state when
> >>> operating normally, i.e. a few hundred VM RBD images ticking over,
> >>> most actual OSD disk ops are writes, as nearly all hot objects that
> >>> are being read are in the page cache of the storage nodes.
> >>> Easy peasy.
> >>>
> >>> Until something happens that breaks this routine, like a deep scrub,
> >>> all those VMs rebooting at the same time or a backfill caused by a
> >>> failed OSD. Now all of a sudden client ops compete with the backfill
> >>> ops, page caches are no longer hot, the spinners are seeking left and
> >>> right. Pandemonium.
> >>>
> >>> I doubt very much that even with a SSD backed cluster you would get
> >>> away with less than 2 hours for 4TB.
> >>>
> >>> To give you some real life numbers, I currently am building a new
> >>> cluster but for the time being have only one storage node to play
> >>> with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs
> >>> and 8 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
> >>>
> >>> So I took out one OSD (reweight 0 first, then the usual removal
> >>> steps) because the actual disk was wonky. Replaced the disk and
> >>> re-added the OSD. Both operations took about the same time, 4
> >>> minutes for evacuating the OSD (having 7 write targets clearly
> >>> helped) for measly 12GB or about 50MB/s and 5 minutes or about 35MB/s
> >>> for refilling the OSD. And that is on one node (thus no network
> >>> latency) that has the default parameters (so a max_backfill of 10)
> >>> which was otherwise totally idle.
> >>>
> >>> In other words, in this pretty ideal case it would have taken 22
> >>> hours to re-distribute 4TB.
> >>
> >> That makes sense to me :-)
> >>
> >> When I wrote 1h, I thought about what happens when an OSD becomes
> >> unavailable with no planning in advance. In the scenario you describe
> >> the risk of a data loss does not increase since the objects are
> >> evicted gradually from the disk being decommissioned and the number
> >> of replica stays the same at all times. There is not a sudden drop in
> >> the number of replica  which is what I had in mind.
> >>
> > That may be, but I'm rather certain that there is no difference in
> > speed and priority of a rebalancing caused by an OSD set to weight 0
> > or one being set out.
> >
> >> If the lost OSD was part of 100 PG, the other disks (let say 50 of
> >> them) will start transferring a new replica of the objects they have
> >> to the new OSD in their PG. The replacement will not be a single OSD
> >> although nothing prevents the same OSD to be used in more than one PG
> >> as a replacement for the lost one. If the cluster network is
> >> connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s.
> >> Since the new duplicates do not originate from a single OSD but from
> >> at least dozens of them and since they target more than one OSD, I
> >> assume we can expect an actual throughput of 5Gb/s. I should have
> >> written 2h instead of 1h to account for the fact that the cluster
> >> network is never idle.
> >>
> >> Am I being too optimistic ?
> > Vastly.
> >
> >> Do you see another blocking factor that
> >> would significantly slow down recovery ?
> >>
> > As Craig and I keep telling you, the network is not the limiting
> > factor. Concurrent disk IO is, as I pointed out in the other thread.
> 
> Completely agree.
> 
Thank you for that voice of reason, backing things up by a real life
sizable cluster. ^o^

> On a production cluster with OSDs backed by spindles, even with OSD 
> journals on SSDs, it is insufficient to calculate single-disk 
> replacement backfill time based solely on network throughput. IOPS will 
> likely be the limiting factor when backfilling a single failed spinner 
> in a production cluster.
> 
> Last week I replaced a

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Loic Dachary


On 28/08/2014 16:29, Mike Dawson wrote:
> On 8/28/2014 12:23 AM, Christian Balzer wrote:
>> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
>>
>>>
>>>
>>> On 27/08/2014 04:34, Christian Balzer wrote:

 Hello,

 On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

> Hi Craig,
>
> I assume the reason for the 48 hours recovery time is to keep the cost
> of the cluster low ? I wrote "1h recovery time" because it is roughly
> the time it would take to move 4TB over a 10Gb/s link. Could you
> upgrade your hardware to reduce the recovery time to less than two
> hours ? Or are there factors other than cost that prevent this ?
>

 I doubt Craig is operating on a shoestring budget.
 And even if his network were to be just GbE, that would still make it
 only 10 hours according to your wishful thinking formula.

 He probably has set the max_backfills to 1 because that is the level of
 I/O his OSDs can handle w/o degrading cluster performance too much.
 The network is unlikely to be the limiting factor.

 The way I see it most Ceph clusters are in sort of steady state when
 operating normally, i.e. a few hundred VM RBD images ticking over, most
 actual OSD disk ops are writes, as nearly all hot objects that are
 being read are in the page cache of the storage nodes.
 Easy peasy.

 Until something happens that breaks this routine, like a deep scrub,
 all those VMs rebooting at the same time or a backfill caused by a
 failed OSD. Now all of a sudden client ops compete with the backfill
 ops, page caches are no longer hot, the spinners are seeking left and
 right. Pandemonium.

 I doubt very much that even with a SSD backed cluster you would get
 away with less than 2 hours for 4TB.

 To give you some real life numbers, I currently am building a new
 cluster but for the time being have only one storage node to play with.
 It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

 So I took out one OSD (reweight 0 first, then the usual removal steps)
 because the actual disk was wonky. Replaced the disk and re-added the
 OSD. Both operations took about the same time, 4 minutes for
 evacuating the OSD (having 7 write targets clearly helped) for measly
 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
 OSD. And that is on one node (thus no network latency) that has the
 default parameters (so a max_backfill of 10) which was otherwise
 totally idle.

 In other words, in this pretty ideal case it would have taken 22 hours
 to re-distribute 4TB.
>>>
>>> That makes sense to me :-)
>>>
>>> When I wrote 1h, I thought about what happens when an OSD becomes
>>> unavailable with no planning in advance. In the scenario you describe
>>> the risk of a data loss does not increase since the objects are evicted
>>> gradually from the disk being decommissioned and the number of replica
>>> stays the same at all times. There is not a sudden drop in the number of
>>> replica  which is what I had in mind.
>>>
>> That may be, but I'm rather certain that there is no difference in speed
>> and priority of a rebalancing caused by an OSD set to weight 0 or one
>> being set out.
>>
>>> If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
>>> will start transferring a new replica of the objects they have to the
>>> new OSD in their PG. The replacement will not be a single OSD although
>>> nothing prevents the same OSD to be used in more than one PG as a
>>> replacement for the lost one. If the cluster network is connected at
>>> 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
>>> duplicates do not originate from a single OSD but from at least dozens
>>> of them and since they target more than one OSD, I assume we can expect
>>> an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
>>> account for the fact that the cluster network is never idle.
>>>
>>> Am I being too optimistic ?
>> Vastly.
>>
>>> Do you see another blocking factor that
>>> would significantly slow down recovery ?
>>>
>> As Craig and I keep telling you, the network is not the limiting factor.
>> Concurrent disk IO is, as I pointed out in the other thread.
> 
> Completely agree.
> 
> On a production cluster with OSDs backed by spindles, even with OSD journals 
> on SSDs, it is insufficient to calculate single-disk replacement backfill 
> time based solely on network throughput. IOPS will likely be the limiting 
> factor when backfilling a single failed spinner in a production cluster.
> 
> Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
> cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 
> 3:1), with dual 1GbE bonded NICs.
> 
> Using only the throughput math, backfill could have theoretically completed in
> a bit over 2.5 hours, but it actually took 15 hours.

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Blair Bethwaite
Hi Loic,

Thanks for the reply and interesting discussion.

On 26 August 2014 23:25, Loic Dachary  wrote:
> Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
> other disks are lost before recovery. Since the disk that failed initially 
> participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
> lost.

Seems okay, so you're just taking the max PG spread as the worst case
(noting as demonstrated with my numbers that the spread could be
lower).

...actually, I could be way off here, but if the chance of any one
disk failing in that time is 0.0001%, then assuming the first failure
has already happened I'd have thought it would be more like:
(0.0001% / 2) * 99 * (0.0001% / 2) * 98
?
As you're essentially calculating the probability of one more disk out
of the remaining 99 failing, and then another out of the remaining 98
(and so on), within the repair window (dividing by the number of
remaining replicas for which the probability is being calculated, as
otherwise you'd be counting their chance of failure in the recovery
window multiple times). And of course this all assumes the recovery
continues gracefully from the remaining replica/s when another failure
occurs...?

Taking your followup correcting the base chances of failure into
account, then that looks like:
99(1/10^5 / 2) * 98(1/10^5 / 2)
= 9.702e-7
1 in 1030715

I'm also skeptical on the 1h recovery time - at the very least the
issues regarding stalling client ops come into play here and may push
the max_backfills down for operational reasons (after all, you can't
have a general purpose volume storage service that periodically spikes
latency due to normal operational tasks like recoveries).

> Or the entire pool if it is used in a way that losing a PG means losing all 
> data in the pool (as in your example, where it contains RBD volumes and each 
> of the RBD volume uses all the available PG).

Well, there's actually another whole interesting conversation in here
- assuming a decent filesystem is sitting on top of those RBDs it
should be possible to get those filesystems back into working order
and identify any lost inodes, and then, if you've got one you can
recover from tape backup. BUT, if you have just one pool for these
RBDs spread over the entire cluster then the amount of work to do that
fsck-ing is quickly going to be problematic - you'd have to fsck every
RBD! So I wonder if there is cause for partitioning large clusters
into multiple pools, so that such a failure would (hopefully) have a
more limited scope. Backups for DR purposes are only worth having (and
paying for) if the DR plan is actually practical.

> If the pool is using at least two datacenters operated by two different 
> organizations, this calculation makes sense to me. However, if the cluster is 
> in a single datacenter, isn't it possible that some event independent of Ceph 
> has a greater probability of permanently destroying the data ? A month ago I 
> lost three machines in a Ceph cluster and realized on that occasion that the 
> crushmap was not configured properly and that PG were lost as a result. 
> Fortunately I was able to recover the disks and plug them in another machine 
> to recover the lost PGs. I'm not a system administrator and the probability 
> of me failing to do the right thing is higher than normal: this is just an 
> example of a high probability event leading to data loss. In other words, I 
> wonder if this 0.0001% chance of losing a PG within the hour following a disk 
> failure matters or if it is dominated by other factors. What do you think ?

I wouldn't expect that number to be dominated by the chances of
total-loss/godzilla events, but I'm no datacentre reliability guru (at
least we don't have Godzilla here in Melbourne yet anyway). I couldn't
very quickly find any stats on "one-in-one-hundred year" events that
might actually destroy a datacentre. Availability is another question
altogether, which you probably know the Uptime Institute has specific
figures for tiers 1-4. But in my mind you should expect datacentre
power outages as an operational (rather than disaster) event, and
you'd want your Ceph cluster to survive them unscathed. If that
Copysets paper mentioned a while ago has any merit (see
http://hackingdistributed.com/2014/02/14/chainsets/ for more on that),
then it seems like the chances of drive loss following an availability
event are much higher than normal.

-- 
Cheers,
~Blairo


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson

On 8/28/2014 12:23 AM, Christian Balzer wrote:

On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:




On 27/08/2014 04:34, Christian Balzer wrote:


Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:


Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?



I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.

I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time, 4 minutes for
evacuating the OSD (having 7 write targets clearly helped) for measly
12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.


That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of a data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replica
stays the same at all times. There is not a sudden drop in the number of
replica  which is what I had in mind.


That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.


If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD although
nothing prevents the same OSD to be used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.

Am I being too optimistic ?

Vastly.


Do you see another blocking factor that
would significantly slow down recovery ?


As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.


Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD 
journals on SSDs, it is insufficient to calculate single-disk 
replacement backfill time based solely on network throughput. IOPS will 
likely be the limiting factor when backfilling a single failed spinner 
in a production cluster.


Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio 
of 3:1), with dual 1GbE bonded NICs.


> Using only the throughput math, backfill could have theoretically 
completed in a bit over 2.5 hours, but it actually took 15 hours. I've 
done this a few times with similar results.


Why? Spindle contention on the replacement drive. Graph the '%util' 
metric from something like 'iostat -xt 2' during a single disk backfill 
to get a very clear view that spindle contention is the true limiting 
factor. It'll be pegged at or near 100% if spindle contention is the issue.
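
For example, something like this during the backfill (sdb is just a 
placeholder for the replacement OSD's data disk):

---
iostat -xt 2 | awk '/^sdb /{print $1, $NF}'   # %util is the last column
---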


- Mike


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Loic Dachary


On 28/08/2014 06:23, Christian Balzer wrote:
> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
> 
>>
>>
>> On 27/08/2014 04:34, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
>>>
 Hi Craig,

 I assume the reason for the 48 hours recovery time is to keep the cost
 of the cluster low ? I wrote "1h recovery time" because it is roughly
 the time it would take to move 4TB over a 10Gb/s link. Could you
 upgrade your hardware to reduce the recovery time to less than two
 hours ? Or are there factors other than cost that prevent this ?

>>>
>>> I doubt Craig is operating on a shoestring budget.
>>> And even if his network were to be just GbE, that would still make it
>>> only 10 hours according to your wishful thinking formula.
>>>
>>> He probably has set the max_backfills to 1 because that is the level of
>>> I/O his OSDs can handle w/o degrading cluster performance too much.
>>> The network is unlikely to be the limiting factor.
>>>
>>> The way I see it most Ceph clusters are in sort of steady state when
>>> operating normally, i.e. a few hundred VM RBD images ticking over, most
>>> actual OSD disk ops are writes, as nearly all hot objects that are
>>> being read are in the page cache of the storage nodes.
>>> Easy peasy.
>>>
>>> Until something happens that breaks this routine, like a deep scrub,
>>> all those VMs rebooting at the same time or a backfill caused by a
>>> failed OSD. Now all of a sudden client ops compete with the backfill
>>> ops, page caches are no longer hot, the spinners are seeking left and
>>> right. Pandemonium.
>>>
>>> I doubt very much that even with a SSD backed cluster you would get
>>> away with less than 2 hours for 4TB.
>>>
>>> To give you some real life numbers, I currently am building a new
>>> cluster but for the time being have only one storage node to play with.
>>> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
>>> actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
>>>
>>> So I took out one OSD (reweight 0 first, then the usual removal steps)
>>> because the actual disk was wonky. Replaced the disk and re-added the
>>> OSD. Both operations took about the same time, 4 minutes for
>>> evacuating the OSD (having 7 write targets clearly helped) for measly
>>> 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
>>> OSD. And that is on one node (thus no network latency) that has the
>>> default parameters (so a max_backfill of 10) which was otherwise
>>> totally idle. 
>>>
>>> In other words, in this pretty ideal case it would have taken 22 hours
>>> to re-distribute 4TB.
>>
>> That makes sense to me :-) 
>>
>> When I wrote 1h, I thought about what happens when an OSD becomes
>> unavailable with no planning in advance. In the scenario you describe
>> the risk of a data loss does not increase since the objects are evicted
>> gradually from the disk being decommissioned and the number of replica
>> stays the same at all times. There is not a sudden drop in the number of
>> replica  which is what I had in mind.
>>
> That may be, but I'm rather certain that there is no difference in speed
> and priority of a rebalancing caused by an OSD set to weight 0 or one
> being set out.
> 
>> If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
>> will start transferring a new replica of the objects they have to the
>> new OSD in their PG. The replacement will not be a single OSD although
>> nothing prevents the same OSD to be used in more than one PG as a
>> replacement for the lost one. If the cluster network is connected at
>> 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
>> duplicates do not originate from a single OSD but from at least dozens
>> of them and since they target more than one OSD, I assume we can expect
>> an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
>> account for the fact that the cluster network is never idle.
>>
>> Am I being too optimistic ? 
> Vastly.
> 
>> Do you see another blocking factor that
>> would significantly slow down recovery ?
>>
> As Craig and I keep telling you, the network is not the limiting factor.
> Concurrent disk IO is, as I pointed out in the other thread.
> 
> Another example if you please:
> My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs. 
> 1 GbE links for client and cluster respectively.
> ---
> #ceph -s
> cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
>  health HEALTH_OK
>  monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, 
> quorum 0 irt03
>  osdmap e1206: 4 osds: 4 up, 4 in
>   pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
> 141 GB used, 2323 GB / 2464 GB avail
>  256 active+clean
> ---
> replication size is 2, it can do about 60MB/s writes with rados bench from
> a client.
> 
> Setting one OSD out (the data distribution is nearly un

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-27 Thread Christian Balzer
On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:

> 
> 
> On 27/08/2014 04:34, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> > 
> >> Hi Craig,
> >>
> >> I assume the reason for the 48 hours recovery time is to keep the cost
> >> of the cluster low ? I wrote "1h recovery time" because it is roughly
> >> the time it would take to move 4TB over a 10Gb/s link. Could you
> >> upgrade your hardware to reduce the recovery time to less than two
> >> hours ? Or are there factors other than cost that prevent this ?
> >>
> > 
> > I doubt Craig is operating on a shoestring budget.
> > And even if his network were to be just GbE, that would still make it
> > only 10 hours according to your wishful thinking formula.
> > 
> > He probably has set the max_backfills to 1 because that is the level of
> > I/O his OSDs can handle w/o degrading cluster performance too much.
> > The network is unlikely to be the limiting factor.
> > 
> > The way I see it most Ceph clusters are in sort of steady state when
> > operating normally, i.e. a few hundred VM RBD images ticking over, most
> > actual OSD disk ops are writes, as nearly all hot objects that are
> > being read are in the page cache of the storage nodes.
> > Easy peasy.
> > 
> > Until something happens that breaks this routine, like a deep scrub,
> > all those VMs rebooting at the same time or a backfill caused by a
> > failed OSD. Now all of a sudden client ops compete with the backfill
> > ops, page caches are no longer hot, the spinners are seeking left and
> > right. Pandemonium.
> > 
> > I doubt very much that even with a SSD backed cluster you would get
> > away with less than 2 hours for 4TB.
> > 
> > To give you some real life numbers, I currently am building a new
> > cluster but for the time being have only one storage node to play with.
> > It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
> > actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
> > 
> > So I took out one OSD (reweight 0 first, then the usual removal steps)
> > because the actual disk was wonky. Replaced the disk and re-added the
> > OSD. Both operations took about the same time, 4 minutes for
> > evacuating the OSD (having 7 write targets clearly helped) for measly
> > 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
> > OSD. And that is on one node (thus no network latency) that has the
> > default parameters (so a max_backfill of 10) which was otherwise
> > totally idle. 
> > 
> > In other words, in this pretty ideal case it would have taken 22 hours
> > to re-distribute 4TB.
> 
> That makes sense to me :-) 
> 
> When I wrote 1h, I thought about what happens when an OSD becomes
> unavailable with no planning in advance. In the scenario you describe
> the risk of a data loss does not increase since the objects are evicted
> gradually from the disk being decommissioned and the number of replica
> stays the same at all times. There is not a sudden drop in the number of
> replica  which is what I had in mind.
> 
That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.

> If the lost OSD was part of 100 PG, the other disks (let say 50 of them)
> will start transferring a new replica of the objects they have to the
> new OSD in their PG. The replacement will not be a single OSD although
> nothing prevents the same OSD to be used in more than one PG as a
> replacement for the lost one. If the cluster network is connected at
> 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
> duplicates do not originate from a single OSD but from at least dozens
> of them and since they target more than one OSD, I assume we can expect
> an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
> account for the fact that the cluster network is never idle.
> 
> Am I being too optimistic ? 
Vastly.

> Do you see another blocking factor that
> would significantly slow down recovery ?
> 
As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.

Another example if you please:
My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs. 
1 GbE links for client and cluster respectively.
---
#ceph -s
cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
 health HEALTH_OK
 monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 
0 irt03
 osdmap e1206: 4 osds: 4 up, 4 in
  pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
141 GB used, 2323 GB / 2464 GB avail
 256 active+clean
---
replication size is 2, it can do about 60MB/s writes with rados bench from
a client.

Setting one OSD out (the data distribution is nearly uniform) it took 12
minutes to recover on a completely idle (no clients connected) cluster.
The disk utilization was 

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-27 Thread Craig Lewis
I am using GigE.  I'm building a cluster using existing hardware, and the
network hasn't been my bottleneck (yet).

I've benchmarked the single disk recovery speed as about 50 MB/s, using max
backfills = 4, with SSD journals.  If I go higher, the disk bandwidth
increases slightly, and the latency starts increasing.
 At max backfills = 10, I regularly see OSD latency hit the 1 second mark.
 With max backfills = 4, OSD latency is pretty much the same as max
backfills = 1.  I haven't tested 5-9 yet.
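
For reference, I change that setting on the fly while testing, with
something like this (the value is just whatever I'm comfortable with at
the time):

---
ceph tell osd.* injectargs '--osd-max-backfills 4'
---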

I'm tracking latency by polling the OSD perf numbers every minute,
recording the delta from the previous poll, and calculating the average
latency over the last minute.  Given that it's an average over the last
minute, a 1 second average latency is way too high.  That usually means one
operation took > 30 seconds, and the other operations were mostly ok.  It's
common to see blocked operations in ceph -w when latency is this high.
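
The polling itself is nothing fancy -- roughly the sketch below. It assumes
access to the OSD admin socket and the jq utility, and the counter names
(osd.op_latency.sum / avgcount here) can differ between Ceph versions:

---
#!/bin/bash
# print the average op latency over each one-minute window for a single OSD
sock=/var/run/ceph/ceph-osd.0.asok
prev_sum=""; prev_cnt=""
while sleep 60; do
  sum=$(ceph --admin-daemon $sock perf dump | jq '.osd.op_latency.sum')
  cnt=$(ceph --admin-daemon $sock perf dump | jq '.osd.op_latency.avgcount')
  if [ -n "$prev_cnt" ] && [ "$cnt" -gt "$prev_cnt" ]; then
    awk -v s="$sum" -v ps="$prev_sum" -v c="$cnt" -v pc="$prev_cnt" \
      'BEGIN { printf "avg op latency over the last minute: %.3f s\n", (s-ps)/(c-pc) }'
  fi
  prev_sum=$sum; prev_cnt=$cnt
done
---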


Using 50 MB/s for a single disk, that takes at least 14 hours to rebuild my
disks (4TB disk, 60% full).  If I'm not sitting in front of the computer, I
usually only run 2 backfills.  I'm very paranoid, due to some problems I
had early in the production release.  Most of these problems were caused by
64k XFS inodes, not Ceph.  But I have things working now, so I'm hesitant
to change anything.  :-)
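
(Going back to the rebuild estimate above, the 14-hour figure is just the
straight throughput math, e.g.:)

---
awk 'BEGIN { printf "%.1f hours\n", (4e12 * 0.6) / 50e6 / 3600 }'   # ~13.3 h at 50 MB/s
---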




On Tue, Aug 26, 2014 at 11:21 AM, Loic Dachary  wrote:

> Hi Craig,
>
> I assume the reason for the 48 hours recovery time is to keep the cost of
> the cluster low ? I wrote "1h recovery time" because it is roughly the time
> it would take to move 4TB over a 10Gb/s link. Could you upgrade your
> hardware to reduce the recovery time to less than two hours ? Or are there
> factors other than cost that prevent this ?
>
> Cheers
>
> On 26/08/2014 19:37, Craig Lewis wrote:
> > My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
> backfills = 1).   I believe that increases my risk of failure by 48^2 .
> Since your numbers are failure rate per hour per disk, I need to consider
> the risk for the whole time for each disk.  So more formally, rebuild time
> to the power of (replicas -1).
> >
> > So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
> higher risk than 1 / 10^8.
> >
> >
> > A risk of 1/43,000 means that I'm more likely to lose data due to human
> error than disk failure.  Still, I can put a small bit of effort in to
> optimize recovery speed, and lower this number.  Managing human error is
> much harder.
> >
> >
> >
> >
> >
> >
> > On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <l...@dachary.org> wrote:
> >
> > Using percentages instead of numbers led me to calculation errors.
> Here it is again using 1/100 instead of % for clarity ;-)
> >
> > Assuming that:
> >
> > * The pool is configured for three replicas (size = 3 which is the
> default)
> > * It takes one hour for Ceph to recover from the loss of a single OSD
> > * Any other disk has a 1/100,000 chance to fail within the hour
> following the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> 1/100,000
> > * A given disk does not participate in more than 100 PG
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-27 Thread Loic Dachary


On 27/08/2014 04:34, Christian Balzer wrote:
> 
> Hello,
> 
> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> 
>> Hi Craig,
>>
>> I assume the reason for the 48 hours recovery time is to keep the cost
>> of the cluster low ? I wrote "1h recovery time" because it is roughly
>> the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
>> your hardware to reduce the recovery time to less than two hours ? Or
>> are there factors other than cost that prevent this ?
>>
> 
> I doubt Craig is operating on a shoestring budget.
> And even if his network were to be just GbE, that would still make it only
> 10 hours according to your wishful thinking formula.
> 
> He probably has set the max_backfills to 1 because that is the level of
> I/O his OSDs can handle w/o degrading cluster performance too much.
> The network is unlikely to be the limiting factor.
> 
> The way I see it most Ceph clusters are in sort of steady state when
> operating normally, i.e. a few hundred VM RBD images ticking over, most
> actual OSD disk ops are writes, as nearly all hot objects that are being
> read are in the page cache of the storage nodes.
> Easy peasy.
> 
> Until something happens that breaks this routine, like a deep scrub, all
> those VMs rebooting at the same time or a backfill caused by a failed OSD.
> Now all of a sudden client ops compete with the backfill ops, page caches
> are no longer hot, the spinners are seeking left and right. 
> Pandemonium.
> 
> I doubt very much that even with a SSD backed cluster you would get away
> with less than 2 hours for 4TB.
> 
> To give you some real life numbers, I currently am building a new cluster
> but for the time being have only one storage node to play with.
> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8 actual
> OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
> 
> So I took out one OSD (reweight 0 first, then the usual removal steps)
> because the actual disk was wonky. Replaced the disk and re-added the OSD.
> Both operations took about the same time, 4 minutes for evacuating the OSD
> (having 7 write targets clearly helped) for measly 12GB or about 50MB/s
> and 5 minutes or about 35MB/s for refilling the OSD. 
> And that is on one node (thus no network latency) that has the default
> parameters (so a max_backfill of 10) which was otherwise totally idle. 
> 
> In other words, in this pretty ideal case it would have taken 22 hours
> to re-distribute 4TB.

That makes sense to me :-) 

When I wrote 1h, I thought about what happens when an OSD becomes unavailable 
with no planning in advance. In the scenario you describe the risk of a data 
loss does not increase since the objects are evicted gradually from the disk 
being decommissioned and the number of replica stays the same at all times. 
There is not a sudden drop in the number of replica  which is what I had in 
mind.

If the lost OSD was part of 100 PG, the other disks (let say 50 of them) will 
start transferring a new replica of the objects they have to the new OSD in 
their PG. The replacement will not be a single OSD although nothing prevents 
the same OSD to be used in more than one PG as a replacement for the lost one. 
If the cluster network is connected at 10Gb/s and is 50% busy at all times, 
that leaves 5Gb/s. Since the new duplicates do not originate from a single OSD 
but from at least dozens of them and since they target more than one OSD, I 
assume we can expect an actual throughput of 5Gb/s. I should have written 2h 
instead of 1h to account for the fact that the cluster network is never idle.
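
For reference, the raw wire-speed math behind those figures (it deliberately
ignores disk IOPS, which is exactly the objection being raised):

---
awk 'BEGIN {
  bytes = 4e12                                   # 4 TB to re-create
  printf "10 Gb/s : %4.1f h\n", bytes/(10e9/8)/3600
  printf " 5 Gb/s : %4.1f h\n", bytes/( 5e9/8)/3600
  printf " 1 GbE  : %4.1f h\n", bytes/( 1e9/8)/3600
}'
---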

Am I being too optimistic ? Do you see another blocking factor that would 
significantly slow down recovery ?

Cheers

> More in another reply.
> 
>> Cheers
>>
>> On 26/08/2014 19:37, Craig Lewis wrote:
>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
>>> max backfills = 1).   I believe that increases my risk of failure by
>>> 48^2 .  Since your numbers are failure rate per hour per disk, I need
>>> to consider the risk for the whole time for each disk.  So more
>>> formally, rebuild time to the power of (replicas -1).
>>>
>>> So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
>>> higher risk than 1 / 10^8.
>>>
>>>
>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>> human error than disk failure.  Still, I can put a small bit of effort
>>> in to optimize recovery speed, and lower this number.  Managing human
>>> error is much harder.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary wrote:
>>>
>>> Using percentages instead of numbers led me to calculation
>>> errors. Here it is again using 1/100 instead of % for clarity ;-)
>>>
>>> Assuming that:
>>>
>>> * The pool is configured for three replicas (size = 3 which is the
>>> default)
>>> * It takes one hour for Ceph to recover from the loss of a single
>>> OSD
>>> * Any other disk has

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer
> > event independent of Ceph has a greater probability of permanently
> > destroying the data ? A month ago I lost three machines in a Ceph
> > cluster and realized on that occasion that the crushmap was not
> > configured properly and that PG were lost as a result. Fortunately I
> > was able to recover the disks and plug them in another machine to
> > recover the lost PGs. I'm not a system administrator and the
> > probability of me failing to do the right thing is higher than normal:
> > this is just an example of a high probability event leading to data
> > loss. In other words, I wonder if this 0.0001% chance of losing a PG
> > within the hour following a disk failure matters or if it is dominated
> > by other factors. What do you think ?
> > 
> > Cheers
> 
> On 26/08/2014 15:25, Loic Dachary wrote:> Hi Blair,
> > 
> > Assuming that:
> > 
> > * The pool is configured for three replicas (size = 3 which is the
> > default)
> > * It takes one hour for Ceph to recover from the loss of a single OSD
> > * Any other disk has a 0.001% chance to fail within the hour following
> > the failure of the first disk (assuming AFR
> > https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> > 10%, divided by the number of hours during a year).
> > * A given disk does not participate in more than 100 PG
> > 
> > Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance
> > that two other disks are lost before recovery. Since the disk that
> > failed initially participates in 100 PG, that is 0.01% x 100 =
> > 0.0001% chance that a PG is lost. Or the entire pool if it is used in
> > a way that losing a PG means losing all data in the pool (as in your
> > example, where it contains RBD volumes and each of the RBD volume uses
> > all the available PG).
> > 
> > If the pool is using at least two datacenters operated by two
> > different organizations, this calculation makes sense to me. However,
> > if the cluster is in a single datacenter, isn't it possible that some
> > event independent of Ceph has a greater probability of permanently
> > destroying the data ? A month ago I lost three machines in a Ceph
> > cluster and realized on that occasion that the crushmap was not
> > configured properly and that PG were lost as a result. Fortunately I
> > was able to recover the disks and plug them in another machine to
> > recover the lost PGs. I'm not a system administrator and the
> > probability of me failing to do the right thing is higher than normal:
> > this is just an example of a high probability event leading to data
> > loss. In other words, I wonder if this 0.0001% chance of losing a PG
> > within the hour following a disk failure matters or if it is dominated
> > by other factors. What do you think ?
> > 
> > Cheers
> > 
> > On 26/08/2014 02:23, Blair Bethwaite wrote:
> >>> Message: 25
> >>> Date: Fri, 15 Aug 2014 15:06:49 +0200
> >>> From: Loic Dachary 
> >>> To: Erik Logtenberg , ceph-users@lists.ceph.com
> >>> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> >>> Message-ID: <53ee05e9.1040...@dachary.org>
> >>> Content-Type: text/plain; charset="iso-8859-1"
> >>> ...
> >>> Here is how I reason about it, roughly:
> >>>
> >>> If the probability of losing a disk is 0.1%, the probability of
> >>> losing two disks simultaneously (i.e. before the failure can be
> >>> recovered) would be 0.1*0.1 = 0.01% and three disks becomes
> >>> 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
> >>
> >> I watched this conversation and an older similar one (Failure
> >> probability with largish deployments) with interest as we are in the
> >> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> >> been trying to wrap my head around these issues.
> >>
> >> Loic's reasoning (above) seems sound as a naive approximation assuming
> >> independent probabilities for disk failures, which may not be quite
> >> true given potential for batch production issues, but should be okay
> >> for other sorts of correlations (assuming a sane crushmap that
> >> eliminates things like controllers and nodes as sources of
> >> correlation).
> >>
> >> One of the things that came up in the "Failure probability with
> >> largish deployments" thread and has raised its head again here is the
> >> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> >> be somehow more prone to

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

> Hi Craig,
> 
> I assume the reason for the 48 hours recovery time is to keep the cost
> of the cluster low ? I wrote "1h recovery time" because it is roughly
> the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
> your hardware to reduce the recovery time to less than two hours ? Or
> are there factors other than cost that prevent this ?
> 

I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it only
10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it, most Ceph clusters are in a sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are being
read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub, all
those VMs rebooting at the same time or a backfill caused by a failed OSD.
Now all of a sudden client ops compete with the backfill ops, page caches
are no longer hot, the spinners are seeking left and right. 
Pandemonium.

I doubt very much that even with an SSD-backed cluster you would get away
with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new cluster
but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8 actual
OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the OSD.
Both operations took about the same time, 4 minutes for evacuating the OSD
(having 7 write targets clearly helped) for a measly 12GB, or about 50MB/s,
and 5 minutes, or about 35MB/s, for refilling the OSD.
And that is on one node (thus no network latency) that has the default
parameters (so a max_backfill of 10) which was otherwise totally idle. 

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
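
To make that extrapolation explicit, here is a minimal Python sketch; it
assumes the measured backfill throughput is the only bottleneck and holds
for the whole rebuild, which is optimistic:

# Back-of-the-envelope rebuild time from an observed backfill rate.
def rebuild_hours(data_tb, throughput_mb_s):
    mb = data_tb * 1000 * 1000           # TB -> MB (decimal units)
    return mb / throughput_mb_s / 3600   # seconds -> hours

print(rebuild_hours(4, 50))   # ~22 hours at the 50MB/s measured above
print(rebuild_hours(4, 35))   # ~32 hours at the 35MB/s refill rate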

More in another reply.

> Cheers
> 
> On 26/08/2014 19:37, Craig Lewis wrote:
> > My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
> > max backfills = 1).   I believe that increases my risk of failure by
> > 48^2 .  Since your numbers are failure rate per hour per disk, I need
> > to consider the risk for the whole time for each disk.  So more
> > formally, rebuild time to the power of (replicas -1).
> > 
> > So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
> > higher risk than 1 / 10^8.
> > 
> > 
> > A risk of 1/43,000 means that I'm more likely to lose data due to
> > human error than disk failure.  Still, I can put a small bit of effort
> > in to optimize recovery speed, and lower this number.  Managing human
> > error is much harder.
> > 
> > 
> > 
> > 
> > 
> > 
> > On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary wrote:
> > 
> > Using percentages instead of numbers led me to calculation
> > errors. Here it is again using 1/100 instead of % for clarity ;-)
> > 
> > Assuming that:
> > 
> > * The pool is configured for three replicas (size = 3 which is the
> > default)
> > * It takes one hour for Ceph to recover from the loss of a single
> > OSD
> > * Any other disk has a 1/100,000 chance to fail within the hour
> > following the failure of the first disk (assuming AFR
> > https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> > 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> > 1/100,000
> > * A given disk does not participate in more than 100 PG
> > 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost of the 
cluster low ? I wrote "1h recovery time" because it is roughly the time it 
would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to 
reduce the recovery time to less than two hours ? Or are there factors other 
than cost that prevent this ?
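
The "1h" figure is just raw wire speed; a quick Python sketch of that
arithmetic (ignoring protocol overhead, seeks and backfill throttling, which
is why real rebuilds take much longer):

# Time to move a given amount of data at a given link speed, wire rate only.
def transfer_hours(data_tb, link_gbit_s):
    bytes_total = data_tb * 1e12
    bytes_per_s = link_gbit_s * 1e9 / 8     # 10Gb/s ~= 1.25GB/s
    return bytes_total / bytes_per_s / 3600

print(transfer_hours(4, 10))   # ~0.9 hours over 10Gb/s
print(transfer_hours(4, 1))    # ~8.9 hours over plain GbE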

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:
> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max 
> backfills = 1).   I believe that increases my risk of failure by 48^2 .  
> Since your numbers are failure rate per hour per disk, I need to consider the 
> risk for the whole time for each disk.  So more formally, rebuild time to the 
> power of (replicas -1).
> 
> So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much higher 
> risk than 1 / 10^8.
> 
> 
> A risk of 1/43,000 means that I'm more likely to lose data due to human error 
> than disk failure.  Still, I can put a small bit of effort in to optimize 
> recovery speed, and lower this number.  Managing human error is much harder.
> 
> 
> 
> 
> 
> 
> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary wrote:
> 
> Using percentages instead of numbers led me to calculation errors. Here 
> it is again using 1/100 instead of % for clarity ;-)
> 
> Assuming that:
> 
> * The pool is configured for three replicas (size = 3 which is the 
> default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 1/100,000 chance to fail within the hour following 
> the failure of the first disk (assuming AFR 
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, 
> divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000
> * A given disk does not participate in more than 100 PG
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Craig Lewis
My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
backfills = 1).   I believe that increases my risk of failure by 48^2 .
 Since your numbers are failure rate per hour per disk, I need to consider
the risk for the whole time for each disk.  So more formally, rebuild time
to the power of (replicas -1).

So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
higher risk than 1 / 10^8.
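
Spelled out as a small Python sketch (reusing Loic's assumptions of a
1/100,000 per-disk hourly failure rate and 100 PGs per OSD, but with a 48
hour rebuild window):

# Reproduce the 1/43,000 figure.
p_hour = 1.0 / 100000          # per-disk failure probability per hour
window = 48                    # hours to rebuild the failed OSD
pgs = 100                      # PGs the failed OSD participated in
p_two_more = (window * p_hour) ** 2    # two further disks lost within the window
print(p_two_more * pgs)        # 2.304e-05 = 2304/100,000,000 ~= 1/43,000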


A risk of 1/43,000 means that I'm more likely to lose data due to human
error than disk failure.  Still, I can put a small bit of effort in to
optimize recovery speed, and lower this number.  Managing human error is
much harder.






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary  wrote:

> Using percentages instead of numbers led me to calculation errors. Here
> it is again using 1/100 instead of % for clarity ;-)
>
> Assuming that:
>
> * The pool is configured for three replicas (size = 3 which is the default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 1/100,000 chance to fail within the hour following
> the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> 1/100,000
> * A given disk does not participate in more than 100 PG
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
> Or the entire pool if it is used in a way that losing a PG means 
> losing all data in the pool (as in your example, where it contains RBD 
> volumes and each of the RBD volumes uses all the available PG).
> 
> If the pool is using at least two datacenters operated by two different 
> organizations, this calculation makes sense to me. However, if the cluster is 
> in a single datacenter, isn't it possible that some event independent of Ceph 
> has a greater probability of permanently destroying the data ? A month ago I 
> lost three machines in a Ceph cluster and realized on that occasion that the 
> crushmap was not configured properly and that PG were lost as a result. 
> Fortunately I was able to recover the disks and plug them in another machine 
> to recover the lost PGs. I'm not a system administrator and the probability 
> of me failing to do the right thing is higher than normal: this is just an 
> example of a high probability event leading to data loss. In other words, I 
> wonder if this 0.0001% chance of losing a PG within the hour following a disk 
> failure matters or if it is dominated by other factors. What do you think ?
> 
> Cheers
> 
> On 26/08/2014 02:23, Blair Bethwaite wrote:
>>> Message: 25
>>> Date: Fri, 15 Aug 2014 15:06:49 +0200
>>> From: Loic Dachary 
>>> To: Erik Logtenberg , ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
>>> Message-ID: <53ee05e9.1040...@dachary.org>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>> ...
>>> Here is how I reason about it, roughly:
>>>
>>> If the probability of loosing a disk is 0.1%, the probability of loosing 
>>> two disks simultaneously (i.e. before the failure can be recovered) would 
>>> be 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four 
>>> disks becomes 0.0001%
>>
>> I watched this conversation and an older similar one (Failure
>> probability with largish deployments) with interest as we are in the
>> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
>> been trying to wrap my head around these issues.
>>
>> Loic's reasoning (above) seems sound as a naive approximation assuming
>> independent probabilities for disk failures, which may not be quite
>> true given potential for batch production issues, but should be okay
>> for other sorts of correlations (assuming a sane crushmap that
>> eliminates things like controllers and nodes as sources of
>> correlation).
>>
>> One of the things that came up in the "Failure probability with
>> largish deployments" thread and has raised its head again here is the
>> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
>> be somehow more prone to data-loss than non-striped. I don't think
>> anyone has so far provided an answer on this, so here's my thinking...
>>
>> The level of atomicity that matters when looking at durability &
>> availability in Ceph is the Placement Group. For any non-trivial RBD
>> it is likely that many RBDs will span all/most PGs, e.g., even a
>> relatively small 50GiB volume would (with default 4MiB object size)
>> span 12800 PGs - more than there are in many production clusters
>> obeying the 100-200 PGs per drive rule of thumb. Losing any
>> one PG will cause data-loss. The failure-probability effects of
>> striping across multiple PGs are immaterial considering that loss of
>> any single PG is likely to damage all your RBDs. This
>> might be why the reliability calculator doesn't consider total number
>> of disks.
>>
>> Related to all this is the durability of 2 versus 3 replicas (or e.g.
>> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
>> fallacy that losing any M OSDs will cause data-loss, but this isn't
>> true - they have to be members of the same PG for data-loss to occur.
>> So then it's tempting to think the chances of that happening are so
>> slim as to not matter and why would we ever even need 3 replicas. I
>> mean, what are the odds of exactly those 2 drives, out of the
>> 100,200... in my cluster, failing in ?! But therein
>> lies the rub - you should be thinking about PGs. If a drive fails then
>> the chance of a data-loss event resulting are dependent on the chances
>> of losing further drives from the affected/degraded PGs.
>>
>> I've got a real cluster at hand, so let's use that as an example. We
>> have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
>> failure domains: r

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the 
failure of the first disk (assuming AFR 
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
divided by the number of hours during a year).
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
other disks are lost before recovery. Since the disk that failed initially 
participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
lost. Or the entire pool if it is used in a way that losing a PG means losing 
all data in the pool (as in your example, where it contains RBD volumes and 
each of the RBD volumes uses all the available PG).
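
Redone with plain probabilities instead of percentages (the correction posted
in the follow-up "Here it is again using 1/100 instead of %"), the same
assumptions give roughly 1 in 100 million, the 1 / 10^8 figure Craig's reply
compares against; a minimal Python sketch:

# Plain-probability version of the calculation above.
afr = 0.10                 # 10% annualised failure rate, as assumed above
p_hour = afr / 8760        # per-disk failure probability per hour (~1/100,000)
pgs = 100                  # PGs the failed OSD participated in
window = 1                 # hours to recover the lost OSD
print((window * p_hour) ** 2 * pgs)   # ~1.3e-08, roughly 1 in 100 million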

If the pool is using at least two datacenters operated by two different 
organizations, this calculation makes sense to me. However, if the cluster is 
in a single datacenter, isn't it possible that some event independent of Ceph 
has a greater probability of permanently destroying the data ? A month ago I 
lost three machines in a Ceph cluster and realized on that occasion that the 
crushmap was not configured properly and that PG were lost as a result. 
Fortunately I was able to recover the disks and plug them in another machine to 
recover the lost PGs. I'm not a system administrator and the probability of me 
failing to do the right thing is higher than normal: this is just an example of 
a high probability event leading to data loss. In other words, I wonder if this 
0.0001% chance of losing a PG within the hour following a disk failure matters 
or if it is dominated by other factors. What do you think ?

Cheers

On 26/08/2014 02:23, Blair Bethwaite wrote:
>> Message: 25
>> Date: Fri, 15 Aug 2014 15:06:49 +0200
>> From: Loic Dachary 
>> To: Erik Logtenberg , ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
>> Message-ID: <53ee05e9.1040...@dachary.org>
>> Content-Type: text/plain; charset="iso-8859-1"
>> ...
>> Here is how I reason about it, roughly:
>>
>> If the probability of losing a disk is 0.1%, the probability of losing two 
>> disks simultaneously (i.e. before the failure can be recovered) would be 
>> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
>> becomes 0.0001%
> 
> I watched this conversation and an older similar one (Failure
> probability with largish deployments) with interest as we are in the
> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> been trying to wrap my head around these issues.
> 
> Loic's reasoning (above) seems sound as a naive approximation assuming
> independent probabilities for disk failures, which may not be quite
> true given potential for batch production issues, but should be okay
> for other sorts of correlations (assuming a sane crushmap that
> eliminates things like controllers and nodes as sources of
> correlation).
> 
> One of the things that came up in the "Failure probability with
> largish deployments" thread and has raised its head again here is the
> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> be somehow more prone to data-loss than non-striped. I don't think
> anyone has so far provided an answer on this, so here's my thinking...
> 
> The level of atomicity that matters when looking at durability &
> availability in Ceph is the Placement Group. For any non-trivial RBD
> it is likely that many RBDs will span all/most PGs, e.g., even a
> relatively small 50GiB volume would (with default 4MiB object size)
> span 12800 PGs - more than there are in many production clusters
> obeying the 100-200 PGs per drive rule of thumb. Losing any
> one PG will cause data-loss. The failure-probability effects of
> striping across multiple PGs are immaterial considering that loss of
> any single PG is likely to damage all your RBDs. This
> might be why the reliability calculator doesn't consider total number
> of disks.
> 
> Related to all this is the durability of 2 versus 3 replicas (or e.g.
> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
> fallacy that losing any M OSDs will cause data-loss, but this isn't
> true - they have to be members of the same PG for data-loss to occur.
> So then it's tempting to think the chances of that happening are so
> slim as to not matter and why would we ever even need 3 replicas. I
> mean, what are the odds of exactly those 2 drives, out of the
> 100,200... in my cluster, failing in ?! But therein
> lies the rub - y

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

> > Message: 25
> > Date: Fri, 15 Aug 2014 15:06:49 +0200
> > From: Loic Dachary 
> > To: Erik Logtenberg , ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> > Message-ID: <53ee05e9.1040...@dachary.org>
> > Content-Type: text/plain; charset="iso-8859-1"
> > ...
> > Here is how I reason about it, roughly:
> >
> > If the probability of losing a disk is 0.1%, the probability of
> > losing two disks simultaneously (i.e. before the failure can be
> > recovered) would be 0.1*0.1 = 0.01% and three disks becomes
> > 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
> 
> I watched this conversation and an older similar one (Failure
> probability with largish deployments) with interest as we are in the
> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> been trying to wrap my head around these issues.
>
As the OP of the "Failure probability with largish deployments" thread I
have to thank Blair for raising this issue again and doing the hard math
below. Which looks fine to me.

At the end of that slightly inconclusive thread I walked away with the
same impression as Blair, namely that the survival of PGs is the key
factor and that they will likely be spread out over most, if not all the
OSDs.

Which in turn did reinforce my decision to deploy our first production
Ceph cluster based on nodes with 2 OSDs backed by 11-disk RAID6 sets behind
a HW RAID controller with 4GB cache AND SSD journals. 
I can live with the reduced performance (which is caused by the OSD code
running out of steam long before the SSDs or the RAIDs do), because not
only do I save 1/3rd of the space and 1/4th of the cost compared to a
replication 3 cluster, the total of disks that need to fail within the
recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design,
replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with
this cluster I won't have predictable I/O patterns and loads.
OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with
the odds here.

I think doing the exact maths for a cluster of the size you're planning
would be very interesting and also very much needed. 
3.5PB usable space would be close to 3000 disks with a replication of 3,
but even if you meant that as gross value it would probably mean that
you're looking at frequent, if not daily, disk failures.
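
Rough numbers for that (Python, assuming ~4TB drives and the 8% AFR used
elsewhere in this thread):

# Expected drive failures for ~3.5PB usable at replication 3.
usable_pb = 3.5
drive_tb = 4
afr = 0.08                                  # assumed annualised failure rate
drives = usable_pb * 1000 * 3 / drive_tb    # ~2600 drives
failures_per_year = drives * afr            # ~210
print(drives, failures_per_year, 365 / failures_per_year)  # a failure every ~1.7 days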


Regards,

Christian
> Loic's reasoning (above) seems sound as a naive approximation assuming
> independent probabilities for disk failures, which may not be quite
> true given potential for batch production issues, but should be okay
> for other sorts of correlations (assuming a sane crushmap that
> eliminates things like controllers and nodes as sources of
> correlation).
> 
> One of the things that came up in the "Failure probability with
> largish deployments" thread and has raised its head again here is the
> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> be somehow more prone to data-loss than non-striped. I don't think
> anyone has so far provided an answer on this, so here's my thinking...
> 
> The level of atomicity that matters when looking at durability &
> availability in Ceph is the Placement Group. For any non-trivial RBD
> it is likely that many RBDs will span all/most PGs, e.g., even a
> relatively small 50GiB volume would (with default 4MiB object size)
> span 12800 PGs - more than there are in many production clusters
> obeying the 100-200 PGs per drive rule of thumb. Losing any
> one PG will cause data-loss. The failure-probability effects of
> striping across multiple PGs are immaterial considering that loss of
> any single PG is likely to damage all your RBDs. This
> might be why the reliability calculator doesn't consider total number
> of disks.
> 
> Related to all this is the durability of 2 versus 3 replicas (or e.g.
> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
> fallacy that losing any M OSDs will cause data-loss, but this isn't
> true - they have to be members of the same PG for data-loss to occur.
> So then it's tempting to think the chances of that happening are so
> slim as to not matter and why would we ever even need 3 replicas. I
> mean, what are the odds of exactly those 2 drives, out of the
> 100,200... in my cluster, failing in ?! But therein
> lays the rub - you should be thinking about PGs. If a drive fails then
> the chance of a data-loss event resulting are dependent on the chances
> of losing further drives from the affected/degraded PGs.
> 
> I'v

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-25 Thread Blair Bethwaite
> Message: 25
> Date: Fri, 15 Aug 2014 15:06:49 +0200
> From: Loic Dachary 
> To: Erik Logtenberg , ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> Message-ID: <53ee05e9.1040...@dachary.org>
> Content-Type: text/plain; charset="iso-8859-1"
> ...
> Here is how I reason about it, roughly:
>
> If the probability of losing a disk is 0.1%, the probability of losing two 
> disks simultaneously (i.e. before the failure can be recovered) would be 
> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
> becomes 0.0001%

I watched this conversation and an older similar one (Failure
probability with largish deployments) with interest as we are in the
process of planning a pretty large Ceph cluster (~3.5 PB), so I have
been trying to wrap my head around these issues.

Loic's reasoning (above) seems sound as a naive approximation assuming
independent probabilities for disk failures, which may not be quite
true given potential for batch production issues, but should be okay
for other sorts of correlations (assuming a sane crushmap that
eliminates things like controllers and nodes as sources of
correlation).

One of the things that came up in the "Failure probability with
largish deployments" thread and has raised its head again here is the
idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
be somehow more prone to data-loss than non-striped. I don't think
anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability &
availability in Ceph is the Placement Group. For any non-trivial RBD
it is likely that many RBDs will span all/most PGs, e.g., even a
relatively small 50GiB volume would (with default 4MiB object size)
span 12800 PGs - more than there are in many production clusters
obeying the 100-200 PGs per drive rule of thumb. Losing any
one PG will cause data-loss. The failure-probability effects of
striping across multiple PGs are immaterial considering that loss of
any single PG is likely to damage all your RBDs. This
might be why the reliability calculator doesn't consider total number
of disks.

Related to all this is the durability of 2 versus 3 replicas (or e.g.
M>=1 for Erasure Coding). It's easy to get caught up in the worrying
fallacy that losing any M OSDs will cause data-loss, but this isn't
true - they have to be members of the same PG for data-loss to occur.
So then it's tempting to think the chances of that happening are so
slim as to not matter and why would we ever even need 3 replicas. I
mean, what are the odds of exactly those 2 drives, out of the
100,200... in my cluster, failing in ?! But therein
lies the rub - you should be thinking about PGs. If a drive fails then
the chance of a data-loss event resulting are dependent on the chances
of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We
have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
dies. How many PGs are now at risk:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
109 109 861
(NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump",
$15 is the acting set column)

109 PGs now "living on the edge". No surprises in that number as we
used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
on average any one OSD will be primary for 50 PGs and replica for
another 50. But this doesn't tell me how exposed I am, for that I need
to know how many "neighbouring" OSDs there are in these 109 PGs:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed
's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
 67  67 193
(NB: grep-ing for OSD "15" and using sed to remove it and surrounding
formatting to get just the neighbour id)

Yikes! So if any one of those 67 drives fails during recovery of OSD
15, then we've lost data. On average we should expect this to be
determined by our crushmap, which in this case splits the cluster up
into 2 top-level failure domains, so I'd guess it's the probability of
1 in 48 drives failing on average for this cluster. But actually
looking at the numbers for each OSD it is higher than that here - the
lowest distinct "neighbour" count we have is 50. Note that we haven't
tuned any of the options in our crushmap, so I guess maybe Ceph
favours fewer repeat sets by default when coming up with PGs(?).

Anyway, here's the average and top 10 neighbour counts (hope this
scripting is right! ;-):

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump 
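
A rough Python equivalent of the shell above (a sketch only; it assumes
pg.dump is the plain-text output of "ceph pg dump" with the acting set in
column 15, as in the commands above):

# Count the distinct OSDs that share at least one PG with a given OSD.
import re

def neighbours(path, pool_id, osd):
    peers = set()
    for line in open(path):
        if not line.startswith(pool_id + "."):
            continue
        acting = line.split()[14]                      # 15th column, e.g. "[15,67]"
        osds = [int(x) for x in re.findall(r"\d+", acting)]
        if osd in osds:
            peers.update(o for o in osds if o != osd)
    return peers

print(len(neighbours("pg.dump", "10", 15)))   # 67 for OSD 15 in the example above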

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary


On 15/08/2014 15:42, Erik Logtenberg wrote:
>>>
>>> I haven't done the actual calculations, but given some % chance of disk
>>> failure, I would assume that losing x out of y disks has roughly the
>>> same chance as losing 2*x out of 2*y disks over the same period.
>>>
>>> That's also why you generally want to limit RAID5 arrays to maybe 6
>>> disks or so and move to RAID6 for bigger arrays. For arrays bigger than
>>> 20 disks you would usually split those into separate arrays, just to
>>> keep the (parity disks / total disks) fraction high enough.
>>>
>>> With regard to data safety I would guess that 3+2 and 6+4 are roughly
>>> equal, although the behaviour of 6+4 is probably easier to predict
>>> because bigger numbers make your calculations less dependent on
>>> individual deviations in reliability.
>>>
>>> Do you guys feel this argument is valid?
>>
>> Here is how I reason about it, roughly:
>>
>> If the probability of losing a disk is 0.1%, the probability of losing two 
>> disks simultaneously (i.e. before the failure can be recovered) would be 
>> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
>> becomes 0.0001% 
>>
>> Accurately calculating the reliability of the system as a whole is a lot 
>> more complex (see 
>> https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/
>>  for more information).
>>
>> Cheers
> 
> Okay, I see that in your calculation, you leave the total amount of
> disks completely out of the equation. 

Yes. If you have a small number of disks I'm not sure how to calculate the 
durability. For instance if I have a 50-disk cluster within a rack, the 
durability is dominated by the probability that the rack is set on fire and 
increasing m from 3 to 5 is most certainly pointless ;-)

> The link you provided is very
> useful indeed and does some actual calculations. Interestingly, the
> example in the details page [1] uses k=32 and m=32 for a total of 64 blocks.
> Those are very much bigger values than Mark Nelson mentioned earlier. Is
> that example merely meant to demonstrate the theoretical advantages, or
> would you actually recommend using those numbers in practice?
> Let's assume that we have at least 64 OSD's available, would you
> recommend k=32 and m=32?

It is theoretical, I'm not aware of any Ceph use case requiring that kind of 
setting. There may be a use case though, it's not absurd, just not common. I 
would be happy to hear about it.

Cheers

> 
> [1]
> https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg
>>
>> I haven't done the actual calculations, but given some % chance of disk
>> failure, I would assume that losing x out of y disks has roughly the
>> same chance as losing 2*x out of 2*y disks over the same period.
>>
>> That's also why you generally want to limit RAID5 arrays to maybe 6
>> disks or so and move to RAID6 for bigger arrays. For arrays bigger than
>> 20 disks you would usually split those into separate arrays, just to
>> keep the (parity disks / total disks) fraction high enough.
>>
>> With regard to data safety I would guess that 3+2 and 6+4 are roughly
>> equal, although the behaviour of 6+4 is probably easier to predict
>> because bigger numbers make your calculations less dependent on
>> individual deviations in reliability.
>>
>> Do you guys feel this argument is valid?
> 
> Here is how I reason about it, roughly:
> 
> If the probability of losing a disk is 0.1%, the probability of losing two 
> disks simultaneously (i.e. before the failure can be recovered) would be 
> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
> becomes 0.0001% 
> 
> Accurately calculating the reliability of the system as a whole is a lot more 
> complex (see 
> https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ 
> for more information).
> 
> Cheers

Okay, I see that in your calculation, you leave the total amount of
disks completely out of the equation. The link you provided is very
useful indeed and does some actual calculations. Interestingly, the
example in the details page [1] uses k=32 and m=32 for a total of 64 blocks.
Those are very much bigger values than Mark Nelson mentioned earlier. Is
that example merely meant to demonstrate the theoretical advantages, or
would you actually recommend using those numbers in practice?
Let's assume that we have at least 64 OSD's available, would you
recommend k=32 and m=32?

[1]
https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary


On 15/08/2014 14:36, Erik Logtenberg wrote:
> Now, there are certain combinations of K and M that appear to have more
> or less the same result. Do any of these combinations have pro's and
> con's that I should consider and/or are there best practices for
> choosing the right K/M-parameters?
>
>>>
>>> Loic might have a better answer, but I think that the more segments (K)
>>> you have, the heavier recovery. You have to contact more OSDs to
>>> reconstruct the whole object so that involves more disks doing seeks.
>>>
>>> I heard somebody from Fujitsu say that he thought 8/3 was best for most
>>> situations. That wasn't with Ceph though, but with a different system
>>> which implemented Erasure Coding.
>>
>> Performance is definitely lower with more segments in Ceph.  I kind of
>> gravitate toward 4/2 or 6/2, though that's just my own preference.
> 
> This is indeed the kind of pro's and con's I was thinking about.
> Performance-wise, I would expect differences, but I can think of both
> positive and negative effects of bigger values for K.
> 
> For instance, yes recovery takes more OSD's with bigger values of K, but
> it seems to me that there are also fewer or smaller items to recover.
> Also read-performance generally appears to benefit from having a bigger
> cluster (more parallelism), so I can imagine that bigger values of K
> also provide an increase in read-performance.
> 
> Mark says more segments hurt performance though; are you referring just
> to rebuild-performance or also basic operational performance (read/write)?
> 
> For instance, if I choose K = 3 and M = 2, then pg's in this pool will
> use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
> this configuration.
>
> Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
> use 10 OSD's and sustain the loss of 4 OSD's, which is statistically
> not
> so much different from the first configuration. Also there is the same
> 40% overhead.

 Although I don't have numbers in mind, I think the odds of losing two
 OSD simultaneously are a lot smaller than the odds of losing four OSD
 simultaneously. Or am I misunderstanding you when you write
 "statistically not so much different from the first configuration" ?

>>>
>>> Losing two smaller than losing four? Is that correct or did you mean
>>> it the other way around?
>>>
>>> I'd say that losing four OSDs simultaneously is less likely to happen
>>> than two simultaneously.
>>
>> This is true, though the more disks you spread your objects across, the
>> higher likelihood that any given object will be affected by a lost OSD.
>>  The extreme case being that every object is spread across every OSD and
>> losing any given OSD affects all objects.  I suppose the severity
>> depends on the relative fraction of your erasure coding parameters
>> relative to the total number of OSDs.  I think this is perhaps what Erik
>> was getting at.
> 
> I haven't done the actual calculations, but given some % chance of disk
> failure, I would assume that losing x out of y disks has roughly the
> same chance as losing 2*x out of 2*y disks over the same period.
> 
> That's also why you generally want to limit RAID5 arrays to maybe 6
> disks or so and move to RAID6 for bigger arrays. For arrays bigger than
> 20 disks you would usually split those into separate arrays, just to
> keep the (parity disks / total disks) fraction high enough.
> 
> With regard to data safety I would guess that 3+2 and 6+4 are roughly
> equal, although the behaviour of 6+4 is probably easier to predict
> because bigger numbers make your calculations less dependent on
> individual deviations in reliability.
> 
> Do you guys feel this argument is valid?

Here is how I reason about it, roughly:

If the probability of losing a disk is 0.1%, the probability of losing two 
disks simultaneously (i.e. before the failure can be recovered) would be 
0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
becomes 0.0001% 

Accurately calculating the reliability of the system as a whole is a lot more 
complex (see 
https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ 
for more information).

Cheers

> 
> Erik.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary


On 15/08/2014 13:24, Wido den Hollander wrote:
> On 08/15/2014 12:23 PM, Loic Dachary wrote:
>> Hi Erik,
>>
>> On 15/08/2014 11:54, Erik Logtenberg wrote:
>>> Hi,
>>>
>>> With EC pools in Ceph you are free to choose any K and M parameters you
>>> like. The documentation explains what K and M do, so far so good.
>>>
>>> Now, there are certain combinations of K and M that appear to have more
>>> or less the same result. Do any of these combinations have pro's and
>>> con's that I should consider and/or are there best practices for
>>> choosing the right K/M-parameters?
>>>
> 
> Loic might have a better answer, but I think that the more segments (K) you 
> have, the heavier recovery. You have to contact more OSDs to reconstruct the 
> whole object so that involves more disks doing seeks.
> 
> I heard somebody from Fujitsu say that he thought 8/3 was best for most 
> situations. That wasn't with Ceph though, but with a different system which 
> implemented Erasure Coding.
> 
>>> For instance, if I choose K = 3 and M = 2, then pg's in this pool will
>>> use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
>>> this configuration.
>>>
>>> Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
>>> use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
>>> so much different from the first configuration. Also there is the same
>>> 40% overhead.
>>
>> Although I don't have numbers in mind, I think the odds of losing two OSD 
>> simultaneously are a lot smaller than the odds of losing four OSD 
>> simultaneously. Or am I misunderstanding you when you write "statistically 
>> not so much different from the first configuration" ?
>>
> 
> Losing two smaller than losing four? Is that correct or did you mean it the 
> other way around?


Right, sorry for the confusion, I meant the other way around :-)

> 
> I'd say that losing four OSDs simultaneously is less likely to happen than 
> two simultaneously.
> 
>> Cheers
>>
>>> One rather obvious difference between the two configurations is that the
>>> latter requires a cluster with at least 10 OSD's to make sense. But
>>> let's say we have such a cluster, which of the two configurations would
>>> be recommended, and why?
>>>
>>> Thanks,
>>>
>>> Erik.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg
 Now, there are certain combinations of K and M that appear to have more
 or less the same result. Do any of these combinations have pro's and
 con's that I should consider and/or are there best practices for
 choosing the right K/M-parameters?

>>
>> Loic might have a better answer, but I think that the more segments (K)
>> you have, the heavier recovery. You have to contact more OSDs to
>> reconstruct the whole object so that involves more disks doing seeks.
>>
>> I heard somebody from Fujitsu say that he thought 8/3 was best for most
>> situations. That wasn't with Ceph though, but with a different system
>> which implemented Erasure Coding.
> 
> Performance is definitely lower with more segments in Ceph.  I kind of
> gravitate toward 4/2 or 6/2, though that's just my own preference.

This is indeed the kind of pro's and con's I was thinking about.
Performance-wise, I would expect differences, but I can think of both
positive and negative effects of bigger values for K.

For instance, yes recovery takes more OSD's with bigger values of K, but
it seems to me that there are also fewer or smaller items to recover.
Also read-performance generally appears to benefit from having a bigger
cluster (more parallelism), so I can imagine that bigger values of K
also provide an increase in read-performance.

Mark says more segments hurt performance though; are you referring just
to rebuild-performance or also basic operational performance (read/write)?

 For instance, if I choose K = 3 and M = 2, then pg's in this pool will
 use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
 this configuration.

 Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
 use 10 OSD's and sustain the loss of 4 OSD's, which is statistically
 not
 so much different from the first configuration. Also there is the same
 40% overhead.
>>>
 Although I don't have numbers in mind, I think the odds of losing two
 OSD simultaneously are a lot smaller than the odds of losing four OSD
>>> simultaneously. Or am I misunderstanding you when you write
>>> "statistically not so much different from the first configuration" ?
>>>
>>
>> Losing two smaller than losing four? Is that correct or did you mean
>> it the other way around?
>>
>> I'd say that losing four OSDs simultaneously is less likely to happen
>> than two simultaneously.
> 
> This is true, though the more disks you spread your objects across, the
> higher likelihood that any given object will be affected by a lost OSD.
>  The extreme case being that every object is spread across every OSD and
> losing any given OSD affects all objects.  I suppose the severity
> depends on the relative fraction of your erasure coding parameters
> relative to the total number of OSDs.  I think this is perhaps what Erik
> was getting at.

I haven't done the actual calculations, but given some % chance of disk
failure, I would assume that losing x out of y disks has roughly the
same chance as losing 2*x out of 2*y disks over the same period.

That's also why you generally want to limit RAID5 arrays to maybe 6
disks or so and move to RAID6 for bigger arrays. For arrays bigger than
20 disks you would usually split those into separate arrays, just to
keep the (parity disks / total disks) fraction high enough.

With regard to data safety I would guess that 3+2 and 6+4 are roughly
equal, although the behaviour of 6+4 is probably easier to predict
because bigger numbers make your calculations less dependent on
individual deviations in reliability.

Do you guys feel this argument is valid?

Erik.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Mark Nelson

On 08/15/2014 06:24 AM, Wido den Hollander wrote:

On 08/15/2014 12:23 PM, Loic Dachary wrote:

Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:

Hi,

With EC pools in Ceph you are free to choose any K and M parameters you
like. The documentation explains what K and M do, so far so good.

Now, there are certain combinations of K and M that appear to have more
or less the same result. Do any of these combinations have pro's and
con's that I should consider and/or are there best practices for
choosing the right K/M-parameters?



Loic might have a better answer, but I think that the more segments (K)
you have, the heavier recovery. You have to contact more OSDs to
reconstruct the whole object so that involves more disks doing seeks.

I heard somebody from Fujitsu say that he thought 8/3 was best for most
situations. That wasn't with Ceph though, but with a different system
which implemented Erasure Coding.


Performance is definitely lower with more segments in Ceph.  I kind of 
gravitate toward 4/2 or 6/2, though that's just my own preference.





For instance, if I choose K = 3 and M = 2, then pg's in this pool will
use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
this configuration.

Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
so much different from the first configuration. Also there is the same
40% overhead.


Although I don't have numbers in mind, I think the odds of losing two
OSD simultaneously are a lot smaller than the odds of losing four OSD
simultaneously. Or am I misunderstanding you when you write
"statistically not so much different from the first configuration" ?



Losing two smaller than losing four? Is that correct or did you mean
it the other way around?

I'd say that losing four OSDs simultaneously is less likely to happen
than two simultaneously.


This is true, though the more disks you spread your objects across, the 
higher likelihood that any given object will be affected by a lost OSD. 
 The extreme case being that every object is spread across every OSD 
and losing any given OSD affects all objects.  I suppose the severity 
depends on the relative fraction of your erasure coding parameters 
relative to the total number of OSDs.  I think this is perhaps what Erik 
was getting at.





Cheers


One rather obvious difference between the two configurations is that the
latter requires a cluster with at least 10 OSD's to make sense. But
let's say we have such a cluster, which of the two configurations would
be recommended, and why?

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Wido den Hollander

On 08/15/2014 12:23 PM, Loic Dachary wrote:

Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:

Hi,

With EC pools in Ceph you are free to choose any K and M parameters you
like. The documentation explains what K and M do, so far so good.

Now, there are certain combinations of K and M that appear to have more
or less the same result. Do any of these combinations have pro's and
con's that I should consider and/or are there best practices for
choosing the right K/M-parameters?



Loic might have a better answer, but I think that the more segments (K) 
you have, the heavier recovery. You have to contact more OSDs to 
reconstruct the whole object so that involves more disks doing seeks.


I heard somebody from Fujitsu say that he thought 8/3 was best for most 
situations. That wasn't with Ceph though, but with a different system 
which implemented Erasure Coding.



For instance, if I choose K = 3 and M = 2, then pg's in this pool will
use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
this configuration.

Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
so much different from the first configuration. Also there is the same
40% overhead.


Although I don't have numbers in mind, I think the odds of losing two OSD simultaneously 
are a lot smaller than the odds of losing four OSD simultaneously. Or am I 
misunderstanding you when you write "statistically not so much different from the 
first configuration" ?



Losing two smaller than losing four? Is that correct or did you mean 
it the other way around?


I'd say that losing four OSDs simultaneously is less likely to happen 
than two simultaneously.



Cheers


One rather obvious difference between the two configurations is that the
latter requires a cluster with at least 10 OSD's to make sense. But
let's say we have such a cluster, which of the two configurations would
be recommended, and why?

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary
Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:
> Hi,
> 
> With EC pools in Ceph you are free to choose any K and M parameters you
> like. The documentation explains what K and M do, so far so good.
> 
> Now, there are certain combinations of K and M that appear to have more
> or less the same result. Do any of these combinations have pro's and
> con's that I should consider and/or are there best practices for
> choosing the right K/M-parameters?
> 
> For instance, if I choose K = 3 and M = 2, then pg's in this pool will
> use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
> this configuration.
> 
> Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
> use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
> so much different from the first configuration. Also there is the same
> 40% overhead.

Although I don't have numbers in mind, I think the odds of losing two OSD 
simultaneously are a lot smaller than the odds of losing four OSD 
simultaneously. Or am I misunderstanding you when you write "statistically not 
so much different from the first configuration" ?

Cheers

> One rather obvious difference between the two configurations is that the
> latter requires a cluster with at least 10 OSD's to make sense. But
> let's say we have such a cluster, which of the two configurations would
> be recommended, and why?
> 
> Thanks,
> 
> Erik.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg
Hi,

With EC pools in Ceph you are free to choose any K and M parameters you
like. The documentation explains what K and M do, so far so good.

Now, there are certain combinations of K and M that appear to have more
or less the same result. Do any of these combinations have pro's and
con's that I should consider and/or are there best practices for
choosing the right K/M-parameters?

For instance, if I choose K = 3 and M = 2, then pg's in this pool will
use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
this configuration.

Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
so much different from the first configuration. Also there is the same
40% overhead.
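
The overhead and the minimum cluster size follow directly from K and M; a
small Python sketch of that arithmetic (nothing Ceph-specific, just the
numbers behind the question):

# Storage overhead and minimum OSD count for a few K/M candidates.
def ec_profile(k, m):
    return {"min_osds": k + m,                     # one chunk per OSD
            "tolerates": m,                        # OSD losses survivable
            "overhead_pct": 100.0 * m / (k + m)}   # space spent on coding chunks

for k, m in [(3, 2), (6, 4), (4, 2), (6, 2), (8, 3)]:
    print((k, m), ec_profile(k, m))
# (3,2) and (6,4) both cost 40%, but (6,4) needs at least 10 OSDs and touches
# twice as many disks for every read, write and recovery.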

One rather obvious difference between the two configurations is that the
latter requires a cluster with at least 10 OSD's to make sense. But
let's say we have such a cluster, which of the two configurations would
be recommended, and why?

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com