On 8/28/2014 11:17 AM, Loic Dachary wrote:


On 28/08/2014 16:29, Mike Dawson wrote:
On 8/28/2014 12:23 AM, Christian Balzer wrote:
On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:



On 27/08/2014 04:34, Christian Balzer wrote:

Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote "1h recovery time" because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?
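
(Back-of-envelope for the 1h figure: 10Gb/s is roughly 1.25GB/s, and
4TB / 1.25GB/s ~= 3,200 seconds, i.e. a bit under an hour, assuming the
link is the only bottleneck.)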


I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.
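
For reference, the knob in question; a minimal sketch, either persistent in
ceph.conf or injected at runtime (the value 1 is just the example being
discussed, not a recommendation):
---
# ceph.conf, [osd] section:
#   osd max backfills = 1
# or at runtime:
ceph tell osd.* injectargs '--osd-max-backfills 1'
---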

The way I see it, most Ceph clusters are in a sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.

I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time: 4 minutes for
evacuating the OSD (having 7 write targets clearly helped), a measly
12GB or about 50MB/s, and 5 minutes or about 35MB/s for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
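
(Checking that with the ~50MB/s evacuation figure: 4TB is about
4,000,000MB, and 4,000,000MB / 50MB/s = 80,000 seconds, roughly 22 hours.)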

That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replicas
stays the same at all times. There is no sudden drop in the number of
replicas, which is what I had in mind.

That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.

If the lost OSD was part of 100 PGs, the other disks (let's say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD, although
nothing prevents the same OSD from being used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.
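
(For the record: 5Gb/s is about 625MB/s, so 4TB / 625MB/s ~= 6,400 seconds,
a bit under 2 hours, still assuming the network is the only bottleneck.)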

Am I being too optimistic ?
Vastly.

Do you see another blocking factor that
would significantly slow down recovery ?

As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.

Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD journals on 
SSDs, it is insufficient to calculate single-disk replacement backfill time 
based solely on network throughput. IOPS will likely be the limiting factor 
when backfilling a single failed spinner in a production cluster.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 
3:1), with dual 1GbE bonded NICs.

Using only the throughput math, backfill could theoretically have completed in 
a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times 
with similar results.
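
(The throughput math being: ~75% of 3TB is roughly 2.25TB, and a bonded pair
of 1GbE links is at best ~250MB/s, so 2,250,000MB / 250MB/s ~= 9,000 seconds,
about 2.5 hours. The observed 15 hours works out to an effective backfill
rate on the order of 40MB/s.)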

Why? Spindle contention on the replacement drive. Graph the '%util' metric from 
something like 'iostat -xt 2' during a single disk backfill to get a very clear 
view that spindle contention is the true limiting factor. It'll be pegged at or 
near 100% if spindle contention is the issue.
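
For example, to watch just the replacement drive (sdb is a placeholder; use
whatever data disk sits behind the OSD being backfilled):
---
iostat -xt sdb 2
---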

Hi Mike,

Did you by any chance also measure how long it took for the 3 replicas to be 
restored on all PGs in which the failed disk was participating ? I assume the 
following sequence happened:

A) The 3TB drive failed and contained ~2TB
B) The cluster recovered by creating new replicas
C) The new 3TB drive was installed
D) Backfilling completed

I'm interested in the time between A and B, i.e. when one copy is potentially 
lost forever, because this is when the probability of a permanent data loss 
increases. Although it is important to reduce the time between C and D to a 
minimum, it has no impact on the durability of the data.


Loic,

We use 3x replication and have drives that have relatively high steady-state IOPS. Therefore, we tend to prioritize client-side IO over quickly repairing the temporary drop from 3 copies to 2 during the loss of one disk. The disruption to client IO is so great on our cluster that we don't want it to be in a recovery state without operator supervision.

Letting OSDs get marked out without operator intervention was a disaster in the early going of our cluster. For example, an OSD daemon crash would trigger automatic recovery where it was unneeded. Ironically, that unneeded recovery would often trigger additional daemons to crash, making a bad situation worse. During the recovery, RBD client IO would often drop to zero.

To deal with this issue, we set "mon osd down out interval = 14400", so as operators we have 4 hours to intervene before Ceph attempts to self-heal. When hardware is at fault, we remove the osd, replace the drive, re-add the osd, then allow backfill to begin, thereby completely skipping step B in your timeline above.
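
For anyone wanting to copy this, a rough sketch of the configuration and the
manual replacement sequence (osd.12 is a placeholder id, and the exact
re-creation steps depend on how the OSD was deployed):
---
# ceph.conf: give operators a 4 hour window before Ceph marks a down OSD out
#   [global]
#   mon osd down out interval = 14400

# manual replacement of a failed disk, roughly:
ceph osd out 12                 # if the OSD is not already out
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
# ...swap the drive, re-create the OSD, then let backfill run
---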

- Mike

Cheers

- Mike



Another example if you please:
My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs.
1 GbE links for client and cluster respectively.
---
#ceph -s
      cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
       health HEALTH_OK
       monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, 
quorum 0 irt03
       osdmap e1206: 4 osds: 4 up, 4 in
        pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
              141 GB used, 2323 GB / 2464 GB avail
                   256 active+clean
---
replication size is 2; it can do about 60MB/s writes with rados bench from
a client.

Setting one OSD out (the data distribution is nearly uniform) it took 12
minutes to recover on a completely idle (no clients connected) cluster.
The disk utilization was 70-90%, the cluster network hovered around 20%,
never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
Given the ceph log numbers and the data size, I make this a recovery speed
of about 40MB/s or 13MB/s per OSD.
Better than I expected, but a far cry from what the OSDs could do
individually if they were not flooded with concurrent read and write
requests by the backfilling operation.
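
(Rough sanity check: ~141GB used across 4 OSDs is ~35GB on the OSD that was
set out, and re-creating ~35GB in ~12 minutes works out to 40-50MB/s
aggregate, consistent with the numbers above.)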

Now, more disks will help, but I very much doubt that this will scale
linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong please).

And this was an IDLE cluster.

Doing this on a cluster with just about 10 client IOPS per OSD would be
far worse. Never mind that people don't like their client IO to stall for
more than a few seconds.

Something that might improve this both in terms of speed and impact to
the clients would be something akin to the MD (Linux software RAID)
recovery logic.
As in, only one backfill operation per OSD (read or write, not both!) at
the same time.
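
As far as I know Ceph has nothing quite like that today; the closest existing
knobs just cap the concurrency per OSD, e.g. (values purely illustrative):
---
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
---
which still mixes reads and writes on the same spindle, unlike the MD logic.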

Regards,

Christian
Cheers

More in another reply.

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:
My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
max backfills = 1).   I believe that increases my risk of failure by
48^2 .  Since your numbers are failure rate per hour per disk, I need
to consider the risk for the whole time for each disk.  So more
formally, rebuild time to the power of (replicas -1).

So I'm at 2304/100,000,000, or approximately 1/43,000. That's a
much higher risk than 1 / 10^8.
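
(Spelled out: 48^2 = 2,304, and 2,304 / 100,000,000 ~= 1/43,400, rounded
to 1/43,000 above.)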


A risk of 1/43,000 means that I'm more likely to lose data due to
human error than disk failure.  Still, I can put a small bit of
effort in to optimize recovery speed, and lower this number.
Managing human error is much harder.






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <l...@dachary.org
<mailto:l...@dachary.org>> wrote:

      Using percentages instead of numbers led me to calculation
errors. Here it is again using 1/100 instead of % for clarity ;-)

      Assuming that:

      * The pool is configured for three replicas (size = 3 which is
the default)
      * It takes one hour for Ceph to recover from the loss of a single
OSD
      * Any other disk has a 1/100,000 chance to fail within the hour
following the failure of the first disk (assuming the AFR
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk
is 8%, divided by the number of hours in a year: 0.08 / 8760
~= 1/100,000)
      * A given disk does not participate in more than 100 PGs
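
(For the AFR-based figure: 0.08 / 8760 ~= 9.1e-6, i.e. about 1 in 110,000,
rounded to 1/100,000.)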







_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
