Thanks, Sage! In the meantime I asked the same question in the #ceph IRC channel and Be_El gave me exactly the same answer, which helped. I also realized that http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ states: "You may change this grace period by adding an osd heartbeat grace setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime." But in reality you must add this option to the [global] section: setting it in the [osd] section influenced only the OSD daemons, not the monitors.
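For reference, here is a minimal sketch of the fragment that ended up working for me (the values are just the ones from my test cluster below; tune them for your own environment):

[global]
# must live in [global], not [osd], so the monitors apply the same grace
osd heartbeat interval = 3
osd heartbeat grace = 5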
Anyway, now IO resumes after only a 5-second freeze. Thanks for the help, guys!

Regarding Ceph failure detection: in a real environment, 20-30 seconds of freeze after a single storage node outage looks very expensive to me, even when we are talking about data consistency... 5 seconds is an acceptable threshold. But, Sage, can you briefly explain what the drawbacks of lowering the timeout are? If, for example, I have a stable 10GbE cluster network which is not likely to lag or drop out - is 5 seconds dangerous in any way? How could OSDs report false positives in that case? Thanks in advance :)

On Wed, May 13, 2015 at 7:05 PM, Sage Weil <[email protected]> wrote:
> On Wed, 13 May 2015, Vasiliy Angapov wrote:
> > Hi,
> >
> > Well, I've managed to find out that a correct stop of an osd causes no IO
> > downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> > tolerance, which Ceph is supposed to provide. However, "killall -9
> > ceph-osd" causes IO to stop for about 20 seconds.
> >
> > I've tried lowering some timeouts but without luck. Here is the related
> > part of my ceph.conf after lowering the timeout values:
> >
> > [global]
> > heartbeat interval = 5
> > mon osd down out interval = 90
> > mon pg warn max per osd = 2000
> > mon osd adjust heartbeat grace = false
> >
> > [client]
> > rbd cache = false
> >
> > [mon]
> > mon clock drift allowed = .200
> > mon osd min down reports = 1
> >
> > [osd]
> > osd heartbeat interval = 3
> > osd heartbeat grace = 5
> >
> > Can you help me to reduce IO downtime somehow? Because 20 seconds for
> > production is just horrible.
>
> You'll need to restart ceph-osd daemons for that change to take effect, or
>
> ceph tell osd.\* injectargs '--osd-heartbeat-grace 5
> --osd-heartbeat-interval 1'
>
> Just remember that this timeout is a tradeoff against false positives--be
> careful tuning it too low.
>
> Note that ext4 going ro after 5 seconds sounds like insanity to me. I've
> only seen this with older guest kernels, and iirc the problem is a
> 120s timeout with ide or something?
>
> Ceph is a CP system that trades availability for consistency--it will
> block IO as needed to ensure that it is handling reads or writes in a
> completely consistent manner. Even if you get the failure detection
> latency down, other recovery scenarios are likely to cross the magic 5s
> threshold at some point and cause the same problem. You need to fix your
> guests one way or another!
>
> sage
>
> > Regards, Vasily.
> >
> > On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov <[email protected]> wrote:
> > Thanks, Gregory!
> > My Ceph version is 0.94.1. What I'm trying to test is the worst
> > situation, when a node loses the network or becomes unresponsive. So
> > what I do is "killall -9 ceph-osd", then reboot.
> >
> > Well, I also tried doing a clean reboot several times (just a "reboot"
> > command), but I saw no difference - there is always an IO freeze for
> > about 30 seconds. Btw, I'm using Fedora 20 on all nodes.
> >
> > Ok, I will play with the timeouts more.
> >
> > Thanks again!
> >
> > On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum <[email protected]> wrote:
> > On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov <[email protected]> wrote:
> > > Hi, colleagues!
> > >
> > > I'm testing a simple Ceph cluster in order to use it in a production
> > > environment. I have 8 OSDs (1TB SATA drives) which are evenly
> > > distributed between 4 nodes.
> > >
> > > I've mapped an rbd image on the client node and started writing a lot
> > > of data to it.
> > > Then I just reboot one node and see what's happening. What happens is
> > > very sad: I get a write freeze for about 20-30 seconds, which is
> > > enough for the ext4 filesystem to switch to RO.
> > >
> > > I wonder if there is any way to minimize this lag? AFAIK, ext
> > > filesystems have a 5-second timeout before switching to RO. So is
> > > there any way to get that lag below 5 secs? I've tried lowering
> > > different osd timeouts, but it doesn't seem to help.
> > >
> > > How do you deal with such situations? 20 seconds of downtime is not
> > > tolerable in production.
> >
> > What version of Ceph are you running, and how are you rebooting it?
> > Any newish version that gets a clean reboot will notify the cluster
> > that it's shutting down, so you shouldn't witness blocked writes
> > really at all.
> >
> > If you're doing a reboot that involves just ending the daemon, you
> > will have to wait through the timeout period before the OSD gets
> > marked down, which defaults to 30 seconds. This is adjustable (look
> > for docs on the "osd heartbeat grace" config option), although if you
> > set it too low you'll need to change a bunch of other timeouts which I
> > don't know off-hand...
> > -Greg
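P.S. In case it is useful to anyone else following this thread: besides putting the settings in [global] and restarting, the values can also be changed and then checked at runtime via injectargs and the admin socket. A rough sketch (assuming default admin socket paths; osd.0 and mon.a are just example daemon ids):

# push the new grace to all OSDs at runtime (from Sage's mail above)
ceph tell osd.\* injectargs '--osd-heartbeat-grace 5'
# the monitors need the same value too; repeat for each mon
ceph tell mon.a injectargs '--osd-heartbeat-grace 5'
# verify the value a running daemon actually uses
ceph daemon osd.0 config get osd_heartbeat_grace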
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
