Thanks, Sage! In the meantime I asked the same question in the #ceph IRC channel and Be_El gave me exactly the same answer, which helped. I also realized that http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ states: "You may change this grace period by adding an osd heartbeat grace setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime." But in reality you must add this option to the [global] section: setting it in the [osd] section influenced only the OSD daemons, not the monitors.
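For reference, here is a minimal sketch of the fragment that ended up working for me (the values are just the ones from my test cluster below; tune them for your own environment):

[global]
# must live in [global], not [osd], so the monitors apply the same grace
osd heartbeat interval = 3
osd heartbeat grace = 5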
Anyway, now IO resumes after only a 5-second freeze. Thanks for the help, guys!

Regarding Ceph failure detection: in a real environment, 20-30 seconds of freeze after a single storage node outage looks very expensive to me, even when we are talking about data consistency... 5 seconds is an acceptable threshold. But, Sage, can you briefly explain what the drawbacks of lowering the timeout are? If, for example, I have a stable 10GbE cluster network which is not likely to lag or drop out - is 5 seconds dangerous in any way? How could OSDs report false positives in that case? Thanks in advance :)

On Wed, May 13, 2015 at 7:05 PM, Sage Weil <[email protected]> wrote:
> On Wed, 13 May 2015, Vasiliy Angapov wrote:
> > Hi,
> >
> > Well, I've managed to find out that a correct stop of an osd causes no IO
> > downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> > tolerance, which Ceph is supposed to provide. However, "killall -9
> > ceph-osd" causes IO to stop for about 20 seconds.
> >
> > I've tried lowering some timeouts but without luck. Here is the related
> > part of my ceph.conf after lowering the timeout values:
> >
> > [global]
> > heartbeat interval = 5
> > mon osd down out interval = 90
> > mon pg warn max per osd = 2000
> > mon osd adjust heartbeat grace = false
> >
> > [client]
> > rbd cache = false
> >
> > [mon]
> > mon clock drift allowed = .200
> > mon osd min down reports = 1
> >
> > [osd]
> > osd heartbeat interval = 3
> > osd heartbeat grace = 5
> >
> > Can you help me to reduce IO downtime somehow? Because 20 seconds for
> > production is just horrible.
>
> You'll need to restart ceph-osd daemons for that change to take effect, or
>
> ceph tell osd.\* injectargs '--osd-heartbeat-grace 5
> --osd-heartbeat-interval 1'
>
> Just remember that this timeout is a tradeoff against false positives--be
> careful tuning it too low.
>
> Note that ext4 going ro after 5 seconds sounds like insanity to me. I've
> only seen this with older guest kernels, and iirc the problem is a
> 120s timeout with ide or something?
>
> Ceph is a CP system that trades availability for consistency--it will
> block IO as needed to ensure that it is handling reads or writes in a
> completely consistent manner. Even if you get the failure detection
> latency down, other recovery scenarios are likely to cross the magic 5s
> threshold at some point and cause the same problem. You need to fix your
> guests one way or another!
>
> sage
>
> > Regards, Vasily.
> >
> > On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov <[email protected]> wrote:
> > Thanks, Gregory!
> > My Ceph version is 0.94.1. What I'm trying to test is the worst
> > situation, when a node loses the network or becomes unresponsive. So
> > what I do is "killall -9 ceph-osd", then reboot.
> >
> > Well, I also tried doing a clean reboot several times (just a "reboot"
> > command), but I saw no difference - there is always an IO freeze for
> > about 30 seconds. Btw, I'm using Fedora 20 on all nodes.
> >
> > Ok, I will play with the timeouts more.
> >
> > Thanks again!
> >
> > On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum <[email protected]> wrote:
> > On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov <[email protected]> wrote:
> > > Hi, colleagues!
> > >
> > > I'm testing a simple Ceph cluster in order to use it in a production
> > > environment. I have 8 OSDs (1TB SATA drives) which are evenly
> > > distributed between 4 nodes.
> > >
> > > I've mapped an rbd image on the client node and started writing a lot
> > > of data to it.
> > > Then I just reboot one node and see what's happening. What happens is
> > > very sad: I get a write freeze for about 20-30 seconds, which is
> > > enough for the ext4 filesystem to switch to RO.
> > >
> > > I wonder if there is any way to minimize this lag? AFAIK, ext
> > > filesystems have a 5-second timeout before switching to RO. So is
> > > there any way to get that lag below 5 secs? I've tried lowering
> > > different osd timeouts, but it doesn't seem to help.
> > >
> > > How do you deal with such situations? 20 seconds of downtime is not
> > > tolerable in production.
> >
> > What version of Ceph are you running, and how are you rebooting it?
> > Any newish version that gets a clean reboot will notify the cluster
> > that it's shutting down, so you shouldn't witness blocked writes
> > really at all.
> >
> > If you're doing a reboot that involves just ending the daemon, you
> > will have to wait through the timeout period before the OSD gets
> > marked down, which defaults to 30 seconds. This is adjustable (look
> > for docs on the "osd heartbeat grace" config option), although if you
> > set it too low you'll need to change a bunch of other timeouts which I
> > don't know off-hand...
> > -Greg
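P.S. In case it is useful to anyone else following this thread: besides putting the settings in [global] and restarting, the values can also be changed and then checked at runtime via injectargs and the admin socket. A rough sketch (assuming default admin socket paths; osd.0 and mon.a are just example daemon ids):

# push the new grace to all OSDs at runtime (from Sage's mail above)
ceph tell osd.\* injectargs '--osd-heartbeat-grace 5'
# the monitors need the same value too; repeat for each mon
ceph tell mon.a injectargs '--osd-heartbeat-grace 5'
# verify the value a running daemon actually uses
ceph daemon osd.0 config get osd_heartbeat_grace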
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
