Re: [ceph-users] How does monitor know OSD is dead?
I don't know if it's relevant here, but I saw similar behavior while implementing a Luminous->Nautilus automated upgrade test. When I used a single-node cluster with 4 OSDs, the Nautilus cluster would not function properly after the reboot. IIRC some OSDs were reported by "ceph -s" as up, even though they weren't running. I "fixed" the issue by adding a second node to the cluster. With two nodes (8 OSDs), the upgrade works fine. I will reproduce the issue again and open a bug report. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How does monitor know OSD is dead?
> I'm a bit confused about what happened here, though: that 600 second
> interval is only important if *every* OSD in the system is down. If you
> reboot the data center, why didn't *any* OSD daemons start? (And even if
> none did, having the ceph -s report all OSDs down instead of up isn't
> going to change anything except whether your pager is going off, right?)

I think you got lost in the thread of discussion. Enough OSDs for the cluster to be fully functional _did_ come back. But the cluster insisted on going to the dead ones (which it claimed all the while were up) for some I/O, even after running for 20 minutes that way, so the cluster was not functional.

The 600 second "mon osd down out interval" was a red herring.

It might be relevant that there was a grand total of three OSDs in the map. One came up; two did not. All objects were replicated across all three, with the hope that this sort of thing would not be fatal.

It's a Jewel system with that version's default of 1 for "mon osd min down reporters".

--
Bryan Henderson                                   San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
On Sun, 30 Jun 2019, Bryan Henderson wrote:
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand. The monitor didn't mark the OSD out because the
> monitor still considered the OSD up. No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

Yes--if it didn't, that's a bug. Any logs would be helpful.

I'm a bit confused about what happened here, though: that 600 second interval is only important if *every* OSD in the system is down. If you reboot the data center, why didn't *any* OSD daemons start? (And even if none did, having the ceph -s report all OSDs down instead of up isn't going to change anything except whether your pager is going off, right?)

sage

> And that's worst case. Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
On Wed, 3 July 2019 at 05:41, Bryan Henderson wrote:
> I may need to modify the above, though, now that I know how Ceph works,
> because I've seen storage server products that use Ceph inside. However, I'll
> bet the people who buy those are not aware that it's designed never to go down
> and if something breaks while the system is coming up, a repair action may be
> necessary before data is accessible again.

I think you would be hard pressed to find any storage cluster that could never get into a situation where repair is needed before coming up again, given all the random events that might occur while a non-small number of members suffer from sudden power outages.

I appreciate you had a bad experience, but don't believe that all others will gracefully, automagically and without issues handle any kind of disturbance when parts of the cluster come up at different times and have their member disks checked at different speeds before being allowed in again.

Not saying ceph is perfect, but work long enough in the storage sector and you'll see all kinds of odd surprises, and when total power loss happens, vendors are quite likely to shrug it off just like the replies you got here, in a "well don't get more outages" fashion.

--
May the most significant bit of your life be positive.
Re: [ceph-users] How does monitor know OSD is dead?
Here's some counter-evidence to the proposition that it's not pretty common for an entire cluster to go down because of a power failure.

Every data center class hardware storage server product I know of has dual power input and is also designed to tolerate losing power on both at once. If that happens, they don't lose data and when the power comes back, they come back up all by themselves and start serving storage again.

This design usually involves an expensive battery and maintenance procedure to make sure the battery gets replaced before it wears out (the battery is to keep the system up long enough to flush write buffers when the power fails), so users must think total power loss is a serious enough threat to pay for that.

I may need to modify the above, though, now that I know how Ceph works, because I've seen storage server products that use Ceph inside. However, I'll bet the people who buy those are not aware that it's designed never to go down and if something breaks while the system is coming up, a repair action may be necessary before data is accessible again.

--
Bryan Henderson                                   San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
I wouldn't say that's a pretty common failure. The flaw here perhaps is the design of the cluster and that it was relying on a single power source. Power sources fail. Dual power supplies connected to A and B power sources in the data centre are pretty standard.

On Tuesday, July 2, 2019, Bryan Henderson wrote:
>> Normally in the case of a restart then somebody who used to have a
>> connection to the OSD would still be running and flag it as dead. But
>> if *all* the daemons in the cluster lose their soft state, that can't
>> happen.
>
> OK, thanks. I guess that explains it. But that's a pretty serious design
> flaw, isn't it? What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).
>
> I wonder if I could close this gap with additional monitoring of my own. I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and mark those OSDs down. It would be cleaner, though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster. Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson                                   San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
> Normally in the case of a restart then somebody who used to have a
> connection to the OSD would still be running and flag it as dead. But
> if *all* the daemons in the cluster lose their soft state, that can't
> happen.

OK, thanks. I guess that explains it. But that's a pretty serious design flaw, isn't it? What I experienced is a pretty common failure mode: a power outage caused the entire cluster to die simultaneously, then when power came back, some OSDs didn't (the most common time for a server to fail is at startup).

I wonder if I could close this gap with additional monitoring of my own. I could have a cluster bringup protocol that detects OSD processes that aren't running after a while and marks those OSDs down. It would be cleaner, though, if I could just find out from the monitor what OSDs are in the map but not connected to the monitor cluster. Is that possible?

A related question: If I mark an OSD down administratively, does it stay down until I give a command to mark it back up, or will the monitor detect signs of life and declare it up again on its own?

--
Bryan Henderson                                   San Jose, California
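The bring-up check proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the OSD map has already been fetched as JSON (for example from "ceph osd dump --format json") and that it contains an "osds" list whose entries carry "osd" and "up" fields, which should be verified against the release in use. The set of OSDs actually found alive is expected to come from a site-specific probe that is not shown here. The sketch only builds the "ceph osd down" command lines; it does not run them.

```python
import json


def osds_to_mark_down(osd_dump, alive_ids):
    """Return ids of OSDs the map shows as up but the local probe did not find alive.

    osd_dump  -- parsed JSON of the OSD map (field names are an assumption)
    alive_ids -- set of OSD ids confirmed running by a site-specific probe
    """
    stale = []
    for entry in osd_dump.get("osds", []):
        if entry.get("up") == 1 and entry["osd"] not in alive_ids:
            stale.append(entry["osd"])
    return sorted(stale)


def down_commands(osd_ids):
    """Build one 'ceph osd down <id>' argument list per stale OSD (not executed here)."""
    return [["ceph", "osd", "down", str(osd_id)] for osd_id in osd_ids]


# Hypothetical map with three OSDs, all still flagged up after a cold start.
sample_dump = json.loads(
    '{"osds": [{"osd": 0, "up": 1, "in": 1},'
    ' {"osd": 1, "up": 1, "in": 1},'
    ' {"osd": 2, "up": 1, "in": 1}]}'
)
alive = {0}  # only osd.0 actually came back

stale = osds_to_mark_down(sample_dump, alive)
commands = down_commands(stale)
```

Each entry in commands could be handed to subprocess.run once the probe is trusted. Whether an OSD marked down this way stays down is the open question asked above, so the sketch makes no assumption about it.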
Re: [ceph-users] How does monitor know OSD is dead?
On Sat, Jun 29, 2019 at 8:13 PM Bryan Henderson wrote:
>
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand. The monitor didn't mark the OSD out because the
> monitor still considered the OSD up. No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

It sounds like you had the whole cluster off and turned it on, and those servers didn't come up. This is why.

The methods of detecting an OSD as down are
1) OSD heartbeat peers. That's as Robert describes (by default).
2) When an OSD is connected to a monitor, they heartbeat each other at very long intervals and the monitor flags the OSD down if it disappears and isn't connected to a different monitor.

In your case, the OSD wasn't connected to any monitor, and it hadn't set up any heartbeat peers.

Normally, in the case of a restart, somebody who used to have a connection to the OSD would still be running and flag it as dead. But if *all* the daemons in the cluster lose their soft state, that can't happen.
-Greg

> And that's worst case. Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California
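The two detection paths described above, and why neither fires after a cold start, can be illustrated with a toy model. This is a simplification of the explanation in this message, not Ceph's actual code: the Osd class, its fields and has_detection_path are all hypothetical names, and "soft state" is reduced to a set of heartbeat peers plus a flag for a prior monitor session.

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    osd_id: int
    running: bool
    # Soft state: ids of OSDs this daemon had set up as heartbeat peers.
    heartbeat_peers: set = field(default_factory=set)
    # Soft state: whether a monitor held a session with this OSD that it can see drop.
    had_monitor_session: bool = False


def has_detection_path(target, cluster):
    """True if, in this toy model, something is positioned to notice that target is dead."""
    # Path 1: a still-running OSD had the target as a heartbeat peer.
    watched_by_peer = any(
        other.running and target.osd_id in other.heartbeat_peers
        for other in cluster
        if other.osd_id != target.osd_id
    )
    # Path 2: a monitor had a session with the target and can see it disappear.
    return watched_by_peer or target.had_monitor_session


# Ordinary failure: osd.1 dies while its peers and its monitor session still exist.
warm = [Osd(0, True, {1, 2}), Osd(1, False, {0, 2}, True), Osd(2, True, {0, 1})]

# Cold start: everything lost its soft state, and osd.1 and osd.2 never booted.
cold = [Osd(0, True), Osd(1, False), Osd(2, False)]

warm_detectable = has_detection_path(warm[1], warm)
cold_detectable = [has_detection_path(o, cold) for o in cold if not o.running]
```

In the warm case a path exists; in the cold case neither dead OSD has one, which matches the behavior reported in this thread.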
Re: [ceph-users] How does monitor know OSD is dead?
On Sat, Jun 29, 2019 at 8:12 PM Bryan Henderson wrote:
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand. The monitor didn't mark the OSD out because the
> monitor still considered the OSD up. No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").
>
> And that's worst case. Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California

So, if an OSD (osd.1) misses three heartbeats (6 seconds each) from another OSD (osd.2), then the OSD sending the heartbeats (osd.2) tells the monitor that the OSD (osd.1) is down. It takes reports from two OSDs in different CRUSH subtrees (host by default) for the monitor to mark the OSD down. The OSD is supposed to report to the monitor each time there is a change or every 120 seconds. If 600 seconds pass with the monitor not hearing from the OSD, it will mark it down. It 'should' only take 20 seconds to detect a downed OSD.

Usually, the problem is that an OSD gets too busy and misses heartbeats, so other OSDs wrongly mark it down.

If 'nodown' is set, then the monitor will not mark OSDs down.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
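The decision rule described above can be written down as a small sketch. It uses only the figures quoted in this message (6 second heartbeats, roughly 20 seconds to detect, two reporters from different hosts) and is a model of that description, not the monitor's real implementation. The constant and function names are made up for illustration.

```python
HEARTBEAT_INTERVAL_S = 6   # heartbeat spacing quoted above
HEARTBEAT_GRACE_S = 20     # "should only take 20 seconds", used here as the grace period
MIN_DOWN_REPORTERS = 2     # reporters required, each from a different subtree (host)


def peer_reports_failure(seconds_since_last_reply, grace=HEARTBEAT_GRACE_S):
    """A heartbeat peer reports the target once no reply has arrived within the grace period."""
    return seconds_since_last_reply > grace


def monitor_marks_down(reporter_hosts, min_reporters=MIN_DOWN_REPORTERS, nodown=False):
    """Decide whether enough distinct hosts have reported the target as failed.

    reporter_hosts -- host name of each OSD that filed a failure report
    nodown         -- models the cluster-wide 'nodown' flag
    """
    if nodown:
        return False
    return len(set(reporter_hosts)) >= min_reporters


three_missed = peer_reports_failure(3 * HEARTBEAT_INTERVAL_S)  # 18 s: still inside grace
four_missed = peer_reports_failure(4 * HEARTBEAT_INTERVAL_S)   # 24 s: reported

same_host = monitor_marks_down(["hostA", "hostA"])
two_hosts = monitor_marks_down(["hostA", "hostB"])
flag_set = monitor_marks_down(["hostA", "hostB"], nodown=True)
```

Under this model, reports that all come from one host never reach a threshold of two, whatever the number of reporting OSDs. Passing min_reporters=1 models a cluster configured to accept a single reporter.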
Re: [ceph-users] How does monitor know OSD is dead?
> I'm not sure why the monitor did not mark it _out_ after 600 seconds
> (default)

Well, that part I understand. The monitor didn't mark the OSD out because the monitor still considered the OSD up. No reason to mark an up OSD out.

I think the monitor should have marked the OSD down upon not hearing from it for 15 minutes ("mon osd report interval"), then out 10 minutes after that ("mon osd down out interval").

And that's worst case. Though details of how OSDs watch each other are vague, I suspect an existing OSD was supposed to detect the dead OSDs and report that to the monitor, which would believe it within about a minute and mark the OSDs down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down reporters", "osd reporter subtree level").

--
Bryan Henderson                                   San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
On Sat, Jun 29, 2019 at 6:51 PM Bryan Henderson wrote:
> > The reason it is so long is that you don't want to move data
> > around unnecessarily if the osd is just being rebooted/restarted.
>
> I think you're confusing down with out. When an OSD is out, Ceph
> backfills. While it is merely down, Ceph hopes that it will come back.
> But it will direct I/O to other redundant OSDs instead of a down one.
>
> Going down leads to going out, and I believe that is the 600 seconds you
> mention - the time between when the OSD is marked down and when Ceph marks it
> out (if all other conditions permit).
>
> There is a pretty good explanation of how OSDs get marked down, which is
> pretty complicated, at
>
> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>
> It just doesn't seem to match the implementation.
>
> --
> Bryan Henderson                                   San Jose, California

I mixed up my terminology. The first line should have read: "I'm not sure why the monitor did not mark it _out_ after 600 seconds (default)". The "down timeout" I mention is the "mon osd down out interval". The rest of what I wrote is correct. Just to make sure I don't confuse anyone else.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] How does monitor know OSD is dead?
> The reason it is so long is that you don't want to move data
> around unnecessarily if the osd is just being rebooted/restarted.

I think you're confusing down with out. When an OSD is out, Ceph backfills. While it is merely down, Ceph hopes that it will come back. But it will direct I/O to other redundant OSDs instead of a down one.

Going down leads to going out, and I believe that is the 600 seconds you mention - the time between when the OSD is marked down and when Ceph marks it out (if all other conditions permit).

There is a pretty good explanation of how OSDs get marked down, which is pretty complicated, at

http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/

It just doesn't seem to match the implementation.

--
Bryan Henderson                                   San Jose, California
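The down/out distinction drawn above amounts to a two-step timeline, sketched below. The 600 second figure is the "mon osd down out interval" default mentioned in this thread. The function name and the state labels are illustrative only, and the "if all other conditions permit" caveat is deliberately not modelled.

```python
DOWN_OUT_INTERVAL_S = 600  # "mon osd down out interval" default quoted in this thread


def osd_state(marked_down_at, now, down_out_interval=DOWN_OUT_INTERVAL_S):
    """Return an illustrative state label for an OSD at time `now` (seconds).

    marked_down_at -- time the monitor marked the OSD down, or None if it never did
    """
    if marked_down_at is None:
        # Never marked down: clients keep sending I/O to it, dead or not.
        return "up+in"
    if now - marked_down_at >= down_out_interval:
        # Marked out: data is backfilled onto other OSDs.
        return "down+out"
    # Merely down: I/O is redirected to replicas, but no data is moved yet.
    return "down+in"


never_marked = osd_state(None, now=1200)
just_down = osd_state(marked_down_at=100, now=400)
timed_out = osd_state(marked_down_at=100, now=700)
```

The first case is the one reported in this thread: the down-out timer never starts because the OSD is never marked down in the first place.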
Re: [ceph-users] How does monitor know OSD is dead?
The thing I've seen a lot is where an OSD would get marked down because of a failed drive, then it would add itself right back again.

On Fri, Jun 28, 2019 at 9:12 AM Robert LeBlanc wrote:
> I'm not sure why the monitor did not mark it down after 600 seconds
> (default). The reason it is so long is that you don't want to move data
> around unnecessarily if the osd is just being rebooted/restarted. Usually,
> you will still have min_size OSDs available for all PGs that will allow IO
> to continue. Then when the down timeout expires it will start backfilling
> and recovering the PGs that were affected. Double check that size !=
> min_size for your pools.
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Thu, Jun 27, 2019 at 5:26 PM Bryan Henderson wrote:
>
>> What does it take for a monitor to consider an OSD down which has been dead as
>> a doornail since the cluster started?
>>
>> A couple of times, I have seen 'ceph status' report an OSD was up, when it was
>> quite dead. Recently, a couple of OSDs were on machines that failed to boot
>> up after a power failure. The rest of the Ceph cluster came up, though, and
>> reported all OSDs up and in. I/Os stalled, probably because they were waiting
>> for the dead OSDs to come back.
>>
>> I waited 15 minutes, because the manual says if the monitor doesn't hear a
>> heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
>> it marks it down. But it didn't. I did "osd down" commands for the dead OSDs
>> and the status changed to down and I/O started working.
>>
>> And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
>> wait that long before falling back to a redundant OSD?
>>
>> --
>> Bryan Henderson                                   San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
I'm not sure why the monitor did not mark it down after 600 seconds (default). The reason it is so long is that you don't want to move data around unnecessarily if the osd is just being rebooted/restarted. Usually, you will still have min_size OSDs available for all PGs that will allow IO to continue. Then when the down timeout expires it will start backfilling and recovering the PGs that were affected. Double check that size != min_size for your pools.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Jun 27, 2019 at 5:26 PM Bryan Henderson wrote:
> What does it take for a monitor to consider an OSD down which has been dead as
> a doornail since the cluster started?
>
> A couple of times, I have seen 'ceph status' report an OSD was up, when it was
> quite dead. Recently, a couple of OSDs were on machines that failed to boot
> up after a power failure. The rest of the Ceph cluster came up, though, and
> reported all OSDs up and in. I/Os stalled, probably because they were waiting
> for the dead OSDs to come back.
>
> I waited 15 minutes, because the manual says if the monitor doesn't hear a
> heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
> it marks it down. But it didn't. I did "osd down" commands for the dead OSDs
> and the status changed to down and I/O started working.
>
> And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
> wait that long before falling back to a redundant OSD?
>
> --
> Bryan Henderson                                   San Jose, California
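The "size != min_size" check suggested above could be sketched as follows, assuming the pool list is already available as parsed JSON (for example from "ceph osd pool ls detail --format json"). The field names "pool_name", "size" and "min_size" are an assumption about that output and should be checked against the release in use. The sample pools are invented for illustration.

```python
def pools_without_headroom(pools):
    """Return names of pools where losing a single replica would block I/O.

    pools -- list of dicts with 'pool_name', 'size' and 'min_size' keys
             (field names are an assumption, see the note above)
    """
    return [p["pool_name"] for p in pools if p["min_size"] >= p["size"]]


# Hypothetical pool list.
sample_pools = [
    {"pool_name": "rbd", "size": 3, "min_size": 2},       # tolerates one OSD down
    {"pool_name": "scratch", "size": 2, "min_size": 2},   # blocks as soon as one is down
]

risky = pools_without_headroom(sample_pools)
```

A pool that shows up in risky stops serving I/O for any PG with one member down, regardless of how quickly that OSD is marked down.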
[ceph-users] How does monitor know OSD is dead?
What does it take for a monitor to consider an OSD down which has been dead as a doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD was up, when it was quite dead. Recently, a couple of OSDs were on machines that failed to boot up after a power failure. The rest of the Ceph cluster came up, though, and reported all OSDs up and in. I/Os stalled, probably because they were waiting for the dead OSDs to come back.

I waited 15 minutes, because the manual says if the monitor doesn't hear a heartbeat from an OSD in that long (default value of mon_osd_report_timeout), it marks it down. But it didn't. I did "osd down" commands for the dead OSDs and the status changed to down and I/O started working.

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to wait that long before falling back to a redundant OSD?

--
Bryan Henderson                                   San Jose, California