On Sun, 30 Jun 2019, Bryan Henderson wrote:
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand. The monitor didn't mark the OSD out because the
> monitor still considered the OSD up. No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").
Yes--if it didn't, that's a bug. Any logs would be helpful.
I'm a bit confused about what happened here, though: that 600 second
interval is only important if *every* OSD in the system is down. If you
reboot the data center, why didn't *any* OSD daemons start? (And even if
none did, having ceph -s report all OSDs down instead of up isn't
going to change anything except whether your pager is going off, right?)
sage
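The worst-case timeline described in the quoted message can be sketched as simple arithmetic. This is a toy model, not Ceph code. It assumes the monitor marks an OSD down after 15 minutes with no report, then marks it out after it has been down for another 10 minutes (600 s). Both durations are the values quoted in the thread, not values read from a live cluster.

```python
from datetime import timedelta
from typing import Tuple

# Assumed durations, taken from the thread rather than from a running cluster.
REPORT_TIMEOUT = timedelta(minutes=15)     # silence before the monitor marks the OSD down
DOWN_OUT_INTERVAL = timedelta(minutes=10)  # time spent down before it is marked out (600 s)


def worst_case_timeline(last_report: timedelta) -> Tuple[timedelta, timedelta]:
    """Return (marked_down_at, marked_out_at) on the same clock as last_report.

    Hypothetical model: no peer OSD reports the failure, so the monitor only
    acts on its own timeouts.
    """
    down_at = last_report + REPORT_TIMEOUT
    out_at = down_at + DOWN_OUT_INTERVAL
    return down_at, out_at


down, out = worst_case_timeline(timedelta(0))
print(f"down after {down}, out after {out}")  # down after 0:15:00, out after 0:25:00
```

Under these assumptions, an OSD that stops reporting is marked out about 25 minutes after its last report, if the monitor has nothing else to go on.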
>
> And that's worst case. Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min
> down reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
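The peer-reporting path mentioned above (surviving OSDs report a dead peer, and the monitor accepts the failure once enough independent reporters agree) can be illustrated with a toy rule. This is a hypothetical simplification, not Ceph's actual monitor logic. The `FailureReport` type, the `should_mark_down` helper, and the thresholds (3 reports, 2 distinct subtrees) are illustrative assumptions, not Ceph defaults. "Subtree" stands in for the reporter's CRUSH bucket at the chosen "osd reporter subtree level", for example its host.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FailureReport:
    reporter: int  # id of the OSD that stopped getting heartbeat replies (hypothetical)
    subtree: str   # reporter's bucket at the assumed subtree level, e.g. its host


def should_mark_down(reports: Iterable[FailureReport],
                     min_down_reports: int = 3,
                     min_down_reporters: int = 2) -> bool:
    """Toy peer-report rule (hypothetical thresholds).

    The failure is accepted only when there are enough reports in total AND
    they come from enough distinct subtrees, so a single flaky host cannot
    get its peers marked down on its own.
    """
    reports = list(reports)
    distinct_subtrees = {r.subtree for r in reports}
    return (len(reports) >= min_down_reports
            and len(distinct_subtrees) >= min_down_reporters)


same_host = [FailureReport(1, "host-a"), FailureReport(2, "host-a"), FailureReport(3, "host-a")]
two_hosts = [FailureReport(1, "host-a"), FailureReport(2, "host-a"), FailureReport(4, "host-b")]
print(should_mark_down(same_host))  # False: every report comes from one host
print(should_mark_down(two_hosts))  # True: three reports from two hosts
```

Counting distinct subtrees rather than raw reporters is what lets a rule like this tolerate one misbehaving host. It also shows why, with every OSD dead, no reports arrive at all and only the monitor's own timeouts apply.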