Re: [ceph-users] How does monitor know OSD is dead?

2019-07-10 Thread Nathan Cutler
I don't know if it's relevant here, but I saw similar behavior while implementing
a Luminous->Nautilus automated upgrade test. When I used a single-node cluster
with 4 OSDs, the Nautilus cluster would not function properly after the reboot.
IIRC some OSDs were reported by "ceph -s" as up, even though they weren't running.

I "fixed" the issue by adding a second node to the cluster. With two nodes (8
OSDs), the upgrade works fine.

I will reproduce the issue again and open a bug report.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Bryan Henderson
> I'm a bit confused about what happened here, though: that 600 second 
> interval is only important if *every* OSD in the system is down.  If you 
> reboot the data center, why didn't *any* OSD daemons start?  (And even if 
> none did, having the ceph -s report all OSDs down instead of up isn't 
> going to change anything except whether your pager is going off, right?)

I think you got lost in the thread of discussion.  Enough OSDs for the cluster
to be fully functional _did_ come back.  But the cluster insisted on going to
the dead ones (which it claimed all the while were up) for some I/O, even
after running for 20 minutes that way, so the cluster was not functional.  The
600 second "mon osd down out interval" was a red herring.

It might be relevant that there was a grand total of three OSDs in the map.
One came up; two did not.  All objects were replicated across all three, with
the hope that this sort of thing would not be fatal.  It's a Jewel system with
that version's default of 1 for "mon osd min down reporters".

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Sage Weil
On Sun, 30 Jun 2019, Bryan Henderson wrote:
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
> 
> Well, that part I understand.  The monitor didn't mark the OSD out because the
> monitor still considered the OSD up.  No reason to mark an up OSD out.
> 
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

Yes--if it didn't, that's a bug.  Any logs would be helpful.

I'm a bit confused about what happened here, though: that 600 second 
interval is only important if *every* OSD in the system is down.  If you 
reboot the data center, why didn't *any* OSD daemons start?  (And even if 
none did, having the ceph -s report all OSDs down instead of up isn't 
going to change anything except whether your pager is going off, right?)

sage

>
> And that's worst case.  Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
> 
> -- 
> Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Janne Johansson
On Wed, 3 July 2019 at 05:41, Bryan Henderson wrote:

> I may need to modify the above, though, now that I know how Ceph works,
> because I've seen storage server products that use Ceph inside.  However, I'll
> bet the people who buy those are not aware that it's designed never to go down
> and if something breaks while the system is coming up, a repair action may be
> necessary before data is accessible again.
>

I think you would be hard pressed to find any storage cluster that could never
get into a situation where repair is needed before coming up again, given all
the random events that might occur while a non-small number of members suffer
from sudden power outages.

I appreciate you had a bad experience, but don't believe that all others will
gracefully and without issues automagically handle any kind of disturbance when
parts of the cluster come up at different times and have their member disks
checked at different speeds before being allowed in again.

Not saying ceph is perfect, but work long enough in the storage sector and
you'll see all kinds of odd surprises, and when total power loss happens,
vendors are quite likely to shrug it off just like the replies you got here, in
a "well don't get more outages" fashion.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-02 Thread Bryan Henderson
Here's some counter-evidence to the proposition that it's not pretty common
for an entire cluster to go down because of a power failure.

Every data center class hardware storage server product I know of has dual
power input and is also designed to tolerate losing power on both at once.  If
that happens, they don't lose data and when the power comes back, they come
back up all by themselves and start serving storage again.

This design usually involves an expensive battery and maintenance procedure to
make sure the battery gets replaced before it wears out (the battery is to
keep the system up long enough to flush write buffers when the power fails),
so users must think total power loss is a serious enough threat to pay for
that.

I may need to modify the above, though, now that I know how Ceph works,
because I've seen storage server products that use Ceph inside.  However, I'll
bet the people who buy those are not aware that it's designed never to go down
and if something breaks while the system is coming up, a repair action may be
necessary before data is accessible again.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-02 Thread Brian :
I wouldn't say that's a pretty common failure. The flaw here perhaps is the
design of the cluster and that it was relying on a single power source.
Power sources fail. Dual power supplies connected to A and B power sources in
the data centre are pretty standard.

On Tuesday, July 2, 2019, Bryan Henderson  wrote:
>> Normally in the case of a restart then somebody who used to have a
>> connection to the OSD would still be running and flag it as dead. But
>> if *all* the daemons in the cluster lose their soft state, that can't
>> happen.
>
> OK, thanks.  I guess that explains it.  But that's a pretty serious design
> flaw, isn't it?  What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).
>
> I wonder if I could close this gap with additional monitoring of my own.  I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and mark those OSDs down.  It would be cleaner, though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster.  Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-01 Thread Bryan Henderson
> Normally in the case of a restart then somebody who used to have a
> connection to the OSD would still be running and flag it as dead. But
> if *all* the daemons in the cluster lose their soft state, that can't
> happen.

OK, thanks.  I guess that explains it.  But that's a pretty serious design
flaw, isn't it?  What I experienced is a pretty common failure mode: a power
outage caused the entire cluster to die simultaneously, then when power came
back, some OSDs didn't (the most common time for a server to fail is at
startup).

I wonder if I could close this gap with additional monitoring of my own.  I
could have a cluster bringup protocol that detects OSD processes that aren't
running after a while and mark those OSDs down.  It would be cleaner, though,
if I could just find out from the monitor what OSDs are in the map but not
connected to the monitor cluster.  Is that possible?

A related question: If I mark an OSD down administratively, does it stay down
until I give a command to mark it back up, or will the monitor detect signs of
life and declare it up again on its own?
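
A minimal sketch of the home-grown check described above, assuming the OSD map
has been fetched with "ceph osd dump --format json" and parsed into a dict whose
"osds" entries carry an "osd" id and an "up" flag (0 or 1). The function name
and the running_ids input (the OSD ids your own bring-up tooling has confirmed
are alive) are hypothetical, not part of Ceph.

```python
def stale_up_osds(osd_dump, running_ids):
    """Return ids the map reports as up that are not in running_ids."""
    stale = []
    for entry in osd_dump.get("osds", []):
        # "up" == 1 means the monitor still believes this OSD is alive.
        if entry.get("up") == 1 and entry["osd"] not in running_ids:
            stale.append(entry["osd"])
    return sorted(stale)


# Example shaped like the three-OSD cluster in this thread: only osd.0 restarted.
dump = {"osds": [{"osd": 0, "up": 1, "in": 1},
                 {"osd": 1, "up": 1, "in": 1},
                 {"osd": 2, "up": 1, "in": 1}]}
print(stale_up_osds(dump, running_ids={0}))  # [1, 2]
```

Each id returned is a candidate for the manual "osd down" command that restored
I/O in the original report.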

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-01 Thread Gregory Farnum
On Sat, Jun 29, 2019 at 8:13 PM Bryan Henderson  wrote:
>
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand.  The monitor didn't mark the OSD out because the
> monitor still considered the OSD up.  No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

It sounds like you had the whole cluster off and turned it on, and
those servers didn't come up. This is why.

The methods of detecting an OSD as down are
1) OSD heartbeat peers. That's as Robert describes (by default).
2) When an OSD is connected to a monitor, they heartbeat each other at
very long intervals and the monitor flags the OSD down if it
disappears and isn't connected to a different monitor.

In your case, the OSD wasn't connected to any monitor, and it hadn't
set up any heartbeat peers.

Normally in the case of a restart then somebody who used to have a
connection to the OSD would still be running and flag it as dead. But
if *all* the daemons in the cluster lose their soft state, that can't
happen.
-Greg
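
The cold-start gap described above can be illustrated with a toy model. This is
only a sketch of the reasoning, not Ceph's implementation; the class and field
names are hypothetical. An OSD can be flagged down only if some surviving party
(a heartbeat peer or a monitor session) remembers having been connected to it.

```python
from dataclasses import dataclass, field


@dataclass
class OsdView:
    """What the running cluster remembers about one OSD (hypothetical model)."""
    osd_id: int
    # Live OSDs that had set up heartbeats with this OSD.
    heartbeat_peers: set = field(default_factory=set)
    # Whether a monitor has had a session with this OSD since the monitors started.
    had_monitor_session: bool = False


def can_be_detected_down(view):
    """True if either detection path has a witness that can notice the OSD is gone."""
    return bool(view.heartbeat_peers) or view.had_monitor_session


# Normal restart: other OSDs were still running and remember their peer.
normal_restart = OsdView(osd_id=1, heartbeat_peers={0, 2})
# Whole-cluster power loss: the OSD never booted, so nobody has state about it.
never_booted = OsdView(osd_id=2)

print(can_be_detected_down(normal_restart))  # True
print(can_be_detected_down(never_booted))    # False
```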

>
> And that's worst case.  Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Robert LeBlanc
On Sat, Jun 29, 2019 at 8:12 PM Bryan Henderson 
wrote:

> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand.  The monitor didn't mark the OSD out because the
> monitor still considered the OSD up.  No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").
>
> And that's worst case.  Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson   San Jose, California
>

So, if an OSD (osd.1) fails to answer three heartbeats (6 seconds apart) from
another OSD (osd.2), then the OSD sending the heartbeats (osd.2) tells the
monitor that osd.1 is down. It takes reports from two OSDs in different CRUSH
subtrees (host by default) for the monitor to mark the OSD down. The OSD is
supposed to report to the monitor each time there is a change, or every 120
seconds. If 600 seconds pass with the monitor not hearing from the OSD, it will
mark it down. It 'should' only take 20 seconds to detect a downed OSD.
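
The reporter rule above can be sketched as a small function. This is a
simplified illustration, not the monitor's actual logic; the function name and
the (osd id, host) report tuples are assumptions made for the example. The
min_reporters parameter stands in for "mon osd min down reporters".

```python
def should_mark_down(failure_reports, min_reporters=2):
    """failure_reports: iterable of (reporter_osd_id, reporter_host) pairs.

    Reports are counted per distinct host, mirroring a reporter subtree
    level of "host": several OSDs on one host count as a single reporter.
    """
    distinct_hosts = {host for _osd, host in failure_reports}
    return len(distinct_hosts) >= min_reporters


# Two reporters on the same host are not enough when two subtrees are required.
print(should_mark_down([(2, "hostA"), (3, "hostA")]))  # False
# Reporters on two different hosts are.
print(should_mark_down([(2, "hostA"), (5, "hostB")]))  # True
# With a minimum of 1 (the Jewel default mentioned in this thread), one report suffices.
print(should_mark_down([(2, "hostA")], min_reporters=1))  # True
```

Note that with no reports at all the function returns False for any positive
minimum, which is the situation when no surviving OSD ever peered with the dead
one.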

Usually, the problem is that an OSD gets too busy and misses heartbeats so
other OSDs wrongly mark them down.

If 'nodown' is set, then the monitor will not mark OSDs down.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Bryan Henderson
> I'm not sure why the monitor did not mark it _out_ after 600 seconds
> (default)

Well, that part I understand.  The monitor didn't mark the OSD out because the
monitor still considered the OSD up.  No reason to mark an up OSD out.

I think the monitor should have marked the OSD down upon not hearing from it
for 15 minutes ("mon osd report interval"), then out 10 minutes after that
("mon osd down out interval").

And that's worst case.  Though details of how OSDs watch each other are vague,
I suspect an existing OSD was supposed to detect the dead OSDs and report that
to the monitor, which would believe it within about a minute and mark the OSDs
down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
reporters", "osd reporter subtree level").
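
The worst-case arithmetic above, written out. The two durations are the values
quoted in this thread (15 minutes for mon_osd_report_timeout, 600 seconds for
"mon osd down out interval"); they are assumptions here and should be checked
against your own cluster's configuration. The function name is hypothetical.

```python
from datetime import timedelta

# Values as quoted in this thread, not read from a live cluster.
MON_OSD_REPORT_TIMEOUT = timedelta(minutes=15)
MON_OSD_DOWN_OUT_INTERVAL = timedelta(seconds=600)


def worst_case_until_out(report_timeout, down_out_interval):
    """Time from an OSD going silent until it is marked out, if no peer reports it."""
    return report_timeout + down_out_interval


total = worst_case_until_out(MON_OSD_REPORT_TIMEOUT, MON_OSD_DOWN_OUT_INTERVAL)
print(total.total_seconds() / 60, "minutes")
```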

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Robert LeBlanc
On Sat, Jun 29, 2019 at 6:51 PM Bryan Henderson 
wrote:

> > The reason it is so long is that you don't want to move data
> > around unnecessarily if the osd is just being rebooted/restarted.
>
> I think you're confusing down with out.  When an OSD is out, Ceph
> backfills.  While it is merely down, Ceph hopes that it will come back.
> But it will direct I/O to other redundant OSDs instead of a down one.
>
> Going down leads to going out, and I believe that is the 600 seconds you
> mention - the time between when the OSD is marked down and when Ceph marks it
> out (if all other conditions permit).
>
> There is a pretty good explanation of how OSDs get marked down, which is
> pretty complicated, at
>
>   http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>
> It just doesn't seem to match the implementation.
>
> --
> Bryan Henderson   San Jose, California
>

I mixed up my terminology. The first line should have read:
"I'm not sure why the monitor did not mark it _out_ after 600 seconds
(default)"

The "down timeout" I mention is the "mon osd down out interval".

The rest of what I wrote is correct. Just to make sure I don't confuse
anyone else.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Bryan Henderson
> The reason it is so long is that you don't want to move data
> around unnecessarily if the osd is just being rebooted/restarted. 

I think you're confusing down with out.  When an OSD is out, Ceph
backfills.  While it is merely down, Ceph hopes that it will come back.
But it will direct I/O to other redundant OSDs instead of a down one.

Going down leads to going out, and I believe that is the 600 seconds you
mention - the time between when the OSD is marked down and when Ceph marks it
out (if all other conditions permit).
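
The down/out distinction can be modeled with two independent flags. This is a
toy sketch under the assumptions stated in this message (a 600 second interval
between down and out); the class and function names are hypothetical and the
model ignores the "other conditions" mentioned above.

```python
from dataclasses import dataclass


@dataclass
class OsdState:
    up: bool = True          # False once the OSD is marked down
    in_cluster: bool = True  # False once the OSD is marked out
    seconds_down: int = 0


def tick(state, elapsed, down_out_interval=600):
    """Advance time for a down OSD; mark it out once the interval has passed."""
    if not state.up:
        state.seconds_down += elapsed
        if state.seconds_down >= down_out_interval:
            state.in_cluster = False
    return state


def io_redirected(state):
    """I/O goes to other replicas as soon as the OSD is down."""
    return not state.up


def backfill_started(state):
    """Data is only moved once the OSD is out."""
    return not state.in_cluster


osd = OsdState(up=False)
tick(osd, 300)
print(io_redirected(osd), backfill_started(osd))  # True False
tick(osd, 300)
print(io_redirected(osd), backfill_started(osd))  # True True
```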

There is a pretty good explanation of how OSDs get marked down, which is
pretty complicated, at

  http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/

It just doesn't seem to match the implementation.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-28 Thread solarflow99
The thing I've seen a lot is where an OSD would get marked down because of
a failed drive, then it would add itself right back again.


On Fri, Jun 28, 2019 at 9:12 AM Robert LeBlanc  wrote:

> I'm not sure why the monitor did not mark it down after 600 seconds
> (default). The reason it is so long is that you don't want to move data
> around unnecessarily if the osd is just being rebooted/restarted. Usually,
> you will still have min_size OSDs available for all PGs that will allow IO
> to continue. Then when the down timeout expires it will start backfilling
> and recovering the PGs that were affected. Double check that size !=
> min_size for your pools.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Jun 27, 2019 at 5:26 PM Bryan Henderson 
> wrote:
>
>> What does it take for a monitor to consider an OSD down which has been dead as
>> a doornail since the cluster started?
>>
>> A couple of times, I have seen 'ceph status' report an OSD was up, when it was
>> quite dead.  Recently, a couple of OSDs were on machines that failed to boot
>> up after a power failure.  The rest of the Ceph cluster came up, though, and
>> reported all OSDs up and in.  I/Os stalled, probably because they were waiting
>> for the dead OSDs to come back.
>>
>> I waited 15 minutes, because the manual says if the monitor doesn't hear a
>> heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
>> it marks it down.  But it didn't.  I did "osd down" commands for the dead OSDs
>> and the status changed to down and I/O started working.
>>
>> And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
>> wait that long before falling back to a redundant OSD?
>>
>> --
>> Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-28 Thread Robert LeBlanc
I'm not sure why the monitor did not mark it down after 600 seconds
(default). The reason it is so long is that you don't want to move data
around unnecessarily if the osd is just being rebooted/restarted. Usually,
you will still have min_size OSDs available for all PGs that will allow IO
to continue. Then when the down timeout expires it will start backfilling
and recovering the PGs that were affected. Double check that size !=
min_size for your pools.
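
A minimal sketch of the "size != min_size" check suggested above. It assumes
the pool settings have already been collected into a list of dicts with
"pool_name", "size" and "min_size" keys (for example from JSON output of a pool
listing command; the exact field names on your release are an assumption).
The function name and the sample pools are made up for illustration.

```python
def pools_without_headroom(pools):
    """Return names of pools where losing a single replica blocks I/O.

    When size == min_size, any one OSD going down leaves a PG below
    min_size, so I/O to that PG stops until the OSD returns or recovery runs.
    """
    return [p["pool_name"] for p in pools if p["size"] == p["min_size"]]


pools = [
    {"pool_name": "rbd", "size": 3, "min_size": 2},      # one replica may be lost
    {"pool_name": "scratch", "size": 2, "min_size": 2},  # no headroom
]
print(pools_without_headroom(pools))  # ['scratch']
```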

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Jun 27, 2019 at 5:26 PM Bryan Henderson 
wrote:

> What does it take for a monitor to consider an OSD down which has been dead as
> a doornail since the cluster started?
>
> A couple of times, I have seen 'ceph status' report an OSD was up, when it was
> quite dead.  Recently, a couple of OSDs were on machines that failed to boot
> up after a power failure.  The rest of the Ceph cluster came up, though, and
> reported all OSDs up and in.  I/Os stalled, probably because they were waiting
> for the dead OSDs to come back.
>
> I waited 15 minutes, because the manual says if the monitor doesn't hear a
> heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
> it marks it down.  But it didn't.  I did "osd down" commands for the dead OSDs
> and the status changed to down and I/O started working.
>
> And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
> wait that long before falling back to a redundant OSD?
>
> --
> Bryan Henderson   San Jose, California


[ceph-users] How does monitor know OSD is dead?

2019-06-27 Thread Bryan Henderson
What does it take for a monitor to consider an OSD down which has been dead as
a doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD was up, when it was
quite dead.  Recently, a couple of OSDs were on machines that failed to boot
up after a power failure.  The rest of the Ceph cluster came up, though, and
reported all OSDs up and in.  I/Os stalled, probably because they were waiting
for the dead OSDs to come back.

I waited 15 minutes, because the manual says if the monitor doesn't hear a
heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
it marks it down.  But it didn't.  I did "osd down" commands for the dead OSDs
and the status changed to down and I/O started working.

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
wait that long before falling back to a redundant OSD?

-- 
Bryan Henderson   San Jose, California