Hi,

Another possibility: the OSD that is "refusing connections" has crashed.
There is a window of time during which connection attempts fail with
"connection refused": between the OSD dying, upstart/systemd restarting it,
and the new OSD process getting far enough into its init process to start
listening for new connections.
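
To make that window concrete, here is a rough Python sketch (not anything
Ceph ships) that distinguishes "connection refused" from other failures and
measures how long a port stays closed. The function names, the polling
interval and the timeouts are my own assumptions; you would point it at the
address and port of the OSD in question.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Make one TCP connect attempt and classify the result.

    Returns "open" if something accepted the connection, "refused" if the
    host answered but nothing is listening on the port, and "unreachable"
    for timeouts and any other socket error.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Covers timeouts, no route to host, and similar failures.
        return "unreachable"


def wait_until_listening(host, port, interval=0.1, max_wait=30.0):
    """Poll until the port accepts connections.

    Returns the number of seconds waited, or None if max_wait elapsed
    without a successful connect.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if probe(host, port) == "open":
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

A "refused" result means the host is up but the process is not listening
yet, which is what you would expect during a restart. An "unreachable"
result points more towards a network problem.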

While your symptoms look the same, there's no guarantee that you're
suffering from the same problem, but we're currently seeing ceph-osd
v12.2.4 segfault sporadically. Either for config reasons or because the
signal handler fails to do its thing, we don't get the typical "oops, I
crashed" report in the OSD log. journald/systemd did capture stdout, which
mentions the crash, and the kernel log has a message saying that ceph-osd
segfaulted (http://tracker.ceph.com/issues/23352).
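
If you want to check your own nodes for the same thing, a small Python
filter along these lines can pick segfault messages out of kernel log text
(for example the output of dmesg or journalctl -k). It is only a sketch: the
function name is made up, and it assumes the usual Linux
"name[pid]: segfault at ..." message format. The name the kernel prints is
that of the faulting thread, which may or may not literally be "ceph-osd",
so the name filter is optional.

```python
import re

# Assumed kernel message shape: "<comm>[<pid>]: segfault at <addr> ip ... sp ... error ..."
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
)


def find_segfaults(lines, comm=None):
    """Return (comm, pid, line) tuples for kernel segfault messages.

    lines: any iterable of kernel log lines.
    comm:  optional name to filter on; None keeps every segfault message.
    """
    hits = []
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match is None:
            continue
        if comm is not None and match.group("comm") != comm:
            continue
        hits.append((match.group("comm"), int(match.group("pid")), line.rstrip("\n")))
    return hits
```

Comparing the timestamps of any hits against the times the OSD was marked
down should tell you quickly whether you are looking at a crash or at a
network issue.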

-KJ

On Wed, Mar 28, 2018 at 10:50 AM, Andre Goree <[email protected]> wrote:

> On 2018/03/28 1:39 pm, Subhachandra Chandra wrote:
>
> We have seen similar behavior when there are network issues. AFAIK, the
>> OSD is being reported down by an OSD that cannot reach it. But either
>> another OSD that can reach it or the heartbeat between the OSD and the
>> monitor declares it up. The OSD "boot" message does not seem to indicate an
>> actual OSD restart.
>>
>> Subhachandra
>>
>> On Wed, Mar 28, 2018 at 10:30 AM, Andre Goree <[email protected]> wrote:
>>
>> Hello,
>>>
>>> I've recently had a minor issue come up where random individual OSDs are
>>> failed due to a connection refused on another OSD.  I say minor, because
>>> it's not a node-wide issue, and appears to be random nodes -- and besides
>>> that, the OSD comes up in less than a second, as if the OSD is sent a
>>> "restart," or something.
>>>
>>> ...
>
>
> Great!  Thank you!  Yes I found it funny that it "restarted" so quickly,
> and from my readings I remember that it takes more than a single OSD
> heartbeat failing to produce an _actual_ failure, so as to prevent false
> positives.  Thanks for the insight!
>
>
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email     - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen <[email protected]>
SRE, Medallia Inc