We have seen similar behavior when there are network issues. As far as I
know, the OSD is reported down by a peer OSD that cannot reach it, but it
is then declared up again either by another OSD that can reach it or by
the heartbeat between the OSD and the monitor. The OSD "boot" message does
not seem to indicate an actual OSD restart.
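
If this keeps happening, it may be worth looking at the monitors'
failure-report thresholds. A minimal sketch of the relevant ceph.conf
tunables (the values shown are the Luminous defaults, not a
recommendation -- verify them against your version before changing
anything):

```ini
[mon]
# Number of distinct reporters required before the monitors mark an OSD
# down (Luminous default: 2)
mon osd min down reporters = 2
# CRUSH subtree level used to count distinct reporters (default: host)
mon osd reporter subtree level = host

[osd]
# Seconds without a heartbeat reply before a peer reports an OSD as
# failed (default: 20)
osd heartbeat grace = 20
```

Raising the reporter count or the grace period makes the cluster slower
to flag an OSD down during transient network blips, at the cost of
slower detection of real failures.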

Subhachandra

On Wed, Mar 28, 2018 at 10:30 AM, Andre Goree <[email protected]> wrote:

> Hello,
>
> I've recently had a minor issue come up where random individual OSDs are
> marked failed due to a connection refused reported by another OSD.  I say
> minor because it's not a node-wide issue and it appears to hit random
> nodes -- and besides that, the OSD comes back up within less than a
> second, as if the OSD were sent a "restart" or something.
>
> On the MON I see this (notice the entire time is ~0.77s):
>
> 2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed
> (root=default,host=osd-05) (connection refused reported by osd.55)
> 2018-03-26 22:39:36.935445 mon.mon-01 [WRN] Health check failed: 1 osds
> down (OSD_DOWN)
> 2018-03-26 22:39:39.959496 mon.mon-01 [WRN] Health check failed: Reduced
> data availability: 6 pgs peering (PG_AVAILABILITY)
> 2018-03-26 22:39:41.969578 mon.mon-01 [WRN] Health check failed: Degraded
> data redundancy: 6528/1742880 objects degraded (0.375%), 46 pgs degraded
> (PG_DEGRADED)
> 2018-03-26 22:39:43.978429 mon.mon-01 [INF] Health check cleared:
> PG_AVAILABILITY (was: Reduced data availability: 6 pgs peering)
> 2018-03-26 22:39:48.913112 mon.mon-01 [WRN] Health check update: Degraded
> data redundancy: 19411/1742880 objects degraded (1.114%), 136 pgs degraded
> (PG_DEGRADED)
> 2018-03-26 22:40:06.288138 mon.mon-01 [INF] Health check cleared: OSD_DOWN
> (was: 1 osds down)
> 2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77
> 172.16.238.18:6818/57264 boot
> 2018-03-26 22:40:07.298884 mon.mon-01 [WRN] Health check update: Degraded
> data redundancy: 17109/1742880 objects degraded (0.982%), 120 pgs degraded
> (PG_DEGRADED)
> 2018-03-26 22:40:13.330362 mon.mon-01 [INF] Health check cleared:
> PG_DEGRADED (was: Degraded data redundancy: 5605/1742880 objects degraded
> (0.322%), 39 pgs degraded)
> 2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy
>
>
> On host osd-05 (which hosts osd.77) there appears to be normal heartbeat
> traffic before the OSD spontaneously restarts (truncated for brevity):
>
> 2018-03-26 22:33:00.773897 7efcaf20f700  0 -- 172.16.239.18:6818/7788 >>
> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 73 vs existing csq=73
> existing_state=STATE_STANDBY
> 2018-03-26 22:33:00.774124 7efcaf20f700  0 -- 172.16.239.18:6818/7788 >>
> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 74 vs existing csq=73
> existing_state=STATE_STANDBY
> 2018-03-26 22:39:56.832556 7f80ceea8e00  0 set uid:gid to 64045:64045
> (ceph:ceph)
> 2018-03-26 22:39:56.832576 7f80ceea8e00  0 ceph version 12.2.4
> (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process
> (unknown), pid 57264
> 2018-03-26 22:39:56.849487 7f80ceea8e00  0 pidfile_write: ignore empty
> --pid-file
> 2018-03-26 22:39:56.859045 7f80ceea8e00  0 load: jerasure load: lrc load:
> isa
> 2018-03-26 22:39:56.859122 7f80ceea8e00  1 bdev create path
> /var/lib/ceph/osd/ceph-77/block type kernel
> 2018-03-26 22:39:56.859135 7f80ceea8e00  1 bdev(0x5568ccae4d80
> /var/lib/ceph/osd/ceph-77/block) open path /var/lib/ceph/osd/ceph-77/block
> 2018-03-26 22:39:56.859398 7f80ceea8e00  1 bdev(0x5568ccae4d80
> /var/lib/ceph/osd/ceph-77/block) open size 8001559724032 (0x74702200000,
> 7452 GB) block_size 4096 (4096 B) rotational
> 2018-03-26 22:39:56.859711 7f80ceea8e00  1 
> bluestore(/var/lib/ceph/osd/ceph-77)
> _set_cache_sizes max 0.5 < ratio 0.99
> 2018-03-26 22:39:56.859733 7f80ceea8e00  1 
> bluestore(/var/lib/ceph/osd/ceph-77)
> _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
> 2018-03-26 22:39:56.859738 7f80ceea8e00  1 bdev(0x5568ccae4d80
> /var/lib/ceph/osd/ceph-77/block) close
> 2018-03-26 22:39:57.132071 7f80ceea8e00  1 
> bluestore(/var/lib/ceph/osd/ceph-77)
> _mount path /var/lib/ceph/osd/ceph-77
> 2018-03-26 22:39:57.132534 7f80ceea8e00  1 bdev create path
> /var/lib/ceph/osd/ceph-77/block type kernel
> ...truncated...
>
>
> The same on host osd-07 (which hosts osd.55, the one that reported the
> connection refused): you'll see normal heartbeat traffic, followed by
> something interesting, before normal heartbeat traffic returns:
>
> 2018-03-26 22:33:20.598576 7f495c52e700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.21:6816/8668 conn(0x5619e70c9800 :6810
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 42 vs existing csq=41
> existing_state=STATE_STANDBY
> 2018-03-26 22:39:36.974204 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 57 vs existing csq=57
> existing_state=STATE_STANDBY
> 2018-03-26 22:39:36.974457 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.17:6805/6991 conn(0x5619f476b000 :6810
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 58 vs existing csq=57
> existing_state=STATE_STANDBY
> 2018-03-26 22:39:36.978674 7f494851d700  1 osd.55 pg_epoch: 21688
> pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655
> n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654)
> [55,39] r=0 lpr=21688 pi=[21654,21688)/1 luod=0'0 crt=21687'7063 lcod
> 21687'7062 mlcod 0'0 active] start_peering_interval up [55,39,77] ->
> [55,39], acting [55,39,77] -> [55,39], acting_primary 55 -> 55, up_primary
> 55 -> 55, role 0 -> 0, features acting 2305244844532236283 upacting
> 2305244844532236283
> 2018-03-26 22:39:36.979216 7f494851d700  1 osd.55 pg_epoch: 21688
> pg[1.714( v 21687'7063 (21590'5500,21687'7063] local-lis/les=21654/21655
> n=163 ec=480/480 lis/c 21654/21654 les/c/f 21655/21655/0 21688/21688/21654)
> [55,39] r=0 lpr=21688 pi=[21654,21688)/1 crt=21687'7063 lcod 21687'7062
> mlcod 0'0 unknown] state<Start>: transitioning to Primary
> 2018-03-26 22:40:06.596889 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 113 vs existing csq=113
> existing_state=STATE_STANDBY
> 2018-03-26 22:40:06.597867 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.19:6810/7272 conn(0x5619e6f75000 :6810
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 114 vs existing csq=113
> existing_state=STATE_STANDBY
>
>
> Any ideas as to what might cause this?  Given that there's less than a
> second between the failure and the cluster returning to health, I'm
> wondering if I should even be concerned at all.  This was not happening a
> few days ago, however, and we did have networking issues before that,
> which we've since fixed.
>
> Thanks in advance.
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email     - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
