The issue is now fixed. It turns out I had unnecessary iptables rules on the node; I flushed and deleted them all, restarted the OSDs, and they are now running normally. Rough commands and a few diagnostic notes are below for the archives.
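For the record, this is roughly what was run on the affected node. It is a sketch rather than the exact shell history: it assumes the default iptables policies should end up as ACCEPT, and the OSD restart command depends on your init system (systemd unit names shown; on sysvinit it would be something like "/etc/init.d/ceph restart osd.32").

---
# Reset iptables to an empty, accept-everything state. WARNING: this removes
# every firewall rule on the host; only do it if that is really what you want.
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F                     # flush all rules in the filter table
iptables -X                     # delete user-defined chains
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X

# Restart the OSDs on this node (osd.32 through osd.39 live on avatar0-ceph4);
# adjust the command for your init system.
for id in 32 33 34 35 36 37 38 39; do
    systemctl restart ceph-osd@$id
done
---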
Regards,
Vladimir FS Blando
Cloud Operations Manager
www.morphlabs.com

On Fri, Apr 7, 2017 at 1:17 PM, Vlad Blando <[email protected]> wrote:
> Hi Brian,
>
> Will check on that also.
>
> On Mon, Apr 3, 2017 at 4:53 PM, Brian : <[email protected]> wrote:
>> Hi Vlad
>>
>> Is there anything in syslog on any of the hosts when this happens?
>>
>> Had a similar issue with a single node recently and it was caused by a
>> firmware issue on a single ssd. That would cause the controller to reset
>> and osds on that node would flap as a result.
>>
>> flashed the SSD with new FW and issue hasn't come up since.
>>
>> Brian
>>
>> On Mon, Apr 3, 2017 at 8:03 AM, Vlad Blando <[email protected]> wrote:
>>> Most of the time random and most of the time 1 at a time, but I also see
>>> 2-3 that are down at the same time.
>>>
>>> The network seems fine, the bond seems fine, I just don't know where to
>>> look anymore. My other option is to redo the server but that's the last
>>> resort, as much as possible I don't want to.
>>>
>>> On Mon, Apr 3, 2017 at 2:24 PM, Maxime Guyot <[email protected]> wrote:
>>>> Hi Vlad,
>>>>
>>>> I am curious if those OSDs are flapping all at once? If a single host
>>>> is affected I would consider the network connectivity (bottlenecks and
>>>> misconfigured bonds can generate strange situations), storage controller
>>>> and firmware.
>>>>
>>>> Cheers,
>>>> Maxime
>>>>
>>>> From: ceph-users <[email protected]> on behalf of Vlad Blando <[email protected]>
>>>> Date: Sunday 2 April 2017 16:28
>>>> To: ceph-users <[email protected]>
>>>> Subject: [ceph-users] Flapping OSDs
>>>>
>>>> Hi,
>>>>
>>>> One of my ceph nodes have flapping OSDs, network between nodes are
>>>> fine, it's on a 10GBase-T network. I don't see anything wrong with the
>>>> network, but these OSDs are going up/down.
>>>>
>>>> [root@avatar0-ceph4 ~]# ceph osd tree
>>>> # id    weight   type name              up/down  reweight
>>>> -1      174.7    root default
>>>> -2      29.12        host avatar0-ceph2
>>>> 16      3.64             osd.16         up       1
>>>> 17      3.64             osd.17         up       1
>>>> 18      3.64             osd.18         up       1
>>>> 19      3.64             osd.19         up       1
>>>> 20      3.64             osd.20         up       1
>>>> 21      3.64             osd.21         up       1
>>>> 22      3.64             osd.22         up       1
>>>> 23      3.64             osd.23         up       1
>>>> -3      29.12        host avatar0-ceph0
>>>> 0       3.64             osd.0          up       1
>>>> 1       3.64             osd.1          up       1
>>>> 2       3.64             osd.2          up       1
>>>> 3       3.64             osd.3          up       1
>>>> 4       3.64             osd.4          up       1
>>>> 5       3.64             osd.5          up       1
>>>> 6       3.64             osd.6          up       1
>>>> 7       3.64             osd.7          up       1
>>>> -4      29.12        host avatar0-ceph1
>>>> 8       3.64             osd.8          up       1
>>>> 9       3.64             osd.9          up       1
>>>> 10      3.64             osd.10         up       1
>>>> 11      3.64             osd.11         up       1
>>>> 12      3.64             osd.12         up       1
>>>> 13      3.64             osd.13         up       1
>>>> 14      3.64             osd.14         up       1
>>>> 15      3.64             osd.15         up       1
>>>> -5      29.12        host avatar0-ceph3
>>>> 24      3.64             osd.24         up       1
>>>> 25      3.64             osd.25         up       1
>>>> 26      3.64             osd.26         up       1
>>>> 27      3.64             osd.27         up       1
>>>> 28      3.64             osd.28         up       1
>>>> 29      3.64             osd.29         up       1
>>>> 30      3.64             osd.30         up       1
>>>> 31      3.64             osd.31         up       1
>>>> -6      29.12        host avatar0-ceph4
>>>> 32      3.64             osd.32         up       1
>>>> 33      3.64             osd.33         up       1
>>>> 34      3.64             osd.34         up       1
>>>> 35      3.64             osd.35         up       1
>>>> 36      3.64             osd.36         up       1
>>>> 37      3.64             osd.37         up       1
>>>> 38      3.64             osd.38         up       1
>>>> 39      3.64             osd.39         up       1
>>>> -7      29.12        host avatar0-ceph5
>>>> 40      3.64             osd.40         up       1
>>>> 41      3.64             osd.41         up       1
>>>> 42      3.64             osd.42         up       1
>>>> 43      3.64             osd.43         up       1
>>>> 44      3.64             osd.44         up       1
>>>> 45      3.64             osd.45         up       1
>>>> 46      3.64             osd.46         up       1
>>>> 47      3.64             osd.47         up       1
>>>> [root@avatar0-ceph4 ~]#
>>>>
>>>> Here is my ceph.conf
>>>> ---
>>>> [root@avatar0-ceph4 ~]# cat /etc/ceph/ceph.conf
>>>> [global]
>>>> fsid = 2f0d1928-2ee5-4731-a259-64c0dc16110a
>>>> mon_initial_members = avatar0-ceph0, avatar0-ceph1, avatar0-ceph2
>>>> mon_host = 172.40.40.100,172.40.40.101,172.40.40.102
>>>> auth_cluster_required = cephx
>>>> auth_service_required = cephx
>>>> auth_client_required = cephx
>>>> filestore_xattr_use_omap = true
>>>> osd_pool_default_size = 2
>>>> osd_pool_default_min_size = 1
>>>> cluster_network = 172.50.50.0/24
>>>> public_network = 172.40.40.0/24
>>>> max_open_files = 131072
>>>> mon_clock_drift_allowed = .15
>>>> mon_clock_drift_warn_backoff = 30
>>>> mon_osd_down_out_interval = 300
>>>> mon_osd_report_timeout = 300
>>>> mon_osd_min_down_reporters = 3
>>>>
>>>> [osd]
>>>> filestore_merge_threshold = 40
>>>> filestore_split_multiple = 8
>>>> osd_op_threads = 8
>>>> osd_max_backfills = 1
>>>> osd_recovery_op_priority = 1
>>>> osd_recovery_max_active = 1
>>>>
>>>> [client]
>>>> rbd_cache = true
>>>> rbd_cache_writethrough_until_flush = true
>>>> ---
>>>>
>>>> Here's the log snippet on osd.34
>>>> ---
>>>> 2017-04-02 22:26:10.371282 7f1064eab700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.101:6808/190698 pipe(0x156a1b80 sd=124 :46536 s=2 pgs=966 cs=1 l=0 c=0x13ae19c0).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.371360 7f106ed5c700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.104:6822/181109 pipe(0x1018c2c0 sd=75 :34196 s=2 pgs=1022 cs=1 l=0 c=0x1098fa20).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.371393 7f1067ad2700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.103:6806/118813 pipe(0x166b5c80 sd=34 :34156 s=2 pgs=1041 cs=1 l=0 c=0x10c4bfa0).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.371739 7f107137b700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.103:6815/121286 pipe(0xd7eb8c0 sd=192 :43966 s=2 pgs=1042 cs=1 l=0 c=0x12eb7e40).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.375016 7f1068ff7700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.104:6812/183825 pipe(0x10f70c00 sd=61 :34442 s=2 pgs=1025 cs=1 l=0 c=0xb10be40).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.375221 7f107157d700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.102:6806/66401 pipe(0x10ba78c0 sd=191 :46312 s=2 pgs=988 cs=1 l=0 c=0x6f8c420).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:11.041747 7f10885ab700  0 log_channel(default) log [WRN] : map e85725 wrongly marked me down
>>>> 2017-04-02 22:26:16.427858 7f1062892700  0 -- 172.50.50.105:6807/118130897 >> 172.50.50.105:6811/116133701 pipe(0xd4cb180 sd=257 :6807 s=0 pgs=0 cs=0 l=0 c=0x13ae07e0).accept connect_seq 0 vs existing 0 state connecting
>>>> 2017-04-02 22:26:16.427897 7f1062993700  0 -- 172.50.50.105:6807/118130897 >> 172.50.50.105:6811/116133701 pipe(0xfb50680 sd=76 :56374 s=4 pgs=0 cs=0 l=0 c=0x174255a0).connect got RESETSESSION but no longer connecting
>>>> ---
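Since Brian and Maxime asked about syslog and the network: the symptom that finally pointed at the firewall was exactly what the log above shows, peers seeing connection faults on the cluster network while the OSD itself logs "wrongly marked me down". For anyone who hits the same thing, here is a rough checklist; it is a sketch only, the log path assumes the default naming, the peer address and port are just copied from the ".fault" lines above as an example, and nc flags vary between netcat flavours.

---
# Which OSDs are flapping, and do they share a host? (run on a monitor node)
ceph -w            # watch the cluster log for "osd.N ... marked down" / boot messages
ceph osd tree      # check whether all affected OSDs sit on the same host

# Does the OSD think it was wrongly marked down? (run on the OSD node)
grep "wrongly marked me down" /var/log/ceph/ceph-osd.34.log

# Is the firewall interfering? OSDs listen on ports 6800-7300 by default,
# so rules touching that range or the cluster network (172.50.50.0/24 here)
# are suspect. Watch the packet counters on DROP/REJECT rules while the
# flapping is happening.
iptables -L -n -v --line-numbers
iptables-save      # full rule dump, easy to diff between nodes

# Spot-check that a peer OSD port is reachable over the cluster network
nc -vz 172.50.50.101 6808
---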
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
