The issue is now fixed. It turns out I had unnecessary iptables rules; I
flushed and deleted them all, restarted the OSDs, and now they are running
normally.
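
For anyone who hits the same thing, the cleanup was roughly along these
lines (a sketch only; review your rules before flushing anything, and the
restart command depends on how your OSDs are managed):

# inspect the current rules first
iptables -L -n -v
# flush all rules and delete user-defined chains
iptables -F
iptables -X
# restart the OSDs on the affected node (sysvinit shown; on systemd hosts
# use "systemctl restart ceph-osd.target" instead)
/etc/init.d/ceph restart osd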



Regards,

Vladimir FS Blando
Cloud Operations Manager
www.morphlabs.com

On Fri, Apr 7, 2017 at 1:17 PM, Vlad Blando <[email protected]> wrote:

> Hi Brian,
>
> Will check on that also.
>
>
>
> On Mon, Apr 3, 2017 at 4:53 PM, Brian : <[email protected]> wrote:
>
>> Hi Vlad
>>
>> Is there anything in syslog on any of the hosts when this happens?
>>
>> I had a similar issue with a single node recently, and it was caused by a
>> firmware bug on a single SSD. It would cause the controller to reset, and
>> the OSDs on that node would flap as a result.
>>
>> I flashed the SSD with new firmware and the issue hasn't come up since.
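>>
>> Something like this will usually surface controller resets in the logs
>> (an example only; log paths vary by distro):
>>
>> dmesg | grep -iE 'reset|abort|i/o error'
>> grep -iE 'reset|abort|i/o error' /var/log/messages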
>>
>> Brian
>>
>>
>> On Mon, Apr 3, 2017 at 8:03 AM, Vlad Blando <[email protected]>
>> wrote:
>>
>>> Most of the time it's random and one OSD at a time, but I also see 2-3
>>> that are down at the same time.
>>>
>>> The network seems fine and the bond seems fine; I just don't know where
>>> to look anymore. My other option is to rebuild the server, but that's a
>>> last resort I'd rather avoid.
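>>>
>>> For what it's worth, the flaps can be watched live with the standard
>>> cluster log tail:
>>>
>>> ceph -w              # watch the cluster log for the down/boot events
>>> ceph health detail   # lists which OSDs are currently down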
>>>
>>>
>>>
>>> On Mon, Apr 3, 2017 at 2:24 PM, Maxime Guyot <[email protected]>
>>> wrote:
>>>
>>>> Hi Vlad,
>>>>
>>>>
>>>>
>>>> I am curious: are those OSDs flapping all at once? If a single host is
>>>> affected, I would look at network connectivity (bottlenecks and
>>>> misconfigured bonds can generate strange situations), the storage
>>>> controller, and the firmware.
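>>>>
>>>> For example (assuming the bond is named bond0; adjust interface names
>>>> to your setup):
>>>>
>>>> cat /proc/net/bonding/bond0    # per-slave link state and failure counts
>>>> ip -s link show bond0          # RX/TX error and drop counters
>>>> ethtool em1 | grep -E 'Speed|Duplex|Link detected'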
>>>>
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Maxime
>>>>
>>>>
>>>>
>>>> *From: *ceph-users <[email protected]> on behalf of
>>>> Vlad Blando <[email protected]>
>>>> *Date: *Sunday 2 April 2017 16:28
>>>> *To: *ceph-users <[email protected]>
>>>> *Subject: *[ceph-users] Flapping OSDs
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> One of my Ceph nodes has flapping OSDs. The network between nodes is
>>>> fine (it's a 10GBase-T network) and I don't see anything wrong with it,
>>>> but these OSDs keep going up/down.
>>>>
>>>>
>>>>
>>>> [root@avatar0-ceph4 ~]# ceph osd tree
>>>> # id    weight  type name       up/down reweight
>>>> -1      174.7   root default
>>>> -2      29.12           host avatar0-ceph2
>>>> 16      3.64                    osd.16  up      1
>>>> 17      3.64                    osd.17  up      1
>>>> 18      3.64                    osd.18  up      1
>>>> 19      3.64                    osd.19  up      1
>>>> 20      3.64                    osd.20  up      1
>>>> 21      3.64                    osd.21  up      1
>>>> 22      3.64                    osd.22  up      1
>>>> 23      3.64                    osd.23  up      1
>>>> -3      29.12           host avatar0-ceph0
>>>> 0       3.64                    osd.0   up      1
>>>> 1       3.64                    osd.1   up      1
>>>> 2       3.64                    osd.2   up      1
>>>> 3       3.64                    osd.3   up      1
>>>> 4       3.64                    osd.4   up      1
>>>> 5       3.64                    osd.5   up      1
>>>> 6       3.64                    osd.6   up      1
>>>> 7       3.64                    osd.7   up      1
>>>> -4      29.12           host avatar0-ceph1
>>>> 8       3.64                    osd.8   up      1
>>>> 9       3.64                    osd.9   up      1
>>>> 10      3.64                    osd.10  up      1
>>>> 11      3.64                    osd.11  up      1
>>>> 12      3.64                    osd.12  up      1
>>>> 13      3.64                    osd.13  up      1
>>>> 14      3.64                    osd.14  up      1
>>>> 15      3.64                    osd.15  up      1
>>>> -5      29.12           host avatar0-ceph3
>>>> 24      3.64                    osd.24  up      1
>>>> 25      3.64                    osd.25  up      1
>>>> 26      3.64                    osd.26  up      1
>>>> 27      3.64                    osd.27  up      1
>>>> 28      3.64                    osd.28  up      1
>>>> 29      3.64                    osd.29  up      1
>>>> 30      3.64                    osd.30  up      1
>>>> 31      3.64                    osd.31  up      1
>>>> -6      29.12           host avatar0-ceph4
>>>> 32      3.64                    osd.32  up      1
>>>> 33      3.64                    osd.33  up      1
>>>> 34      3.64                    osd.34  up      1
>>>> 35      3.64                    osd.35  up      1
>>>> 36      3.64                    osd.36  up      1
>>>> 37      3.64                    osd.37  up      1
>>>> 38      3.64                    osd.38  up      1
>>>> 39      3.64                    osd.39  up      1
>>>> -7      29.12           host avatar0-ceph5
>>>> 40      3.64                    osd.40  up      1
>>>> 41      3.64                    osd.41  up      1
>>>> 42      3.64                    osd.42  up      1
>>>> 43      3.64                    osd.43  up      1
>>>> 44      3.64                    osd.44  up      1
>>>> 45      3.64                    osd.45  up      1
>>>> 46      3.64                    osd.46  up      1
>>>> 47      3.64                    osd.47  up      1
>>>> [root@avatar0-ceph4 ~]#
>>>>
>>>> Here is my ceph.conf:
>>>> ---
>>>> [root@avatar0-ceph4 ~]# cat /etc/ceph/ceph.conf
>>>> [global]
>>>> fsid = 2f0d1928-2ee5-4731-a259-64c0dc16110a
>>>> mon_initial_members = avatar0-ceph0, avatar0-ceph1, avatar0-ceph2
>>>> mon_host = 172.40.40.100,172.40.40.101,172.40.40.102
>>>> auth_cluster_required = cephx
>>>> auth_service_required = cephx
>>>> auth_client_required = cephx
>>>> filestore_xattr_use_omap = true
>>>> osd_pool_default_size = 2
>>>> osd_pool_default_min_size = 1
>>>> cluster_network = 172.50.50.0/24
>>>> public_network = 172.40.40.0/24
>>>> max_open_files = 131072
>>>> mon_clock_drift_allowed = .15
>>>> mon_clock_drift_warn_backoff = 30
>>>> mon_osd_down_out_interval = 300
>>>> mon_osd_report_timeout = 300
>>>> mon_osd_min_down_reporters = 3
>>>>
>>>> [osd]
>>>> filestore_merge_threshold = 40
>>>> filestore_split_multiple = 8
>>>> osd_op_threads = 8
>>>> osd_max_backfills = 1
>>>> osd_recovery_op_priority = 1
>>>> osd_recovery_max_active = 1
>>>>
>>>> [client]
>>>> rbd_cache = true
>>>> rbd_cache_writethrough_until_flush = true
>>>> ---
>>>>
>>>> Here's the log snippet on osd.34:
>>>> ---
>>>> 2017-04-02 22:26:10.371282 7f1064eab700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.101:6808/190698 pipe(0x156a1b80 sd=124 :46536 s=2 pgs=966 cs=1 l=0 c=0x13ae19c0).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.371360 7f106ed5c700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.104:6822/181109 pipe(0x1018c2c0 sd=75 :34196 s=2 pgs=1022 cs=1 l=0 c=0x1098fa20).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.371393 7f1067ad2700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.103:6806/118813 pipe(0x166b5c80 sd=34 :34156 s=2 pgs=1041 cs=1 l=0 c=0x10c4bfa0).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.371739 7f107137b700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.103:6815/121286 pipe(0xd7eb8c0 sd=192 :43966 s=2 pgs=1042 cs=1 l=0 c=0x12eb7e40).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.375016 7f1068ff7700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.104:6812/183825 pipe(0x10f70c00 sd=61 :34442 s=2 pgs=1025 cs=1 l=0 c=0xb10be40).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:10.375221 7f107157d700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.102:6806/66401 pipe(0x10ba78c0 sd=191 :46312 s=2 pgs=988 cs=1 l=0 c=0x6f8c420).fault with nothing to send, going to standby
>>>> 2017-04-02 22:26:11.041747 7f10885ab700  0 log_channel(default) log [WRN] : map e85725 wrongly marked me down
>>>> 2017-04-02 22:26:16.427858 7f1062892700  0 -- 172.50.50.105:6807/118130897 >> 172.50.50.105:6811/116133701 pipe(0xd4cb180 sd=257 :6807 s=0 pgs=0 cs=0 l=0 c=0x13ae07e0).accept connect_seq 0 vs existing 0 state connecting
>>>> 2017-04-02 22:26:16.427897 7f1062993700  0 -- 172.50.50.105:6807/118130897 >> 172.50.50.105:6811/116133701 pipe(0xfb50680 sd=76 :56374 s=4 pgs=0 cs=0 l=0 c=0x174255a0).connect got RESETSESSION but no longer connecting
>>>> ---
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
