Re: [ceph-users] OSDs wrongly marked down
Could also be that your hardware is underpowered for the I/O you have. Try
checking your resource load during peak workload, with recovery and scrubbing
going on at the same time.

On 2017-12-20 17:03, David Turner wrote:
> When I have OSDs wrongly marked down it's usually down to
> filestore_split_multiple and filestore_merge_threshold, in a thing I call PG
> subfolder splitting. This is no longer a factor with bluestore, but as
> you're running hammer, it's worth a look.
> http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/
>
> [...]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
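To see whether the disks or CPUs are saturating while recovery and scrubbing overlap with client I/O, something along these lines can be run on an OSD host (a sketch; the scrub tunables and their values are illustrative, not recommendations, and osd_scrub_sleep may not exist on every release):

```shell
# Per-disk utilization, queue depth and await, refreshed every 5s
# (iostat comes from the sysstat package on CentOS 7)
iostat -x 5

# One-shot look at CPU and memory pressure on the OSD host
top -b -n 1 | head -20

# If scrubbing is competing with client I/O, it can be throttled at
# runtime on all OSDs (illustrative values):
ceph tell osd.* injectargs '--osd_max_scrubs 1'
```

If %util on the OSD disks sits near 100 while the flapping happens, underpowered hardware (or too many PGs per OSD) becomes a likely suspect.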
Re: [ceph-users] OSDs wrongly marked down
When I have OSDs wrongly marked down it's usually down to
filestore_split_multiple and filestore_merge_threshold, in a thing I call PG
subfolder splitting. This is no longer a factor with bluestore, but as you're
running hammer, it's worth a look.
http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/

On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo wrote:
> [...]
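For reference, the split point David is describing is commonly computed as 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects per subfolder. With the hammer-era defaults (split_multiple = 2, merge_threshold = 10 — assumptions here, check your own config) that works out as:

```shell
# Assumed hammer defaults; verify with:
#   ceph daemon osd.<id> config show | grep filestore
split_multiple=2
merge_threshold=10

# A filestore PG subfolder splits once it holds more than this many objects:
split_point=$(( 16 * split_multiple * merge_threshold ))
echo "$split_point"
```

When many PGs cross that threshold at once, the resulting burst of directory splits can stall OSD threads long enough for heartbeats to time out, which is why OSDs get marked down.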
Re: [ceph-users] OSDs wrongly marked down
Hi Sergio,

in my case it was a network problem: occasionally (due to network problems)
mon.{id} couldn't reach osd.{id}.

The messages "fault, initiating reconnect" and "failed lossy con" in your
logs suggest a network problem.

See also:

http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds

Lorenzo

2017-12-20 15:13 GMT+01:00 Sergio Morales:
> [...]

--
Lorenzo Garuti
CED MaxMara
email: garut...@maxmara.it
tel: 0522 3993772 - 335 8416054
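If a flaky link or an MTU mismatch between the public and cluster networks is suspected, a quick sanity check from one OSD host to another might look like this (the target address is taken from the logs; the interface name and jumbo-frame MTU are assumptions for your environment):

```shell
# Basic reachability between OSD hosts on the cluster network
ping -c 5 172.17.4.1

# Check for an MTU mismatch: send a max-size packet with the
# don't-fragment bit set (8972 = 9000-byte jumbo frame minus 28 bytes
# of IP/ICMP headers; use 1472 on a standard 1500-MTU network)
ping -M do -s 8972 -c 3 172.17.4.1

# Look for drops/errors on the interface carrying cluster traffic
ip -s link show dev eth0
```

An MTU mismatch is a classic cause of exactly this symptom: small heartbeat packets get through while large data messages are silently dropped, so OSDs look alive to some peers and dead to others.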
[ceph-users] OSDs wrongly marked down
Hi.

I'm having a problem with the OSDs in my cluster.

Randomly some OSDs get wrongly marked down. I set my "mon osd min down
reporters" to OSDs + 1, but I still get this problem.

Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on CentOS 7.

The logs show this:

2017-12-19 16:59:26.357707 7fa9177d3700 0 -- 172.17.4.2:6830/4775054 >> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089 cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
2017-12-19 16:59:26.360240 7fa8e5652700 0 -- 172.17.4.2:6830/4775054 >> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272 cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
2017-12-19 16:59:25.716758 7fa8e74c1700 0 -- 172.17.4.2:6830/4775054 >> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105 cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
2017-12-19 16:59:25.716308 7fa9849ed700 0 -- 172.17.3.2:6802/3775054 submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0605 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976] v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141, failed lossy con, dropping message 0x7fa8830edb00
2017-12-19 16:59:25.718694 7fa9849ed700 0 -- 172.17.3.2:6802/3775054 submit_message osd_op_reply(10610054 rbd_data.6ccd3348ab9aac.011d [set-alloc-hint object_size 8388608 write_size 8388608,write 876544~4096] v26497'15075797 uv15075797 ondisk = 0) v6 remote, 172.17.1.4:0/1028032, failed lossy con, dropping message 0x7fa87a911700

--
Sergio A. Morales
Ingeniero de Sistemas
LINETS CHILE - 56 2 2412 5858
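For reference, that reporter setting lives in the [mon] section of ceph.conf; a sketch of what it looks like (the values here are illustrative, not recommendations — a common rule of thumb is one more than the number of OSDs sharing a host, so a single flapping host can't mark its peers down):

```
[mon]
# Require down reports from at least this many distinct OSDs
# before marking a peer OSD down
mon osd min down reporters = 3

# Require at least this many reports in total (pre-luminous option)
mon osd min down reports = 3
```

Note that if the real cause is network or subfolder splitting, raising the reporter thresholds only hides the symptom: the OSDs genuinely stop answering heartbeats, and fewer monitors hear about it.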