It could also be that your hardware is underpowered for the I/O load you
have. Try checking your resource utilization during peak workload while
recovery and scrubbing are running at the same time.
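For example, something like this (a rough sketch; adjust intervals and names to your setup, and note `iostat` comes from the sysstat package):

```shell
# Watch per-device utilization and latency during peak load (5s intervals)
iostat -x 5

# Per-OSD commit/apply latency as seen by the cluster itself
ceph osd perf

# Confirm whether recovery/backfill and scrubbing are overlapping right now
ceph -s
```

If `%util` sits near 100% or `await` spikes while recovery and scrubbing overlap, the disks are likely saturated and heartbeats can get starved.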
On 2017-12-20 17:03, David Turner wrote:
> When I have OSDs wrongly marked down, it's usually due to the
> filestore_split_multiple and filestore_merge_threshold settings, in a
> process I call PG subfolder splitting. This is no longer a factor with
> BlueStore, but as you're running Hammer, it's worth a look.
> http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/
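As a rough illustration of why those two settings matter: the commonly cited rule is that filestore splits a PG subfolder once it holds more than about 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects. A small sketch of that arithmetic, assuming the Hammer-era defaults of 2 and 10:

```python
# Sketch of the filestore subfolder split threshold (commonly cited rule):
# a subdirectory is split once it exceeds
# 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects.
def split_threshold(split_multiple: int, merge_threshold: int) -> int:
    return 16 * split_multiple * abs(merge_threshold)

# Assumed Hammer-era defaults: split_multiple=2, merge_threshold=10
print(split_threshold(2, 10))  # 320 objects per subfolder before a split
```

When many PGs cross that threshold at once, the resulting directory splits can stall OSDs long enough for them to miss heartbeats and get marked down.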
>
> On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo <[email protected]> wrote:
> Hi Sergio,
>
> in my case it was a network problem: occasionally (due to network issues)
> mon.{id} couldn't reach osd.{id}.
> The messages "fault, initiating reconnect" and "failed lossy con" in your
> logs suggest a network problem.
>
> See also:
>
> http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds
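A few quick network checks along the lines of those guides (a sketch; the 9000-byte jumbo MTU and the `eth1` interface name are assumptions, substitute your own):

```shell
# Check for an MTU mismatch toward a peer OSD host
# (8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 -c 5 172.17.4.3

# Look for drops/errors on the cluster-facing interface
ip -s link show eth1

# Retransmit counters climbing over time also point at the network
netstat -s | grep -i retrans
```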
>
>
> Lorenzo
>
> 2017-12-20 15:13 GMT+01:00 Sergio Morales <[email protected]>:
>
> Hi.
>
> I'm having a problem with the OSDs in my cluster.
>
> Randomly, some OSDs get wrongly marked down. I set my "mon osd min down
> reporters" to OSDs + 1, but I still get this problem.
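To confirm the setting is actually applied at runtime on Hammer, something like this should work (the monitor name `mon.a` and the value 3 are illustrative only):

```shell
# Show the currently applied value on one monitor
ceph daemon mon.a config get mon_osd_min_down_reporters

# Inject a new value into all monitors at runtime
# (persist it in ceph.conf under [mon] as well, or it is lost on restart)
ceph tell mon.* injectargs '--mon-osd-min-down-reporters 3'
```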
>
> Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on CentOS 7.
>
> The logs show this:
>
> 2017-12-19 16:59:26.357707 7fa9177d3700 0 -- 172.17.4.2:6830/4775054 >>
> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089
> cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
> 2017-12-19 16:59:26.360240 7fa8e5652700 0 -- 172.17.4.2:6830/4775054 >>
> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272
> cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
>
> 2017-12-19 16:59:25.716758 7fa8e74c1700 0 -- 172.17.4.2:6830/4775054 >>
> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105
> cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
> 2017-12-19 16:59:25.716308 7fa9849ed700 0 -- 172.17.3.2:6802/3775054
> submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0000000000000605
> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976]
> v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141,
> failed lossy con, dropping message 0x7fa8830edb00
> 2017-12-19 16:59:25.718694 7fa9849ed700 0 -- 172.17.3.2:6802/3775054
> submit_message osd_op_reply(10610054 rbd_data.6ccd3348ab9aac.000000000000011d
> [set-alloc-hint object_size 8388608 write_size 8388608,write 876544~4096]
> v26497'15075797 uv15075797 ondisk = 0) v6 remote, 172.17.1.4:0/1028032,
> failed lossy con, dropping message 0x7fa87a911700
>
> --
>
> Sergio A. Morales
> Ingeniero de Sistemas
> LINETS CHILE - 56 2 2412 5858
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
>
> Lorenzo Garuti
> CED MaxMara
> email: [email protected]
> tel: 0522 3993772 - 335 8416054