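When I have OSDs wrongly marked down, it's usually related to the
filestore_split_multiple and filestore_merge_threshold settings, through
something I call PG subfolder splitting: while an OSD is busy splitting a
PG's subfolders, it can be slow enough to respond that it gets reported
down.  This is no longer a factor with bluestore, but as you're running
hammer, it's worth a look.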
http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/
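
As a rough illustration of how the two settings interact: the filestore
config reference describes the maximum number of files in a subfolder
before it splits as filestore_split_multiple *
abs(filestore_merge_threshold) * 16.  The sketch below just does that
arithmetic.  The function name is made up for the example, and the
defaults of 2 and 10 are what I'd assume for hammer, so check them
against your own config.

```python
# Sketch only: split_threshold() is a hypothetical helper, not a Ceph API.
# It computes the per-subfolder file count at which filestore splits,
# using the formula from the filestore config reference:
#   filestore_split_multiple * abs(filestore_merge_threshold) * 16


def split_threshold(split_multiple: int = 2, merge_threshold: int = 10) -> int:
    """Return the file count at which a PG subfolder is split.

    The defaults (2 and 10) are assumed hammer defaults; verify them
    on your own cluster before relying on the result.
    """
    return split_multiple * abs(merge_threshold) * 16


if __name__ == "__main__":
    # Assumed defaults
    print("default threshold:", split_threshold())
    # Example of larger, purely illustrative values
    print("tuned threshold:  ", split_threshold(8, 40))
```

Raising either value makes splits rarer but bigger when they do happen,
so it's a trade-off rather than a pure win.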

On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo <[email protected]> wrote:

> Hi Sergio,
>
> in my case it was a network problem: occasionally mon.{id} couldn't
> reach osd.{id}.
> The messages "fault, initiating reconnect" and "failed lossy con" in
> your logs suggest a network problem.
>
> See also:
>
>
> http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds
>
> Lorenzo
>
> 2017-12-20 15:13 GMT+01:00 Sergio Morales <[email protected]>:
>
>> Hi.
>>
>> I'm having a problem with the OSDs in my cluster.
>>
>>
>> Randomly, some OSDs get wrongly marked down. I set my "mon osd min down
>> reporters" to OSD + 1, but I still get this problem.
>>
>> Any tips or ideas for troubleshooting this? I'm using Ceph 0.94.5 on
>> CentOS 7.
>>
>> The logs show this:
>>
>> 2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 >>
>> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089
>> cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
>> 2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 >>
>> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272
>> cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
>>
>> 2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 >>
>> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105
>> cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
>> 2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
>> submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0000000000000605
>> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976]
>> v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141,
>> failed lossy con, dropping message 0x7fa8830edb00
>> 2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
>> submit_message osd_op_reply(10610054
>> rbd_data.6ccd3348ab9aac.000000000000011d [set-alloc-hint object_size
>> 8388608 write_size 8388608,write 876544~4096] v26497'15075797 uv15075797
>> ondisk = 0) v6 remote, 172.17.1.4:0/1028032, failed lossy con, dropping
>> message 0x7fa87a911700
>>
>>
>> --
>> Sergio A. Morales
>> Ingeniero de Sistemas
>> LINETS CHILE - 56 2 2412 5858
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Lorenzo Garuti
> CED MaxMara
> email: [email protected]
> tel: 0522 3993772 - 335 8416054
>
