Re: [ceph-users] OSDs wrongly marked down

2017-12-20 Thread Maged Mokhtar
It could also be that your hardware is underpowered for the I/O you have. Try to
check your resource load during peak workload, together with recovery
and scrubbing going on at the same time.
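
For example, something like this on the OSD hosts while recovery and
scrubbing are running (a rough sketch; disk and pool names will differ):

    iostat -x 5           # high %util / await on the OSD disks points to underpowered spindles
    sar -n DEV 5          # per-NIC throughput; watch for a saturated cluster network link
    ceph -s               # shows whether recovery/scrubbing coincides with the flaps
    ceph osd pool stats   # per-pool recovery and client I/O rates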

On 2017-12-20 17:03, David Turner wrote:

> When I have OSDs wrongly marked down it's usually to do with the 
> filestore_split_multiple and filestore_merge_threshold in a thing I call PG 
> subfolder splitting.  This is no longer a factor with bluestore, but as 
> you're running hammer, it's worth a look.  
> http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/ 
> 
> On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo  wrote: 
> Hi Sergio,  
> 
> in my case it was a network problem: occasionally mon.{id} couldn't reach 
> osd.{id}. 
> The messages "fault, initiating reconnect" and "failed lossy con" in your 
> logs suggest a network problem. 
> 
> See also: 
> 
> http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
>  
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds
>  
> 
> Lorenzo 
> 
> 2017-12-20 15:13 GMT+01:00 Sergio Morales :
> 
> Hi.
> 
> I'm having a problem with the OSDs in my cluster. 
> 
> Randomly, some OSDs get wrongly marked down. I set my "mon osd min down 
> reporters" to OSDs + 1, but I still get this problem.
> 
> Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on CentOS 
> 7.
> 
> The logs show this:
> 
> 2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 >> 
> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089 
> cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
> 2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 >> 
> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272 
> cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
> 
> 2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 >> 
> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105 
> cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
> 2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054 
> submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0605 
> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976] 
> v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141, 
> failed lossy con, dropping message 0x7fa8830edb00
> 2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054 
> submit_message osd_op_reply(10610054 rbd_data.6ccd3348ab9aac.011d 
> [set-alloc-hint object_size 8388608 write_size 8388608,write 876544~4096] 
> v26497'15075797 uv15075797 ondisk = 0) v6 remote, 172.17.1.4:0/1028032, 
> failed lossy con, dropping message 0x7fa87a911700
> 
> -- 
> 
> Sergio A. Morales 
> Ingeniero de Sistemas 
> LINETS CHILE - 56 2 2412 5858 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
> 
> Lorenzo Garuti
> CED MaxMara
> email: garut...@maxmara.it 
> tel: 0522 3993772 - 335 8416054 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

 



Re: [ceph-users] OSDs wrongly marked down

2017-12-20 Thread David Turner
When I have OSDs wrongly marked down it's usually to do with the
filestore_split_multiple and filestore_merge_threshold in a thing I call PG
subfolder splitting.  This is no longer a factor with bluestore, but as
you're running hammer, it's worth a look.
http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/
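
As a rough sketch, this is where those settings live; the values below are
purely illustrative, not a recommendation (hammer defaults are split multiple
= 2, merge threshold = 10, and a subfolder splits at roughly
split_multiple * merge_threshold * 16 objects):

    # ceph.conf on the OSD hosts
    [osd]
    filestore split multiple = 8
    filestore merge threshold = 40

Restart the OSDs to pick up the change. Raising the thresholds only delays
future splits; it doesn't undo ones that already happened.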

On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo  wrote:

> Hi Sergio,
>
> in my case it was a network problem: occasionally mon.{id} couldn't reach
> osd.{id}.
> The messages "fault, initiating reconnect" and "failed lossy con" in your
> logs suggest a network problem.
>
> See also:
>
>
> http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds
>
> Lorenzo
>
> 2017-12-20 15:13 GMT+01:00 Sergio Morales :
>
>> Hi.
>>
>> I'm having a problem with the OSDs in my cluster.
>>
>>
>> Randomly, some OSDs get wrongly marked down. I set my "mon osd min down
>> reporters" to OSDs + 1, but I still get this problem.
>>
>> Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on
>> CentOS 7.
>>
>> The logs show this:
>>
>> 2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 >>
>> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089
>> cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
>> 2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 >>
>> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272
>> cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
>>
>> 2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 >>
>> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105
>> cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
>> 2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
>> submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0605
>> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976]
>> v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141,
>> failed lossy con, dropping message 0x7fa8830edb00
>> 2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
>> submit_message osd_op_reply(10610054
>> rbd_data.6ccd3348ab9aac.011d [set-alloc-hint object_size
>> 8388608 write_size 8388608,write 876544~4096] v26497'15075797 uv15075797
>> ondisk = 0) v6 remote, 172.17.1.4:0/1028032, failed lossy con, dropping
>> message 0x7fa87a911700
>>
>>
>> --
>> Sergio A. Morales
>> Ingeniero de Sistemas
>> LINETS CHILE - 56 2 2412 5858
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Lorenzo Garuti
> CED MaxMara
> email: garut...@maxmara.it
> tel: 0522 3993772 - 335 8416054
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs wrongly marked down

2017-12-20 Thread Garuti, Lorenzo
Hi Sergio,

in my case it was a network problem: occasionally mon.{id} couldn't reach
osd.{id}.
The messages "fault, initiating reconnect" and "failed lossy con" in your
logs suggest a network problem.

See also:

http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds
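
If you want to rule the network in or out quickly, a minimal check between
the hosts that appear in the log excerpt (addresses taken from it) is
something like:

    # from the host with 172.17.4.2, test the peers it failed to reach
    ping -c 10 172.17.4.1
    ping -c 10 172.17.4.3
    # large unfragmented packets catch MTU mismatches on the cluster network
    ping -M do -s 8972 -c 10 172.17.4.1   # assumes 9000 MTU; use -s 1472 for 1500
    # per-interface error and drop counters on every node
    ip -s link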

Lorenzo

2017-12-20 15:13 GMT+01:00 Sergio Morales :

> Hi.
>
> I'm having a problem with the OSDs in my cluster.
>
>
> Randomly, some OSDs get wrongly marked down. I set my "mon osd min down
> reporters" to OSDs + 1, but I still get this problem.
>
> Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on
> CentOS 7.
>
> The logs show this:
>
> 2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 >>
> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089
> cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
> 2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 >>
> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272
> cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect
>
> 2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 >>
> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105
> cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
> 2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
> submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0605
> [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976]
> v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141,
> failed lossy con, dropping message 0x7fa8830edb00
> 2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
> submit_message osd_op_reply(10610054 rbd_data.6ccd3348ab9aac.011d
> [set-alloc-hint object_size 8388608 write_size 8388608,write 876544~4096]
> v26497'15075797 uv15075797 ondisk = 0) v6 remote, 172.17.1.4:0/1028032,
> failed lossy con, dropping message 0x7fa87a911700
>
>
> --
> Sergio A. Morales
> Ingeniero de Sistemas
> LINETS CHILE - 56 2 2412 5858
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Lorenzo Garuti
CED MaxMara
email: garut...@maxmara.it
tel: 0522 3993772 - 335 8416054
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs wrongly marked down

2017-12-20 Thread Sergio Morales
Hi.

I'm having a problem with the OSDs in my cluster.


Randomly, some OSDs get wrongly marked down. I set my "mon osd min down
reporters" to OSDs + 1, but I still get this problem.

Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on
CentOS 7.

The logs show this:

2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 >>
172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089 cs=1
l=0 c=0x7fa8a0965f00).connect got RESETSESSION
2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 >>
172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272 cs=1
l=0 c=0x7fa931045680).fault, initiating reconnect

2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 >>
172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105 cs=1
l=0 c=0x7fa8a051a180).connect got RESETSESSION
2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0605
[set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976]
v26497'18879046 uv18879046 ondisk = 0) v6 remote, 172.17.1.3:0/5911141,
failed lossy con, dropping message 0x7fa8830edb00
2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054
submit_message osd_op_reply(10610054
rbd_data.6ccd3348ab9aac.011d [set-alloc-hint object_size
8388608 write_size 8388608,write 876544~4096] v26497'15075797 uv15075797
ondisk = 0) v6 remote, 172.17.1.4:0/1028032, failed lossy con, dropping
message 0x7fa87a911700


-- 
Sergio A. Morales
Ingeniero de Sistemas
LINETS CHILE - 56 2 2412 5858
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com