Hello Sage
The nodown and noout flags are set on the cluster.
# ceph status
    cluster 009d3518-e60d-4f74-a26d-c08c1976263c
     health HEALTH_WARN 1133 pgs degraded; 44 pgs incomplete; 42 pgs stale; 45 pgs stuck inactive; 42 pgs stuck stale; 2602 pgs stuck unclean; recovery 206/2199 objects degraded (9.368%); 40/165 in osds are down; nodown,noout flag(s) set
     monmap e4: 4 mons at {storage0101-ib=192.168.100.101:6789/0,storage0110-ib=192.168.100.110:6789/0,storage0114-ib=192.168.100.114:6789/0,storage0115-ib=192.168.100.115:6789/0}, election epoch 18, quorum 0,1,2,3 storage0101-ib,storage0110-ib,storage0114-ib,storage0115-ib
     osdmap e358031: 165 osds: 125 up, 165 in
            flags nodown,noout
      pgmap v604305: 4544 pgs, 6 pools, 4309 MB data, 733 objects
            3582 GB used, 357 TB / 361 TB avail
            206/2199 objects degraded (9.368%)
                   1 inactive
                   5 stale+active+degraded+remapped
                1931 active+clean
                   2 stale+incomplete
                  21 stale+active+remapped
                 380 active+degraded+remapped
                  38 incomplete
                1403 active+remapped
                   2 stale+active+degraded
                   1 stale+remapped+incomplete
                 746 active+degraded
                  11 stale+active+clean
                   3 remapped+incomplete
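As a quick sanity check, the per-state counts in the pgmap section above can be tallied with awk. The snippet below runs against the pasted sample; on the live cluster you would pipe the output of "ceph pg dump_stuck" or similar instead of the here-doc:

```shell
# Sum total PGs and PGs in any "incomplete" state from the pasted pgmap sample.
awk '{ total += $1; if ($2 ~ /incomplete/) inc += $1 }
     END { print total " pgs total, " inc " incomplete" }' <<'EOF'
1 inactive
5 stale+active+degraded+remapped
1931 active+clean
2 stale+incomplete
21 stale+active+remapped
380 active+degraded+remapped
38 incomplete
1403 active+remapped
2 stale+active+degraded
1 stale+remapped+incomplete
746 active+degraded
11 stale+active+clean
3 remapped+incomplete
EOF
# prints: 4544 pgs total, 44 incomplete  (matching the health summary above)
```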
Here is my ceph.conf: http://pastebin.com/KZdgPJm7 (debug osd and debug ms are set).
I tried restarting all OSD services on node 13; the services came up after
several attempts of "service ceph restart": http://pastebin.com/yMk86YHh
For node 14, all services are up:
[root@storage0114-ib ~]# service ceph status
=== osd.142 ===
osd.142: running {"version":"0.80-475-g9e80c29"}
=== osd.36 ===
osd.36: running {"version":"0.80-475-g9e80c29"}
=== osd.83 ===
osd.83: running {"version":"0.80-475-g9e80c29"}
=== osd.107 ===
osd.107: running {"version":"0.80-475-g9e80c29"}
=== osd.47 ===
osd.47: running {"version":"0.80-475-g9e80c29"}
=== osd.130 ===
osd.130: running {"version":"0.80-475-g9e80c29"}
=== osd.155 ===
osd.155: running {"version":"0.80-475-g9e80c29"}
=== osd.60 ===
osd.60: running {"version":"0.80-475-g9e80c29"}
=== osd.118 ===
osd.118: running {"version":"0.80-475-g9e80c29"}
=== osd.98 ===
osd.98: running {"version":"0.80-475-g9e80c29"}
=== osd.70 ===
osd.70: running {"version":"0.80-475-g9e80c29"}
=== mon.storage0114-ib ===
mon.storage0114-ib: running {"version":"0.80-475-g9e80c29"}
[root@storage0114-ib ~]#
But "ceph osd tree" says osd.118 is down:
-10 29.93 host storage0114-ib
36 2.63 osd.36 up 1
47 2.73 osd.47 up 1
60 2.73 osd.60 up 1
70 2.73 osd.70 up 1
83 2.73 osd.83 up 1
98 2.73 osd.98 up 1
107 2.73 osd.107 up 1
118 2.73 osd.118 down 1
130 2.73 osd.130 up 1
142 2.73 osd.142 up 1
155 2.73 osd.155 up 1
I restarted the osd.118 service and the restart was successful, but it is still
showing as down in "ceph osd tree". I waited 30 minutes for it to stabilize, but
it is still not showing up there.
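To avoid scanning "ceph osd tree" by eye on all 15 hosts, the down entries can be filtered out with awk. Here it is run against the storage0114-ib sample pasted above; on the cluster you would pipe the live command output instead:

```shell
# Print only the OSDs the osdmap considers down, from `ceph osd tree` output.
awk '$3 ~ /^osd\./ && $4 == "down" { print $3 }' <<'EOF'
-10 29.93 host storage0114-ib
36 2.63 osd.36 up 1
47 2.73 osd.47 up 1
60 2.73 osd.60 up 1
70 2.73 osd.70 up 1
83 2.73 osd.83 up 1
98 2.73 osd.98 up 1
107 2.73 osd.107 up 1
118 2.73 osd.118 down 1
130 2.73 osd.130 up 1
142 2.73 osd.142 up 1
155 2.73 osd.155 up 1
EOF
# prints: osd.118
```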
Moreover, it is generating huge logs: http://pastebin.com/mDYnjAni
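As a side note, if the debug settings requested earlier (debug osd = 20, debug ms = 1) are what is inflating the logs, they can be dialed back at runtime with the standard injectargs mechanism rather than another restart; osd.118 is used here only as the example daemon from above:

```shell
# Lower debug verbosity on a running OSD without restarting it.
ceph tell osd.118 injectargs '--debug-osd 0/5 --debug-ms 0/1'
```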
The problem now is that if I manually visit every host and check "service ceph
status", all services are running on all 15 hosts, but this is not reflected in
"ceph osd tree" and "ceph -s", which continue to show the OSDs as down.
My IRC id is ksingh; let me know by email once you are available on IRC (my
time zone is Finland, UTC+2).
- Karan Singh -
On 20 May 2014, at 18:18, Sage Weil <[email protected]> wrote:
> On Tue, 20 May 2014, Karan Singh wrote:
>> Hello Cephers, I need your suggestions for troubleshooting.
>>
>> My cluster is struggling terribly; 70+ OSDs out of 165 are down.
>>
>> Problem: OSDs are getting marked out of the cluster and are down. The cluster is
>> degraded. On checking the logs of the failed OSDs, we see weird entries that
>> are continuously being generated.
>
> Tracking this at http://tracker.ceph.com/issues/8387
>
> The most recent bits you posted in the ticket don't quite make sense: the
> OSD is trying to connect to an address for an OSD that is currently marked
> down. I suspect this is just timing between when the logs were captured
> and when the ceph osd dump was captured. To get a complete picture,
> please:
>
> 1) add
>
> debug osd = 20
> debug ms = 1
>
> in [osd] and restart all osds
>
> 2) ceph osd set nodown
>
> (to prevent flapping)
>
> 3) find some OSD that is showing these messages
>
> 4) capture a 'ceph osd dump' output.
>
> Also happy to debug this interactively over IRC; that will likely be
> faster!
>
> Thanks-
> sage
>
>
>
>>
>> OSD debug logs: http://pastebin.com/agTKh6zB
>>
>>
>> 2014-05-20 10:19:03.699886 7f2328e237a0 0 osd.158 357532 done with init, starting boot process
>> 2014-05-20 10:19:03.700093 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!
>> 2014-05-20 10:19:03.700152 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).fault with nothing to send, going to standby
>> 2014-05-20 10:19:09.551269 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not 192.168.1.109:6803/1176009454 - wrong node!
>> 2014-05-20 10:19:09.551347 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).fault with nothing to send, going to standby
>> 2014-05-20 10:19:09.703901 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not 192.168.1.113:6802/13870 - wrong node!
>> 2014-05-20 10:19:09.704039 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).fault with nothing to send, going to standby
>> 2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>> 2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>> 2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>>
>>
>> # ceph -v
>> ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
>> # ceph osd stat
>> osdmap e357073: 165 osds: 91 up, 165 in
>>        flags noout
>>
>> I have tried the following:
>>
>> 1. Restarting the problematic OSDs, but no luck.
>> 2. Restarting the entire host, but no luck; the OSDs are still down and
>> generating the same message:
>>
>> 2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
>> 2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
>> 2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
>> 2014-05-20 10:22:23.312473 7f2307e61700 0 osd.158 357781 do_command r=0
>> 2014-05-20 10:22:23.326110 7f2307e61700 0 osd.158 357781 do_command r=0 debug_osd=0/5
>> 2014-05-20 10:22:23.326123 7f2307e61700 0 log [INF] : debug_osd=0/5
>> 2014-05-20 10:34:08.161864 7f230224d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1 l=0 c=0x8301600).fault with nothing to send, going to standby
>>
>> 3. The disks do not have errors; there are no messages in dmesg or /var/log/messages.
>>
>> 4. There was a similar bug in the past, http://tracker.ceph.com/issues/4006 ; I don't
>> know if it has come back in Firefly.
>>
>> 5. No activity was performed on the cluster recently, except creating some pools
>> and keys for Cinder/Glance integration.
>>
>> 6. The nodes have enough free resources for the OSDs.
>>
>> 7. There are no issues with the network; OSDs are down across all cluster nodes,
>> not just one.
>>
>>
>> ****************************************************************
>> Karan Singh
>> Systems Specialist , Storage Platforms
>> CSC - IT Center for Science,
>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>> mobile: +358 503 812758
>> tel. +358 9 4572001
>> fax +358 9 4572302
>> http://www.csc.fi/
>> ****************************************************************
>>
>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com