Hello Community,
I need help with my production Ceph cluster, where multiple OSDs are crashing after logging this error:
2015-08-11 16:01:19.617860 7f3d95219700 -1 accepter.accepter.bind unable to
bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already
in use
2015-08-11 16:01:19.618929 7f3d95219700 -1 accepter.accepter.bind unable to
bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already
in use
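One way to check whether the 6800-7300 range is really exhausted on an affected node is to try binding each port in turn. The following is only a minimal sketch, not part of any Ceph tooling: the function name `free_ports` is made up for illustration, it assumes plain IPv4 TCP sockets, and it briefly occupies each free port while probing, so it should be used with care on a production host.

```python
import errno
import socket


def free_ports(host, first=6800, last=7300):
    """Return the ports in [first, last] that can currently be bound on host.

    Hypothetical helper for illustration; the defaults mirror the range
    printed in the error message above.
    """
    available = []
    for port in range(first, last + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # No SO_REUSEADDR here: a port held by another socket must fail.
            sock.bind((host, port))
        except OSError as exc:
            # Only "Address already in use" (errno 98 on Linux) means "taken";
            # anything else (e.g. address not local) is a real error.
            if exc.errno != errno.EADDRINUSE:
                raise
        else:
            available.append(port)
        finally:
            sock.close()
    return available
```

Calling `free_ports("10.100.50.1")` on the node that owns that address returns the list of ports that are still bindable; an empty list would match the error above.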
This is the second time I have seen this problem in the last 4 days. Earlier I restarted the
OSD services and they worked at first, but today the OSDs broke again.
Here is the backtrace:
-10> 2015-08-10 12:38:02.766359 7faa0abce700 -1 osd.60 39761 heartbeat_check:
no reply from osd.33 ever on either front or back, first ping sent 2015-08-10
12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
-9> 2015-08-10 12:38:02.766423 7faa0abce700 -1 osd.60 39761
heartbeat_check: no reply from osd.50 ever on either front or back, first ping
sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
-8> 2015-08-10 12:38:02.766433 7faa0abce700 -1 osd.60 39761
heartbeat_check: no reply from osd.134 ever on either front or back, first ping
sent 2015-08-10 12:37:23.469422 (cutoff 2015-08-10 12:37:42.766354)
-7> 2015-08-10 12:38:02.766446 7faa0abce700 -1 osd.60 39761
heartbeat_check: no reply from osd.200 ever on either front or back, first ping
sent 2015-08-10 12:37:15.361731 (cutoff 2015-08-10 12:37:42.766354)
-6> 2015-08-10 12:38:02.766454 7faa0abce700 -1 osd.60 39761
heartbeat_check: no reply from osd.228 ever on either front or back, first ping
sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
-5> 2015-08-10 12:38:03.259647 7fa9b5b9a700 0 -- 10.100.50.2:0/82807 >>
10.100.50.4:7142/147030592 pipe(0x4ff3200 sd=399 :0 s=1 pgs=0 cs=0 l=1
c=0x44b3de0).fault
-4> 2015-08-10 12:38:03.259682 7fa9b5594700 0 -- 10.100.50.2:0/82807 >>
10.100.50.1:7204/408026440 pipe(0xf278f00 sd=411 :0 s=1 pgs=0 cs=0 l=1
c=0x44b7bc0).fault
-3> 2015-08-10 12:38:03.271675 7fa9ecda2700 0 log [WRN] : map e39763
wrongly marked me down
-2> 2015-08-10 12:38:03.306073 7fa9ecda2700 -1 accepter.accepter.bind
unable to bind to 10.100.50.2:7300 on any port in range 6800-7300: (98) Address
already in use
-1> 2015-08-10 12:38:03.368817 7fa9ecda2700 0 osd.60 39763 prepare_to_stop
starting shutdown
0> 2015-08-10 12:38:03.372071 7fa9ecda2700 -1 common/Mutex.cc: In function
'void Mutex::Lock(bool)' thread 7fa9ecda2700 time 2015-08-10 12:38:03.368886
common/Mutex.cc: 93: FAILED assert(r == 0)
ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (Mutex::Lock(bool)+0x1d3) [0xa83003]
2: (OSD::shutdown()+0x63) [0x63f3f3]
3: (OSD::handle_osd_map(MOSDMap*)+0x1829) [0x64dff9]
4: (OSD::_dispatch(Message*)+0x2fb) [0x6600eb]
5: (OSD::ms_dispatch(Message*)+0x211) [0x6607b1]
6: (DispatchQueue::entry()+0x5a2) [0xb5ac12]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf23ad]
8: /lib64/libpthread.so.0() [0x35952079d1]
9: (clone()+0x6d) [0x3594ee89dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
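To compare dumps like this across OSDs, the unreachable heartbeat peers and the bind failures can be pulled out of the log text. This is a rough sketch under stated assumptions: the function name `summarize` and both regular expressions are illustrative and were written against the message wording shown above, not taken from Ceph itself. Whitespace is collapsed first because the log lines are wrapped in this mail.

```python
import re

# Patterns based on the log wording in the dump above (an assumption,
# not an official Ceph log format specification).
HEARTBEAT_RE = re.compile(r"heartbeat_check: no reply from osd\.(\d+)")
BIND_RE = re.compile(
    r"unable to bind to (\S+):(\d+) on any port in range (\d+)-(\d+)"
)


def summarize(log_text):
    """Return (sorted peer OSD ids with no heartbeat reply, bind failures).

    Each bind failure is a tuple (address, range_start, range_end).
    """
    # Collapse line wraps and runs of spaces so patterns match across lines.
    flat = " ".join(log_text.split())
    peers = sorted({int(osd_id) for osd_id in HEARTBEAT_RE.findall(flat)})
    bind_failures = [
        (addr, int(low), int(high))
        for addr, _port, low, high in BIND_RE.findall(flat)
    ]
    return peers, bind_failures
```

Feeding it the contents of an OSD log file (for example via `open(path).read()`) gives one summary per OSD that can be compared between the affected and unaffected nodes.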
My Environment
ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
Kernel : 2.6.32-431.el6.x86_64
CentOS release 6.5 (Final)
I have 4 OSD nodes, but only 2 of them have shown this error.
I have reported this at http://tracker.ceph.com/issues/12655
****************************************************************
Karan Singh
Systems Specialist , Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/
****************************************************************
