Hi All,
I have a Ceph cluster with 8 nodes, 3 OSDs on each node, and 3 monitors.
I ran the command:
ceph osd thrash 101
2014-06-09 18:05:58.001622 7f421da58700 0 mon.ip-10-15-16-63@0(leader) e1
handle_command mon_command(
{"prefix": "osd thrash", "num_epochs": 101}
v 0) v1
The OSDs are thrashed for 101 epochs by the OSDMonitor.
One OSD crashed with an assert:
osd.22 was marked down by the monitor in epoch 122:
2014-06-09 18:06:17.817630 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd
e122 osd.22 DOWN
2014-06-09 18:06:17.817639 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd
e122 osd.8 IN
2014-06-09 18:06:17.817643 7f421da58700 2 mon.ip-10-15-16-63@0(leader).osd
e122 osd.11 OUT
2014-06-09 18:06:17.817710 7f421da58700 0 log [INF] : osdmap e122: 24 osds:
17 up, 16 in
On epoch 125 of the map sharing, the OSD complained about being wrongly
marked down:
-452> 2014-06-09 18:06:22.880416 7fcfb4b7c700 1 osd.22 124 ms_handle_reset
con 0x6fe8dc0 session 0
-451> 2014-06-09 18:06:22.880433 7fcfb637f700 0 log [WRN] : map e125
wrongly marked me down
-450> 2014-06-09 18:06:22.880440 7fcfb637f700 1 osd.22 125
start_waiting_for_healthy
The OSD's health was then set to not healthy, and a map sync is needed
before it can be marked up again in the cluster. IOs were in progress on
this cluster, so one of them landed on this OSD, which was not active yet:
-1> 2014-06-09 18:06:22.922590 7fcfac36b700 1 osd.22 pg_epoch: 125
pg[4.11d( empty local-les=109 n=0 ec=108 les/c 109/109 123/123/120) [3,14]r=-1
lpr=123 pi=108-122/4 crt=0'0 inactive NOTIFY] state<Start>: transitioning
to Stray
0> 2014-06-09 18:06:22.922629 7fcfab369700 -1 osd/OSD.cc: In function 'void
OSDService::share_map(entity_name_t, Connection*, epoch_t, OSDMapRef&,
epoch_t*)' thread 7fcfab369700 time 2014-06-09 18:06:22.921311
osd/OSD.cc: 4781: FAILED assert(osd->is_active() || osd->is_stopping())
ceph version andisk-sprint-2-drop-3-390-g2dbd85c
(2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int,
std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x58f) [0x6351df]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x182) [0x635442]
3: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
4: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce)
[0xa4a1ce]
5: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
6: (()+0x8182) [0x7fcfc4a7d182]
7: (clone()+0x6d) [0x7fcfc2e1e30d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Thanks
Sahana
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com