Hi, Yes. Nice. Until all your OSDs fail and you don't know what else to try. Looking at the failure rates, it will happen very soon.
I want to recover them. I'm describing what I tried in another mail; let's see if someone can help me. I'm not doing anything special, just looking at my cluster from time to time and finding that something else has failed. It will be hard to recover from this situation. Thank you.

On 26/11/17 16:13, Marc Roos wrote:
>
> If I am not mistaken, the whole idea with the 3 replicas is that you
> have enough copies to recover from a failed OSD. In my tests this seems
> to go fine automatically. Are you doing something that is not advised?
>
>
> -----Original Message-----
> From: Gonzalo Aguilar Delgado [mailto:[email protected]]
> Sent: zaterdag 25 november 2017 20:44
> To: 'ceph-users'
> Subject: [ceph-users] Another OSD broken today. How can I recover it?
>
> Hello,
>
> I had another blackout with Ceph today. It seems that Ceph OSDs fail
> from time to time and are unable to recover. I have 3 OSDs down now:
> 1 removed from the cluster and 2 down because I'm unable to recover them.
>
> We really need a recovery tool. It's not normal that an OSD breaks and
> there's no way to recover it. Is there any way to do it?
> Last one shows this:
>
> ] enter Reset
>    -12> 2017-11-25 20:34:19.548891 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34(unlocked)] enter Initial
>    -11> 2017-11-25 20:34:19.548983 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.000091 0 0.000000
>    -10> 2017-11-25 20:34:19.548994 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
>     -9> 2017-11-25 20:34:19.549166 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36(unlocked)] enter Initial
>     -8> 2017-11-25 20:34:19.566781 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017614 0 0.000000
>     -7> 2017-11-25 20:34:19.566811 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
>     -6> 2017-11-25 20:34:19.585411 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c(unlocked)] enter Initial
>     -5> 2017-11-25 20:34:19.602888 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017478 0 0.000000
>     -4> 2017-11-25 20:34:19.602912 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
>     -3> 2017-11-25 20:34:19.603082 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10(unlocked)] enter Initial
>     -2> 2017-11-25 20:34:19.615456 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.012373 0 0.000000
>     -1> 2017-11-25 20:34:19.615481 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
>      0> 2017-11-25 20:34:19.617400 7f6e5dc158c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f6e5dc158c0 time 2017-11-25 20:34:19.615633
> osd/PG.cc: 3025: FAILED assert(values.size() == 2)
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5562d318d790]
>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5562d2b4b601]
>  3: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
>  4: (OSD::init()+0x2026) [0x5562d2aaaca6]
>  5: (main()+0x2ef1) [0x5562d2a1c301]
>  6: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
>  7: (_start()+0x29) [0x5562d2a5db09]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    1/ 5 kinetic
>    1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.4.log
> --- end dump of recent events ---
>
> 2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal (Aborted) ** in thread 7f6e5dc158c0 thread_name:ceph-osd
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (()+0x98653e) [0x5562d308d53e]
>  2: (()+0x11390) [0x7f6e5caee390]
>  3: (gsignal()+0x38) [0x7f6e5aa8a428]
>  4: (abort()+0x16a) [0x7f6e5aa8c02a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x5562d318d97b]
>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5562d2b4b601]
>  7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
>  8: (OSD::init()+0x2026) [0x5562d2aaaca6]
>  9: (main()+0x2ef1) [0x5562d2a1c301]
>  10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
>  11: (_start()+0x29) [0x5562d2a5db09]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> --- begin dump of recent events ---
>      0> 2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal (Aborted) ** in thread 7f6e5dc158c0 thread_name:ceph-osd
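For anyone hitting the same crash: the assert `FAILED assert(values.size() == 2)` fires inside `PG::peek_map_epoch()` while `OSD::load_pgs()` is reading PG metadata at startup, so a single PG with damaged on-disk metadata keeps the whole OSD from booting. The usual approach is to back up what `ceph-objectstore-tool` can still read and remove the broken PG so the daemon can start, letting the surviving replica backfill it. The following is only a sketch: the data/journal paths assume osd.4 with a standard FileStore layout, and pg 9.10 is a guess based on it being the last PG the log touched before the assert.

```shell
# Stop the OSD first so the object store is quiescent.
systemctl stop ceph-osd@4

# List the PGs the store holds; the broken one should be near where the
# startup log stopped (pg 9.10 in the trace above -- an assumption).
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal --op list-pgs

# Take a backup of the suspect PG before touching anything.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --pgid 9.10 --op export --file /root/pg9.10.export

# Remove the damaged PG so OSD::load_pgs() no longer trips the assert.
# Only do this if another replica (osd.0 per the log's [4,0] acting set)
# holds a current copy that can backfill.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --pgid 9.10 --op remove

systemctl start ceph-osd@4
```

If osd.4 never comes back at all, the export taken above can be loaded into another (stopped) OSD with `--op import --file /root/pg9.10.export` instead.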
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
