The OSD shouldn't be able to peer while it's down. I think this is good information to update your ticket with, as it may be hitting a different code path than anticipated.

Did your cluster see the OSD as up?
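For reference, a rough sketch of how to check that from the monitor side, plus the noup flag discussed further down in the thread (osd.37 is taken from the logs in the quoted thread below; substitute whichever OSD is actually crashing):

    ceph osd tree | grep 'osd\.37'      # up/down, in/out and weight as the monitors see it
    ceph osd dump | grep '^osd\.37 '    # full entry, including up_from/up_thru/down_at epochs
    ceph osd set noup                   # keep crashing OSDs from being marked up while investigating
    ceph osd unset noup                 # allow them to be marked up again afterwards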
On Sat, Nov 18, 2017, 9:32 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:

> Hello,
>
> So it seems noup does not help.
>
> Still have the same error:
>
> 2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **
> in thread 7fb4446cd700 thread_name:tp_peering
>
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
> 1: (()+0xa0c554) [0x56547f500554]
> 2: (()+0x110c0) [0x7fb45cabe0c0]
> 3: (gsignal()+0xcf) [0x7fb45ba85fcf]
> 4: (abort()+0x16a) [0x7fb45ba873fa]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x56547f547f0e]
> 6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]
> 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]
> 8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x188) [0x56547f06c6d8]
> 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x56547f045549]
> 10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x4a7) [0x56547f00e837]
> 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2e7) [0x56547ef56e67]
> 12: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
> 13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
> 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
> 15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
> 16: (()+0x7494) [0x7fb45cab4494]
> 17: (clone()+0x3f) [0x7fb45bb3baff]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> I guess even with noup the OSD/PG still has to peer with the other PGs, which is the stage that causes the failure. Most OSDs seem to stay up for about 30 seconds, and every time it's a different PG listed in the failure.
>
> ,Ashley
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* 18 November 2017 22:19
> *To:* Ashley Merrick <ash...@amerrick.co.uk>
> *Cc:* Eric Nelson <ericnel...@gmail.com>; ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
> Does letting the cluster run with noup for a while until all down disks are idle, and then letting them come in, help at all? I don't know your specific issue and haven't touched bluestore yet, but that is generally sound advice when it won't start.
>
> Also is there any pattern to the OSDs that are down? Common PGs, common hosts, common SSDs, etc?
>
> On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick <ash...@amerrick.co.uk> wrote:
>
> Hello,
>
> Any further suggestions or workarounds from anyone?
>
> Cluster is hard down now with around 2% of PGs offline. On occasion I am able to get an OSD to start for a bit, but then it will seem to do some peering and again crash with "*** Caught signal (Aborted) ** in thread 7f3471c55700 thread_name:tp_peering".
>
> ,Ashley
>
> *From:* Ashley Merrick
> *Sent:* 16 November 2017 17:27
> *To:* Eric Nelson <ericnel...@gmail.com>
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
> Hello,
>
> Good to hear it's not just me, however I have a cluster basically offline due to too many OSDs dropping for this issue.
>
> Anybody have any suggestions?
>
> ,Ashley
> ------------------------------
> *From:* Eric Nelson <ericnel...@gmail.com>
> *Sent:* 16 November 2017 00:06:14
> *To:* Ashley Merrick
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
> I've been seeing these as well on our SSD cache tier that's been ravaged by disk failures as of late... Same tp_peering assert as above, even running the luminous branch from git.
>
> Let me know if you have a bug filed I can +1, or have found a workaround.
>
> E
>
> On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick <ash...@amerrick.co.uk> wrote:
>
> Hello,
>
> After replacing a single OSD disk due to a failed disk I am now seeing 2-3 OSDs randomly stop and fail to start: they go into a boot loop, get to load_pgs, and then fail with the following (I tried setting the OSD logs to 5/5 but didn't get any extra lines around the error, just more information pre boot).
>
> Could this be a certain PG causing these OSDs to crash (6.2f2s10 for example)?
>
> -9> 2017-11-15 17:37:14.696229 7fa4ec50f700 1 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] state<Start>: transitioning to Stray
> -8> 2017-11-15 17:37:14.696239 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] exit Start 0.000019 0 0.000000
> -7> 2017-11-15 17:37:14.696250 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] enter Started/Stray
> -6> 2017-11-15 17:37:14.696324 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.000076
> -5> 2017-11-15 17:37:14.696337 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started
> -4> 2017-11-15 17:37:14.696346 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Start
> -3> 2017-11-15 17:37:14.696353 7fa4ec50f700 1 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] state<Start>: transitioning to Stray
> -2> 2017-11-15 17:37:14.696364 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Start 0.000018 0 0.000000
> -1> 2017-11-15 17:37:14.696372 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started/Stray
> 0> 2017-11-15 17:37:14.697245 7fa4ebd0e700 -1 *** Caught signal (Aborted) **
> in thread 7fa4ebd0e700 thread_name:tp_peering
>
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
> 1: (()+0xa3acdc) [0x55dfb6ba3cdc]
> 2: (()+0xf890) [0x7fa510e2c890]
> 3: (gsignal()+0x37) [0x7fa50fe66067]
> 4: (abort()+0x148) [0x7fa50fe67448]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x55dfb6be6f5f]
> 6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3) [0x55dfb670f8a3]
> 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55dfb670ff39]
> 8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x244) [0x55dfb67552a4]
> 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x55dfb6732c1b]
> 10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3) [0x55dfb6702ef3]
> 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x20a) [0x55dfb664db2a]
> 12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x175) [0x55dfb664e6b5]
> 13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55dfb66ae5a7]
> 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55dfb6bedb1f]
> 15: (ThreadPool::WorkThread::entry()+0x10) [0x55dfb6beea50]
> 16: (()+0x8064) [0x7fa510e25064]
> 17: (clone()+0x6d) [0x7fa50ff1962d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 rbd_mirror
> 0/ 5 rbd_replay
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 1/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 1/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> 0/ 0 refs
> 1/ 5 xio
> 1/ 5 compressor
> 1/ 5 bluestore
> 1/ 5 bluefs
> 1/ 3 bdev
> 1/ 5 kstore
> 4/ 5 rocksdb
> 4/ 5 leveldb
> 4/ 5 memdb
> 1/ 5 kinetic
> 1/ 5 fuse
> 1/ 5 mgr
> 1/ 5 mgrc
> 1/ 5 dpdk
> 1/ 5 eventtrace
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.37.log
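A side note on the "--- logging levels ---" dump in the quoted thread above: it still shows osd at 1/5, which looks like the default, so the 5/5 attempt mentioned earlier may not have taken effect when this crash was captured. A rough sketch of how I'd raise it and look for a common PG across the crashes (osd.37 and pg 6.2f2 are taken from the logs above; adjust to the OSDs actually failing; the grep pattern is only an illustration):

    # injectargs only works while the daemon is still up; for an OSD that dies
    # shortly after start, put "debug osd = 20" / "debug ms = 1" under [osd.37]
    # in ceph.conf instead and restart it
    ceph tell osd.37 injectargs '--debug_osd 20 --debug_ms 1'

    # which PGs appear in the events just before each abort in the OSD log
    grep -B10 'Caught signal' /var/log/ceph/ceph-osd.37.log | grep -o 'pg\[[^ ]*' | sort | uniq -c

    # monitor-side mapping, and (if the PG is still serviceable) a full peering query
    ceph pg map 6.2f2
    ceph pg 6.2f2 query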
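Also, on the NOTE about `objdump -rdS <executable>` in both backtraces: since frame 5 is ceph::__ceph_assert_fail, the failed assert itself (source file, line and condition inside PG::start_peering_interval) should already be printed in the OSD log shortly before the backtrace, and that is usually the most useful thing to attach to the tracker ticket. A sketch, assuming the default /usr/bin/ceph-osd path and the osd.37 log file shown above:

    # the assert line ceph prints before dumping recent events and the backtrace
    grep -B2 -A2 'FAILED assert' /var/log/ceph/ceph-osd.37.log | tail -n 20

    # only if a full disassembly is really needed, per the NOTE in the trace
    objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.dis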
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com