Yes, I was dealing with an issue where OSDs were not peering, and I was trying to see if force-create-pg could help recover the peering. Data loss is an accepted possibility.
I hope this is what you are looking for?

    -3> 2018-01-31 22:47:22.942394 7fc641d0b700  5 mon.dl1-kaf101@0(electing) e6 _ms_dispatch setting monitor caps on this connection
    -2> 2018-01-31 22:47:22.942405 7fc641d0b700  5 mon.dl1-kaf101@0(electing).paxos(paxos recovering c 28110997..28111530) is_readable = 0 - now=2018-01-31 22:47:22.942405 lease_expire=0.000000 has v0 lc 28111530
    -1> 2018-01-31 22:47:22.942422 7fc641d0b700  5 mon.dl1-kaf101@0(electing).paxos(paxos recovering c 28110997..28111530) is_readable = 0 - now=2018-01-31 22:47:22.942422 lease_expire=0.000000 has v0 lc 28111530
     0> 2018-01-31 22:47:22.955415 7fc64350e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/OSDMapMapping.h: In function 'void OSDMapMapping::get(pg_t, std::vector<int>*, int*, std::vector<int>*, int*) const' thread 7fc64350e700 time 2018-01-31 22:47:22.952877
    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/OSDMapMapping.h: 288: FAILED assert(pgid.ps() < p->second.pg_num)
    ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

--
Efficiency is Intelligent Laziness

On 2/2/18, 9:45 AM, "Sage Weil" <s...@newdream.net> wrote:

    On Fri, 2 Feb 2018, Frank Li wrote:
    > Hi, I ran the ceph osd force-create-pg command in luminous 12.2.2 to recover a failed pg, and it
    > instantly caused all of the monitors to crash. Is there any way to revert back to an earlier state of the cluster?
    > Right now, the monitors refuse to come up; the error message is as follows:
    > I've filed a ceph ticket for the crash, but just wonder if there is a way to get the cluster back up?
    > > https://tracker.ceph.com/issues/22847

    Can you include the bit of the log a few lines up that includes the
    assertion and the file/line number that failed?

    Also, "during the course of trouble-shooting an osd issue" makes me
    nervous: force-create-pg creates a new, *empty* PG when all copies of
    the old one have been lost. It is essentially telling the system to
    give up and accept that there is data loss. Is that what you meant to do?

    Thanks!
    sage

    > > --- begin dump of recent events ---
    > 0> 2018-01-31 22:47:22.959665 7fc64350e700 -1 *** Caught signal (Aborted) **
    > in thread 7fc64350e700 thread_name:cpu_tp
    >
    > ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
    > 1: (()+0x8eae11) [0x55f1113fae11]
    > 2: (()+0xf5e0) [0x7fc64aafa5e0]
    > 3: (gsignal()+0x37) [0x7fc647fca1f7]
    > 4: (abort()+0x148) [0x7fc647fcb8e8]
    > 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55f1110fa4a4]
    > 6: (()+0x2ccc4e) [0x55f110ddcc4e]
    > 7: (OSDMonitor::update_creating_pgs()+0x98b) [0x55f11102232b]
    > 8: (C_UpdateCreatingPGs::finish(int)+0x79) [0x55f1110777b9]
    > 9: (Context::complete(int)+0x9) [0x55f110ed30c9]
    > 10: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*, ThreadPool::TPHandle&)+0x7f) [0x55f111204e1f]
    > 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x55f111100f1e]
    > 12: (ThreadPool::WorkThread::entry()+0x10) [0x55f111101e00]
    > 13: (()+0x7e25) [0x7fc64aaf2e25]
    > 14: (clone()+0x6d) [0x7fc64808d34d]
    > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    > >
    > > --
    > > Efficiency is Intelligent Laziness
    > _______________________________________________
    > ceph-users mailing list
    > email@example.com
    > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com