On 2/19/20 5:45 AM, Sean Matheny wrote:
> I wanted to add a specific question to the previous post, in the hopes
> it's easier to answer.
>
> We have a Luminous monitor restored from the OSDs using
> ceph-objectstore-tool, which seems like our best chance of success. We
> followed this rough process:
>
> https://tracker.ceph.com/issues/24419
>
> The monitor has come up (as a single-monitor cluster), but it's reporting
> wildly inaccurate info, such as the number of OSDs that are down (157,
> when all 223 are down) and hosts (1, when all 14 are down).
>
Have you verified that the MON's database has the same OSDMap epoch as (or
newer than) all of the OSDs? If the OSDs have a newer OSDMap epoch than the
MON, it won't work.

> The OSD daemons are still off, but I'm not sure if starting them back up
> with this monitor will make things worse. The fact that this mon daemon
> can't even see how many OSDs are correctly down makes me think that
> nothing good will come from turning the OSDs back on.
>
> Do I run the risk of further corruption (i.e. on the Ceph side, not
> client data, as the cluster is paused) if I proceed and turn on the OSD
> daemons? Or is it worth a shot?
>
> Are these Ceph health metrics commonly inaccurate until the monitor can
> talk to the daemons?

The PG stats will indeed be inaccurate, and the number of OSDs can vary as
long as they aren't able to peer with each other and the MONs.

> (Also, other commands like `ceph osd tree` agree with the below `ceph -s`
> so far.)
>
> Many thanks for any wisdom... I just don't want to make things
> (unnecessarily) much worse.
>
> Cheers,
> Sean
>
>
> root@ntr-mon01:/var/log/ceph# ceph -s
>   cluster:
>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>     health: HEALTH_WARN
>             pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub flag(s) set
>             157 osds down
>             1 host (15 osds) down
>             Reduced data availability: 12225 pgs inactive, 885 pgs down, 673 pgs peering
>             Degraded data redundancy: 14829054/35961087 objects degraded (41.236%), 2869 pgs degraded, 2995 pgs undersized
>
>   services:
>     mon: 1 daemons, quorum ntr-mon01
>     mgr: ntr-mon01(active)
>     osd: 223 osds: 66 up, 223 in
>          flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub
>
>   data:
>     pools:   14 pools, 15220 pgs
>     objects: 10.58M objects, 40.1TiB
>     usage:   43.0TiB used, 121TiB / 164TiB avail
>     pgs:     70.085% pgs unknown
>              10.237% pgs not active
>              14829054/35961087 objects degraded (41.236%)
>              10667 unknown
>              2869  active+undersized+degraded
>              885   down
>              673   peering
>              126   active+undersized
>
>
> On 19/02/2020, at 10:18 AM, Sean Matheny <[email protected]> wrote:
>
> Hi folks,
>
> Our entire cluster is down at the moment.
>
> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The first
> monitor we upgraded crashed. We reverted to Luminous on this one and
> tried another, and it was fine. We upgraded the rest, and they all
> worked.
>
> Then we upgraded the first one again, and after it became the leader, it
> died. Then the second one became the leader, and it died. Then the third
> became the leader, and it died, leaving mon 4 and 5 unable to form a
> quorum.
>
> We tried creating a single-monitor cluster by editing the monmap of
> mon05, and it died in the same way, just without the paxos negotiation
> first.
>
> We have tried reverting to a Luminous (12.2.12) monitor backup taken a
> few hours before the crash. The mon daemon will start, but is flooded
> with blocked requests and unknown pgs after a while.
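
That is the sort of behaviour I would expect if the mon's OSDMap has fallen
behind the OSDs. A rough way to compare the epochs (a sketch only; the OSD
path and id below are examples -- run the first command with the mon up, and
the second on a host with that OSD stopped):

# On the mon host: the first line of `ceph osd dump` is the epoch the
# restored mon believes is current.
ceph osd dump | head -n1

# On an OSD host, with that OSD stopped: extract the newest OSDMap the OSD
# has stored and print its epoch.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op get-osdmap --file /tmp/osdmap.0
osdmaptool --print /tmp/osdmap.0 | grep '^epoch'

If the OSDs report a higher epoch than the mon, the rebuilt store is stale.
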
> For better or worse we removed the "noout" flag, and 144 of 232 OSDs are
> now showing as down:
>
>   cluster:
>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>     health: HEALTH_ERR
>             noout,nobackfill,norecover flag(s) set
>             101 osds down
>             9 hosts (143 osds) down
>             1 auth entities have invalid capabilities
>             Long heartbeat ping times on back interface seen, longest is 15424.178 msec
>             Long heartbeat ping times on front interface seen, longest is 14763.145 msec
>             Reduced data availability: 521 pgs inactive, 48 pgs stale
>             274 slow requests are blocked > 32 sec
>             88 stuck requests are blocked > 4096 sec
>             1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has slow ops
>             too many PGs per OSD (299 > max 250)
>
>   services:
>     mon: 1 daemons, quorum ntr-mon01 (age 3m)
>     mgr: ntr-mon01(active, since 30m)
>     mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
>     osd: 223 osds: 66 up, 167 in
>          flags noout,nobackfill,norecover
>     rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)
>
>   data:
>     pools:   14 pools, 15220 pgs
>     objects: 35.26M objects, 134 TiB
>     usage:   379 TiB used, 1014 TiB / 1.4 PiB avail
>     pgs:     3.423% pgs unknown
>              14651 active+clean
>              521   unknown
>              48    stale+active+clean
>
>   io:
>     client: 20 KiB/s rd, 439 KiB/s wr, 7 op/s rd, 54 op/s wr
>
> These Luminous OSD daemons are not actually down; they are all in fact
> running. They just have no comms with the monitor:
>
> 2020-02-19 10:12:33.565680 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.37as3( v 129516'2 (0'0,129516'2] local-lis/les=297268/297269 n=0 ec=129502/129502 lis/c 297268/297268 les/c/f 297269/297358/0 297268/297268/161526) [41,192,216,0,160,117]p41(0) r=3 lpr=305101 crt=129516'2 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> 2020-02-19 10:12:33.565861 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[4.53c( v 305046'1933429 (304777'1931907,305046'1933429] local-lis/les=298009/298010 n=7350 ec=768/768 lis/c 298009/298009 les/c/f 298010/298010/0 297268/298009/298009) [0,61,103] r=0 lpr=305101 crt=305046'1933429 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> 2020-02-19 10:12:33.566742 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.des4( v 129516'1 (0'0,129516'1] local-lis/les=292010/292011 n=1 ec=129502/129502 lis/c 292010/292010 les/c/f 292011/292417/0 292010/292010/280955) [149,62,209,187,0,98]p149(0) r=4 lpr=305072 crt=129516'1 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> 2020-02-19 10:12:33.566896 7ff23ccd9e00  0 osd.0 305104 done with init, starting boot process
> 2020-02-19 10:12:33.566956 7ff23ccd9e00  1 osd.0 305104 start_boot
>
> One oddity in our deployment is that there was a test MDS instance
> running Mimic. I shut it down, since the monitor trace has an MDS call in
> it, but the Nautilus monitors still die the same way.
>
>     "mds": {
>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
>     },
>
> ...
>    -11> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804429 lease_expire=0.000000 has v0 lc 85449502
>    -10> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804446 lease_expire=0.000000 has v0 lc 85449502
>     -9> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804460 lease_expire=0.000000 has v0 lc 85449502
>     -8> 2020-02-18 09:50:00.800 7fd164a1a700  4 set_mon_vals no callback set
>     -7> 2020-02-18 09:50:00.800 7fd164a1a700  4 mgrc handle_mgr_map Got map version 2301191
>     -6> 2020-02-18 09:50:00.804 7fd164a1a700  4 mgrc handle_mgr_map Active mgr is now v1:10.31.88.17:6801/2924412
>     -5> 2020-02-18 09:50:00.804 7fd164a1a700  0 log_channel(cluster) log [DBG] : monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>     -4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to self
>     -3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
>     -2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client will send 2020-02-18 09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>     -1> 2020-02-18 09:50:00.804 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos active c 85448935..85449502) is_readable = 1 - now=2020-02-18 09:50:00.806920 lease_expire=2020-02-18 09:50:05.804479 has v0 lc 85449502
>      0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
>  in thread 7fd164a1a700 thread_name:ms_dispatch
>
>  ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
>  1: (()+0x11390) [0x7fd171e98390]
>  2: (gsignal()+0x38) [0x7fd1715e5428]
>  3: (abort()+0x16a) [0x7fd1715e702a]
>  4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
>  5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
>  6: (()+0x8b6c21) [0x7fd173667c21]
>  7: (()+0x8c2e34) [0x7fd173673e34]
>  8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
>  9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
>  10: (MDSMonitor::tick()+0xc9) [0x79c669]
>  11: (MDSMonitor::on_active()+0x28) [0x785e88]
>  12: (PaxosService::_active()+0xdd) [0x6d4b2d]
>  13: (Context::complete(int)+0x9) [0x600789]
>  14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
>  15: (Paxos::finish_round()+0x76) [0x6cb276]
>  16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
>  17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
>  18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x5fa6f5]
>  19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
>  20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
>  21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6270b6]
>  22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
>  23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
>  24: (()+0x76ba) [0x7fd171e8e6ba]
>  25: (clone()+0x6d) [0x7fd1716b741d]
> ...
>
> Ceph versions output:
>
> {
>     "mon": {
>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 4
>     },
>     "mgr": {
>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 1,
>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 2
>     },
>     "osd": {
>         "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 32,
>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 16
>     },
>     "mds": {
>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
>     },
>     "rgw": {
>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
>     },
>     "overall": {
>         "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 35,
>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 18,
>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1,
>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 6
>     }
> }
>
> We've filed a bug report describing the actual cascading crash above
> (when we upgraded mon01 and it became the leader):
> https://tracker.ceph.com/issues/44185 (parts here are copied from that
> report)
>
> Right now we're not sure what the best path to some sort of recovery
> would be. All OSD daemons are still on Luminous, so AFAICT we could
> rebuild the monitor db from the OSDs per
> https://github.com/ceph/ceph/blob/luminous/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds
> which describes using this script:
>
> #!/bin/bash
> hosts="ntr-sto01 ntr-sto02"
> ms=/tmp/mon-store/
> mkdir $ms
> # collect the cluster map from OSDs
> for host in $hosts; do
>     echo $host
>     # push the store accumulated so far to the next host, then clear the
>     # local copy
>     rsync -avz $ms root@$host:$ms
>     rm -rf $ms
>     ssh root@$host <<EOF
>     for osd in /var/lib/ceph/osd/ceph-*; do
>         ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
>     done
> EOF
>     # pull the updated store back before moving on
>     rsync -avz root@$host:$ms $ms
> done
>
> If this is our best idea to try, should we run the mon store produced by
> the above script under a Luminous or a Nautilus mon daemon? Any other
> ideas to try at this dark hour? :\
>
> Cheers,
> Sean
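
One note on the recovery-using-OSDs route: the doc you linked is the
luminous branch, so the sketch below assumes the Luminous tools. Once the
script has accumulated /tmp/mon-store from every OSD host, the remaining
steps from that doc look roughly like this (the keyring path is a
placeholder, the mon name ntr-mon01 is taken from your status output, and
the paths assume the default /var/lib/ceph layout):

# Rebuild the mon store from the collected maps. The keyring must contain
# the mon. key and the client.admin key; auth data cannot be recovered from
# the OSDs, which is why a keyring has to be supplied.
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring

# On the mon host: back up the broken store, move the rebuilt one into
# place, and fix ownership.
mv /var/lib/ceph/mon/ceph-ntr-mon01/store.db /var/lib/ceph/mon/ceph-ntr-mon01/store.db.corrupted
cp -r /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-ntr-mon01/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ntr-mon01/store.db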
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]