Just for giggles I tried the rolling upgrade to 10.2.2 again today. This time I rolled mon.0 and osd.0 first while keeping the mds servers up, then rolled the mds servers before moving on to the other three nodes. No assertion failure this time, I assume because I always had an active mds. I wonder, though, whether I'll have a problem if I ever do a complete cold start of the cluster.
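
For reference, the per-node order that worked this time looked roughly like the sketch below. The OpenRC service script names are assumptions based on my layout (ceph-mon.0, ceph-osd.0, ...); only the ceph commands themselves are standard.

ceph osd set noout                             # keep CRUSH from rebalancing while daemons bounce
emerge --oneshot '=sys-cluster/ceph-10.2.2'    # upgrade the package on this node
/etc/init.d/ceph-mon.0 restart                 # restart this node's mon (script name assumed)
/etc/init.d/ceph-osd.0 restart                 # then its osd
ceph -s                                        # wait for quorum/HEALTH_OK before the next node
# ...repeat per node, leaving the mds daemons for last so one stays active...
ceph osd unset noout                           # once every node is on 10.2.2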

Bill Sharer


On 07/06/2016 04:19 PM, Bill Sharer wrote:
A manual downgrade to 10.2.0 put me back in business. I'm going to mask 10.2.2 and then try letting 10.2.1 emerge.
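
For anyone following along, the mask itself is just a one-line Portage entry, after which a plain re-emerge should pick 10.2.1 as the highest unmasked version. A sketch, assuming /etc/portage/package.mask is a directory (on a single-file layout, append to the file itself; the file name is arbitrary):

echo '=sys-cluster/ceph-10.2.2' >> /etc/portage/package.mask/ceph
emerge --oneshot --verbose sys-cluster/ceph    # should now pull in 10.2.1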

Bill Sharer

On 07/06/2016 02:16 PM, Bill Sharer wrote:
I noticed in that USE list that the 10.2.2 ebuild introduced a new cephfs USE flag, so I enabled it and re-emerged everywhere. The active mon is still crashing on the assertion, though.
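
For the record, enabling it was just the usual package.use entry plus a rebuild. Roughly (assuming /etc/portage/package.use is a directory; the file name is arbitrary):

echo 'sys-cluster/ceph cephfs' >> /etc/portage/package.use/ceph
emerge --oneshot --newuse sys-cluster/ceph    # rebuild with the new USE flag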


Bill Sharer

On 07/05/2016 08:14 PM, Bill Sharer wrote:
Relevant USE flags FWIW....

# emerge -pv ceph

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild R ~] sys-cluster/ceph-10.2.2::gentoo USE="fuse gtk jemalloc ldap libaio libatomic nss radosgw static-libs xfs -babeltrace -cephfs -cryptopp -debug -lttng -tcmalloc {-test} -zfs" PYTHON_TARGETS="python2_7 python3_4 -python3_5" 11,271 KiB


Bill Sharer


On 07/05/2016 01:45 PM, Gregory Farnum wrote:
Thanks for the report; created a ticket and somebody will get on it
shortly. http://tracker.ceph.com/issues/16592
-Greg

On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer <bsha...@sharerland.com> wrote:
I was working on a rolling upgrade on Gentoo from Jewel 10.2.0 to 10.2.2. However, now I can't get a monitor quorum going again, because as soon as I get one, the mon that wins the election blows out with an assertion failure. Here's my status at the moment:

kroll1    10.2.2    ceph mon.0 and ceph osd.0 (normally my lead mon)
kroll2    10.2.2    ceph mon.1 and ceph osd.2
kroll3    10.2.2    ceph osd.1
kroll4    10.2.2    ceph mon.3 and ceph osd.3
kroll5    10.2.2    ceph mon.4 and ceph mds 2 (normally my active mds)
kroll6    10.2.0    ceph mon.5 and ceph mds B (normally standby mds)
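
Since the cluster won't hold a quorum, the only mon state I can still query reliably is each mon's own view over its admin socket, e.g. (the .asok path is the default naming; adjust if yours differs):

ceph daemon mon.0 mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok mon_status    # equivalent form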

I had done a rolling upgrade of everything but kroll6 and had rebooted the first three osd and mon servers. mds 2 went down during the Gentoo update of kroll5 because of memory scarcity, so mds B was the active mds server. After rebooting kroll4 I found that mon.0 had gone down with the assertion failure. I ended up stopping all ceph processes; the desktops with client mounts were still up for the moment, but they would basically get stuck on locks if I tried to access cephfs.

I then tried restarting only the mons, beginning with mon.0. Once enough mons were up for a quorum, the following happened to mon.0:

2016-07-03 16:34:26.555728 7fbff22f8480  1 leveldb: Recovering log #2592390
2016-07-03 16:34:26.555762 7fbff22f8480  1 leveldb: Level-0 table #2592397: started
2016-07-03 16:34:26.558788 7fbff22f8480  1 leveldb: Level-0 table #2592397: 192 bytes OK
2016-07-03 16:34:26.562263 7fbff22f8480  1 leveldb: Delete type=3 #2592388
2016-07-03 16:34:26.562364 7fbff22f8480  1 leveldb: Delete type=0 #2592390
2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap
2016-07-03 17:09:25.753729 7f8291dff480  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mon, pid 20842
2016-07-03 17:09:25.762588 7f8291dff480  1 leveldb: Recovering log #2592398
2016-07-03 17:09:25.767722 7f8291dff480  1 leveldb: Delete type=0 #2592398
2016-07-03 17:09:25.767803 7f8291dff480  1 leveldb: Delete type=3 #2592396
2016-07-03 17:09:25.768600 7f8291dff480  0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2016-07-03 17:09:25.769066 7f8291dff480  1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac
2016-07-03 17:09:25.769923 7f8291dff480  1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1
2016-07-03 17:09:25.769947 7f8291dff480  1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap
2016-07-03 17:09:25.776148 7f8291dff480  0 mon.0@-1(probing).mds e1532 print_map
e1532
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}

Filesystem 'cephfs' (0)
fs_name cephfs
epoch   1530
flags   0
modified        2016-05-19 01:21:31.953710
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    1478
last_failure_osd_epoch  26431
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds 1
in      0
up      {0=1190233}
failed
damaged
stopped
data_pools      0
metadata_pool   1
inline_data     disabled
1190233: 192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145


Standby daemons:

1190222: 192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114

2016-07-03 17:09:25.776444 7f8291dff480  0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776450 7f8291dff480  0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776453 7f8291dff480  0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776454 7f8291dff480  0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires
2016-07-03 17:09:25.776696 7f8291dff480  1 mon.0@-1(probing).paxosservice(auth 19251..19344) refresh upgraded, format 0 -> 1
2016-07-03 17:09:25.777650 7f8291dff480  0 mon.0@-1(probing) e10 my rank is now 0 (was -1)
2016-07-03 17:09:25.778191 7f8281e89700  0 -- 192.168.2.1:6789/0 >>
192.168.2.5:6789/0 pipe(0x556e0ab1d360 sd=15 :0 s=1 pgs=0 cs=0 l=0
c=0x556e0ab1e620).fault
2016-07-03 17:09:25.778257 7f828208b700  0 -- 192.168.2.1:6789/0 >>
192.168.2.2:6789/0 pipe(0x556e0ab17c80 sd=13 :0 s=1 pgs=0 cs=0 l=0
c=0x556e0ab18f40).fault
2016-07-03 17:09:25.778314 7f8281d88700  0 -- 192.168.2.1:6789/0 >>
192.168.2.6:6789/0 pipe(0x556e0ab1fe50 sd=16 :0 s=1 pgs=0 cs=0 l=0
c=0x556e0ab21160).fault
2016-07-03 17:09:25.778364 7f8281f8a700  0 -- 192.168.2.1:6789/0 >>
192.168.2.4:6789/0 pipe(0x556e0ab1a7e0 sd=14 :0 s=1 pgs=0 cs=0 l=0
c=0x556e0ab1baa0).fault
2016-07-03 17:10:25.777442 7f828408f700 0 mon.0@0(probing).data_health(0)
update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:10:26.151363 7f8281486700  0 -- 192.168.2.1:6789/0 >>
192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853
(socket is 192.168.2.6:37584/0)
2016-07-03 17:10:26.543059 7f8281284700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f8264002f40 sd=26 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264004250).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38419/0)
2016-07-03 17:10:36.543026 7f8281284700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38437/0)
2016-07-03 17:10:46.151145 7f8281284700  0 -- 192.168.2.1:6789/0 >>
192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853
(socket is 192.168.2.6:37586/0)
2016-07-03 17:10:56.543087 7f8281284700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38475/0)
2016-07-03 17:11:06.542941 7f8281082700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38494/0)
2016-07-03 17:11:25.777920 7f828408f700 0 mon.0@0(probing).data_health(0)
update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:11:36.543002 7f8281284700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38556/0)
2016-07-03 17:11:41.382221 7f8281284700  0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58252/0)
2016-07-03 17:11:51.382170 7f8281284700  0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58253/0)
2016-07-03 17:12:06.543010 7f8281284700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38602/0)
2016-07-03 17:12:16.543013 7f8280f81700  0 -- 192.168.2.1:6789/0 >>
192.168.2.1:0/2736937026 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.1:0/2736937026
(socket is 192.168.2.1:38618/0)
2016-07-03 17:12:25.778424 7f828408f700 0 mon.0@0(probing).data_health(0)
update_stats avail 71% total 187 GB, used 44468 MB, avail 134 GB
2016-07-03 17:12:36.151275 7f8280f81700  0 -- 192.168.2.1:6789/0 >>
192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853
(socket is 192.168.2.6:37615/0)
2016-07-03 17:12:46.151131 7f8280f81700  0 -- 192.168.2.1:6789/0 >>
192.168.2.6:0/3148669853 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0
c=0x7f8264001bd0).accept peer addr is really 192.168.2.6:0/3148669853
(socket is 192.168.2.6:37627/0)
2016-07-03 17:12:50.102454 7f8281284700  0 -- 192.168.2.1:6789/0 >> 192.168.2.4:6789/0 pipe(0x7f82640055c0 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264006a80).accept connect_seq 0 vs existing 0 state connecting
2016-07-03 17:12:50.108055 7f828388e700  0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
2016-07-03 17:12:50.108235 7f828388e700  1 mon.0@0(electing).elector(12184) init, last seen epoch 12184
2016-07-03 17:12:50.116334 7f828388e700  0 log_channel(cluster) log [INF] : mon.0 calling new monitor election
2016-07-03 17:12:50.116402 7f828388e700  1 mon.0@0(electing).elector(12195) init, last seen epoch 12195
2016-07-03 17:12:51.382215 7f8280f81700  0 -- 192.168.2.1:6789/0 >> 192.168.2.2:0/976433073 pipe(0x7f82640008c0 sd=25 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f8264001bd0).accept peer addr is really 192.168.2.2:0/976433073 (socket is 192.168.2.2:58259/0)
2016-07-03 17:12:55.120599 7f828408f700  0 log_channel(cluster) log [INF] : mon.0@0 won leader election with quorum 0,2,3
2016-07-03 17:12:55.122490 7f828408f700  0 log_channel(cluster) log [INF] : HEALTH_ERR; 187 pgs are stuck inactive for more than 300 seconds; 114 pgs degraded; 138 pgs peering; 49 pgs stale; 138 pgs stuck inactive; 49 pgs stuck stale; 114 pgs stuck unclean; 114 pgs undersized; recovery 1019372/17575834 objects degraded (5.800%); too many PGs per OSD (449 > max 300); 1/4 in osds are down; noout flag(s) set; 2 mons down, quorum 0,2,3 0,3,4
2016-07-03 17:12:55.193094 7f828388e700  0 log_channel(cluster) log [INF] : monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
2016-07-03 17:12:55.193254 7f828388e700  0 log_channel(cluster) log [INF] : pgmap v17870289: 768 pgs: 49 stale+active+clean, 114 active+undersized+degraded, 138 peering, 467 active+clean; 7128 GB data, 16620 GB used, 4824 GB / 22356 GB avail; 1019372/17575834 objects degraded (5.800%)
2016-07-03 17:12:55.195553 7f828388e700 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)' thread 7f828388e700 time 2016-07-03 17:12:55.193360
mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x556e001f1e12]
 2: (MDSMonitor::maybe_promote_standby(std::shared_ptr<Filesystem>)+0x97f) [0x556dffede53f]
 3: (MDSMonitor::tick()+0x3b6) [0x556dffee0866]
 4: (MDSMonitor::on_active()+0x28) [0x556dffed9038]
 5: (PaxosService::_active()+0x66a) [0x556dffe5968a]
 6: (Context::complete(int)+0x9) [0x556dffe249a9]
 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x556dffe2ba7c]
 8: (Paxos::finish_round()+0xd0) [0x556dffe50460]
 9: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0x103d) [0x556dffe51acd]
 10: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x38c) [0x556dffe5254c]
 11: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd3b) [0x556dffe2245b]
 12: (Monitor::_ms_dispatch(Message*)+0x581) [0x556dffe22b91]
 13: (Monitor::ms_dispatch(Message*)+0x23) [0x556dffe41393]
 14: (DispatchQueue::entry()+0x7ba) [0x556e002e722a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x556e001d62cd]
 16: (()+0x74a4) [0x7f8290f904a4]
 17: (clone()+0x6d) [0x7f828f29298d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

If asked, I'll dump the rest of the log.
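
If a fuller trace would help, I can also rerun the mon with higher debug levels; something along these lines (the exact levels are just my guess at what would be useful):

# add to ceph.conf under [mon] on this host before restarting it:
#   debug mon = 20
#   debug ms = 1
# or run the daemon in the foreground with the same settings:
ceph-mon -i 0 -d --debug_mon 20 --debug_ms 1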



Bill Sharer
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
