Hi,

I'm currently upgrading our ceph cluster to 12.2.7. Most steps went fine, but all mgr instances abort shortly after being restarted:


....

   -10> 2018-08-01 09:57:46.357696 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 24 0x564cf4708c00 mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5
    -9> 2018-08-01 09:57:46.357715 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 24 ==== mgrreport(osd.70 +0-0 packed 742 osd_metrics=1) v5 ==== 784+0+0 (3768598180 0 0) 0x564cf4708c00 con 0x564cf2bf9000
    -8> 2018-08-01 09:57:46.357721 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2bf9000 osd,70
    -7> 2018-08-01 09:57:46.358255 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.137:6800/2921 conn(0x564cf2c90000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx osd.20 seq 25 0x564cf4da63c0 pg_stats(72 pgs tid 0 v 0) v1
    -6> 2018-08-01 09:57:46.358303 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.20 192.168.6.137:6800/2921 25 ==== pg_stats(72 pgs tid 0 v 0) v1 ==== 42756+0+0 (3715458660 0 0) 0x564cf4da63c0 con 0x564cf2c90000
    -5> 2018-08-01 09:57:46.358432 7fc481221700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.131:6814/2743 conn(0x564cf2bf9000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=94 cs=1 l=1). rx osd.70 seq 25 0x564cf4db2ec0 pg_stats(54 pgs tid 0 v 0) v1
    -4> 2018-08-01 09:57:46.358447 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== osd.70 192.168.6.131:6814/2743 25 ==== pg_stats(54 pgs tid 0 v 0) v1 ==== 32928+0+0 (3225946058 0 0) 0x564cf4db2ec0 con 0x564cf2bf9000
    -3> 2018-08-01 09:57:46.368820 7fc480a20700  5 -- 192.168.6.134:6856/5968 >> 192.168.6.135:0/1209706031 conn(0x564cf2f8d000 :6856 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=28 cs=1 l=1). rx client.19838915 seq 13 0x564cf44cd500 mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5
    -2> 2018-08-01 09:57:46.368880 7fc46bf8e700  1 -- 192.168.6.134:6856/5968 <== client.19838915 192.168.6.135:0/1209706031 13 ==== mgrreport(rgw.radosgw.gateway +0-0 packed 3382) v5 ==== 3425+0+0 (3985820496 0 0) 0x564cf44cd500 con 0x564cf2f8d000
    -1> 2018-08-01 09:57:46.368895 7fc46bf8e700  4 mgr.server handle_report from 0x564cf2f8d000 rgw,radosgw.gateway
     0> 2018-08-01 09:57:46.371034 7fc46bf8e700 -1 *** Caught signal (Aborted) **
 in thread 7fc46bf8e700 thread_name:ms_dispatch

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0x40e744) [0x564ce68e1744]
 2: (()+0x11390) [0x7fc484ede390]
 3: (gsignal()+0x38) [0x7fc483e6e428]
 4: (abort()+0x16a) [0x7fc483e7002a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fc4847b184d]
 6: (()+0x8d6b6) [0x7fc4847af6b6]
 7: (()+0x8d701) [0x7fc4847af701]
 8: (()+0x8d919) [0x7fc4847af919]
 9: (std::__throw_out_of_range(char const*)+0x3f) [0x7fc4847d82cf]
 10: (DaemonPerfCounters::update(MMgrReport*)+0x197c) [0x564ce6775dec]
 11: (DaemonServer::handle_report(MMgrReport*)+0x269) [0x564ce677e3d9]
 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x564ce678c5a7]
 13: (DispatchQueue::entry()+0xf4a) [0x564ce6c3baba]
 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x564ce69dcaed]
 15: (()+0x76ba) [0x7fc484ed46ba]
 16: (clone()+0x6d) [0x7fc483f4041d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
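
Frames 9 and 10 look like an unhandled std::out_of_range escaping DaemonPerfCounters::update(), i.e. a failed .at() lookup for a perf counter the mgr never saw declared. A minimal sketch of that failure mode (the names and data layout here are my guess for illustration, not the actual mgr code):

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the declared schema of one daemon's counters.
struct PerfCounterType {
    std::string path;  // e.g. "rgw.req"
};

int main() {
    // Suppose the mgr keeps the counter types a daemon declared, keyed by path.
    std::map<std::string, PerfCounterType> declared;
    declared["rgw.req"] = {"rgw.req"};

    // A later report references a path that was never declared -- for
    // example because a second process registered under the same daemon
    // name with a different counter schema.
    try {
        const PerfCounterType &t = declared.at("rgw.qlen");  // throws
        std::cout << "updating " << t.path << "\n";
    } catch (const std::out_of_range &e) {
        // On the mgr's dispatch path nothing catches this, so it would
        // reach __verbose_terminate_handler() and abort(), as in the trace.
        std::cerr << "out_of_range: " << e.what() << "\n";
    }
    return 0;
}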


The cause seems to be the RGW instances in our cluster. We use an HA setup with pacemaker and haproxy on three hosts; two different RGW setups serve internal and external users, i.e. each of the three hosts runs two RGW processes (six in total). As soon as two instances on two different hosts are active, the mgrs crash with the stack trace above.
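
Roughly, each host carries two RGW sections along these lines (instance names, hostnames and ports here are placeholders, not our real configuration):

[client.rgw.internal]
    host = rgw-host-1
    rgw frontends = civetweb port=7480

[client.rgw.external]
    host = rgw-host-1
    rgw frontends = civetweb port=7481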

I've reduced the number of active RGW instances to one to stop the mgr from crashing. Is this a regression in 12.2.7 for HA setups?


Regards,

Burkhard

