On Tue, Sep 12, 2017 at 2:11 PM, Katie Holly <[email protected]> wrote:
> All radosgw instances are running
>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> as Docker containers; there are 15 of them at any given time.
>
> The "config"/exec-args for the radosgw instances are:
>
> /usr/bin/radosgw \
>   -d \
>   --cluster=ceph \
>   --conf=/dev/null \
>   --debug-ms=0 \
>   --debug-rgw=0/0 \
>   --keyring=/etc/ceph/ceph.client.rgw.docker.keyring \
>   --logfile=/dev/null \
>   --mon-host=mon.ceph.fks.de.fvz.io \
>   --name=client.rgw.docker \
>   --rgw-content-length-compat=true \
>   --rgw-dns-name=de-fks-1.rgw.li \
>   --rgw-region=eu \
>   --rgw-zone=eu-de-fks-1 \
>   --setgroup=ceph \
>   --setuser=ceph
>
> Scaling this Docker radosgw cluster down to just 1 instance seems to
> allow ceph-mgr to run without issues, but as soon as I increase the
> number of radosgw instances, the risk of ceph-mgr crashing at any
> random time also increases.
>
> It seems that 2 radosgw instances are also fine; anything higher than
> that causes issues. Maybe a race condition?

Maybe. That at least narrows it down. Could you add this information to
the tracker please? The original description in the tracker appears to
show ceph-mgr segfaulting on a report from an MDS, so it's not
completely restricted to reports from rgws.
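One crude way to pin down the correlation would be to step the instance
count up and check the active mgr after each step. A rough sketch,
assuming the containers run as a Docker Swarm service -- the service
name "rgw" and the 60-second settle time below are placeholders, not
taken from your setup:

    # step the rgw instance count up one notch at a time
    # ("rgw" is a placeholder service name; substitute your own)
    for n in 1 2 3 5 10 15; do
        docker service scale rgw=$n
        sleep 60
        # "ceph -s" shows which mgr is active; if the mgr line goes
        # empty or changes after a step, the mgr died at that count
        ceph -s | grep -A1 'mgr:'
    done

If the crash reliably starts at 3+ instances, that would be a strong
hint it's concurrent reports racing rather than any single rgw's payload.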
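It would also help to resolve the anonymous frames in the backtrace
quoted below (e.g. "1: (()+0x3de6b4)"). The offset inside (()+0x...) is
relative to the module's load base, so something along these lines
should work against the installed binary, assuming the matching debug
symbols for ceph-mgr are present (the exact debug package name varies
by distro; without symbols addr2line will just print ??):

    # -f prints the enclosing function name, -C demangles C++ symbols;
    # the offsets are the unresolved ceph-mgr frames 1, 9 and 10
    addr2line -e /usr/bin/ceph-mgr -f -C 0x3de6b4 0x2318ad 0x3e91bd

That should put names to the frames the trace itself couldn't resolve.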
>
> --
> Katie
>
> On 2017-09-12 05:24, Brad Hubbard wrote:
>> It seems like it's choking on the report from the rados gateway. What
>> version is the rgw node running?
>>
>> If possible, could you shut down the rgw and see if you can then
>> start ceph-mgr?
>>
>> Pure stab in the dark, just to see if the problem is tied to the rgw
>> instance.
>>
>> On Tue, Sep 12, 2017 at 1:07 PM, Katie Holly <[email protected]> wrote:
>>> Thanks, I totally forgot to check the tracker. I added the
>>> information I collected there, but I don't have enough experience
>>> with Ceph to dig through this myself, so let's see if someone is
>>> willing to sacrifice their free time to help debug this issue.
>>>
>>> --
>>> Katie
>>>
>>> On 2017-09-12 03:15, Brad Hubbard wrote:
>>>> Looks like there is a tracker issue open for this:
>>>>
>>>> http://tracker.ceph.com/issues/21197
>>>>
>>>> Please add your details there.
>>>>
>>>> On Tue, Sep 12, 2017 at 11:04 AM, Katie Holly <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I recently upgraded one of our clusters from Kraken to Luminous
>>>>> (the cluster was initialized with Jewel) on Ubuntu 16.04 and
>>>>> deployed ceph-mgr on all of our ceph-mon nodes with ceph-deploy.
>>>>>
>>>>> Related log entries after the initial deployment of ceph-mgr:
>>>>>
>>>>> 2017-09-11 06:41:53.535025 7fb5aa7b8500  0 set uid:gid to 64045:64045 (ceph:ceph)
>>>>> 2017-09-11 06:41:53.535048 7fb5aa7b8500  0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031
>>>>> 2017-09-11 06:41:53.536853 7fb5aa7b8500  0 pidfile_write: ignore empty --pid-file
>>>>> 2017-09-11 06:41:53.541880 7fb5aa7b8500  1 mgr send_beacon standby
>>>>> 2017-09-11 06:41:54.547383 7fb5a1aec700  1 mgr handle_mgr_map Activating!
>>>>> 2017-09-11 06:41:54.547575 7fb5a1aec700  1 mgr handle_mgr_map I am now activating
>>>>> 2017-09-11 06:41:54.650677 7fb59dae4700  1 mgr start Creating threads for 0 modules
>>>>> 2017-09-11 06:41:54.650696 7fb59dae4700  1 mgr send_beacon active
>>>>> 2017-09-11 06:41:55.542252 7fb59eae6700  1 mgr send_beacon active
>>>>> 2017-09-11 06:41:55.542627 7fb59eae6700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>>>> 2017-09-11 06:41:57.542697 7fb59eae6700  1 mgr send_beacon active
>>>>> [... lots of "send_beacon active" messages ...]
>>>>> 2017-09-11 07:29:29.640892 7fb59eae6700  1 mgr send_beacon active
>>>>> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7fb59d2e3700 thread_name:ms_dispatch
>>>>>
>>>>>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>>>  1: (()+0x3de6b4) [0x55f6640e16b4]
>>>>>  2: (()+0x11390) [0x7fb5a8fef390]
>>>>>  3: (gsignal()+0x38) [0x7fb5a7f7f428]
>>>>>  4: (abort()+0x16a) [0x7fb5a7f8102a]
>>>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
>>>>>  6: (()+0x8d6b6) [0x7fb5a88c06b6]
>>>>>  7: (()+0x8d701) [0x7fb5a88c0701]
>>>>>  8: (()+0x8d919) [0x7fb5a88c0919]
>>>>>  9: (()+0x2318ad) [0x55f663f348ad]
>>>>>  10: (()+0x3e91bd) [0x55f6640ec1bd]
>>>>>  11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
>>>>>  12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
>>>>>  13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
>>>>>  14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
>>>>>  15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
>>>>>  16: (()+0x76ba) [0x7fb5a8fe56ba]
>>>>>  17: (clone()+0x6d) [0x7fb5a80513dd]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>> --- begin dump of recent events ---
>>>>> [...]
>>>>>
>>>>> I tried to manually run ceph-mgr with
>>>>>> /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph
>>>>> which exits again within a few seconds.
>>>>> stdout: http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txt
>>>>> logs: http://xor.meo.ws/jcMyjabCfFbTcfZ8GOangLdSfSSqJffr.txt
>>>>> objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt
>>>>>
>>>>> Has anyone seen an issue like this before, and does anyone know how
>>>>> to debug or even fix it?
>>>>>
>>>>> --
>>>>> Katie

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
