On Tue, Sep 12, 2017 at 2:11 PM, Katie Holly <[email protected]> wrote:
> All radosgw instances are running
>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> as Docker containers; there are 15 of them at any given time
>
>
> The "config"/exec-args for the radosgw instances are:
>
> /usr/bin/radosgw \
>   -d \
>   --cluster=ceph \
>   --conf=/dev/null \
>   --debug-ms=0 \
>   --debug-rgw=0/0 \
>   --keyring=/etc/ceph/ceph.client.rgw.docker.keyring \
>   --logfile=/dev/null \
>   --mon-host=mon.ceph.fks.de.fvz.io \
>   --name=client.rgw.docker \
>   --rgw-content-length-compat=true \
>   --rgw-dns-name=de-fks-1.rgw.li \
>   --rgw-region=eu \
>   --rgw-zone=eu-de-fks-1 \
>   --setgroup=ceph \
>   --setuser=ceph
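>
> (For context, each container is launched roughly like the following; the
> image name and volume mount here are illustrative, not our exact setup:
>
> docker run -d --net=host \
>   -v /etc/ceph:/etc/ceph:ro \
>   our-registry/radosgw:luminous \
>   /usr/bin/radosgw ...    # exec-args exactly as above
> )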
>
>
> Scaling this Docker radosgw cluster down to just 1 instance seems to allow
> ceph-mgr to run without issues, but as soon as I increase the number of
> radosgw instances, the risk of ceph-mgr crashing at any random time also
> increases.
>
> It seems that 2 radosgw instances are also fine; anything higher than that
> causes issues. Maybe a race condition?
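>
> ("Scaling" here just means something like the following; the swarm
> service name is made up for this example:
>
> docker service scale rgw=1     # ceph-mgr stays stable
> docker service scale rgw=15    # crashes start again
> )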

Maybe. That at least narrows it down.

Could you add this information to the tracker please? The original
description in the tracker appears to show ceph-mgr segfaulting on a
report from an MDS, so it's not completely restricted to reports from
rgws.

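If you want to poke at the trace yourself in the meantime, the anonymous
frames can usually be resolved against the binary once the matching debug
symbols are installed. A rough sketch, assuming the default install path:

  # Full annotated disassembly, as the NOTE in the trace suggests:
  objdump -rdS /usr/bin/ceph-mgr > ceph-mgr.dump

  # Map the ()+0x... offsets from the trace to function names and lines:
  addr2line -Cfe /usr/bin/ceph-mgr 0x3de6b4 0x2318ad 0x3e91bd
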
>
> --
> Katie
> On 2017-09-12 05:24, Brad Hubbard wrote:
>> It seems like it's choking on the report from the rados gateway. What
>> version is the rgw node running?
>>
>> If possible, could you shut down the rgw and see if you can then start 
>> ceph-mgr?
>>
>> Pure stab in the dark just to see if the problem is tied to the rgw instance.
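>>
>> If the crash still reproduces, running ceph-mgr in the foreground with
>> more verbose logging might show which report it chokes on. Something
>> along these lines; the log levels are just a guess at something useful:
>>
>> /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME \
>>     --debug-mgr 20 --debug-ms 1 --setuser ceph --setgroup ceph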
>>
>> On Tue, Sep 12, 2017 at 1:07 PM, Katie Holly <[email protected]> wrote:
>>> Thanks, I totally forgot to check the tracker. I added the information I
>>> collected there, but I don't have enough experience with ceph to dig through
>>> this myself, so let's see if someone is willing to sacrifice their free time
>>> to help debug this issue.
>>>
>>> --
>>> Katie
>>>
>>> On 2017-09-12 03:15, Brad Hubbard wrote:
>>>> Looks like there is a tracker opened for this.
>>>>
>>>> http://tracker.ceph.com/issues/21197
>>>>
>>>> Please add your details there.
>>>>
>>>> On Tue, Sep 12, 2017 at 11:04 AM, Katie Holly <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I recently upgraded one of our clusters from Kraken to Luminous (the 
>>>>> cluster was initialized with Jewel) on Ubuntu 16.04 and deployed ceph-mgr 
>>>>> on all of our ceph-mon nodes with ceph-deploy.
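>>>>>
>>>>> (For reference, ceph-mgr was deployed with something like the
>>>>> following; the mon hostnames are placeholders:
>>>>>
>>>>> ceph-deploy mgr create mon1 mon2 mon3
>>>>> )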
>>>>>
>>>>> Related log entries after initial deployment of ceph-mgr:
>>>>>
>>>>> 2017-09-11 06:41:53.535025 7fb5aa7b8500  0 set uid:gid to 64045:64045 
>>>>> (ceph:ceph)
>>>>> 2017-09-11 06:41:53.535048 7fb5aa7b8500  0 ceph version 12.2.0 
>>>>> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process 
>>>>> (unknown), pid 17031
>>>>> 2017-09-11 06:41:53.536853 7fb5aa7b8500  0 pidfile_write: ignore empty 
>>>>> --pid-file
>>>>> 2017-09-11 06:41:53.541880 7fb5aa7b8500  1 mgr send_beacon standby
>>>>> 2017-09-11 06:41:54.547383 7fb5a1aec700  1 mgr handle_mgr_map Activating!
>>>>> 2017-09-11 06:41:54.547575 7fb5a1aec700  1 mgr handle_mgr_map I am now 
>>>>> activating
>>>>> 2017-09-11 06:41:54.650677 7fb59dae4700  1 mgr start Creating threads for 
>>>>> 0 modules
>>>>> 2017-09-11 06:41:54.650696 7fb59dae4700  1 mgr send_beacon active
>>>>> 2017-09-11 06:41:55.542252 7fb59eae6700  1 mgr send_beacon active
>>>>> 2017-09-11 06:41:55.542627 7fb59eae6700  1 mgr.server send_report Not 
>>>>> sending PG status to monitor yet, waiting for OSDs
>>>>> 2017-09-11 06:41:57.542697 7fb59eae6700  1 mgr send_beacon active
>>>>> [... lots of "send_beacon active" messages ...]
>>>>> 2017-09-11 07:29:29.640892 7fb59eae6700  1 mgr send_beacon active
>>>>> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7fb59d2e3700 thread_name:ms_dispatch
>>>>>
>>>>>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous 
>>>>> (rc)
>>>>>  1: (()+0x3de6b4) [0x55f6640e16b4]
>>>>>  2: (()+0x11390) [0x7fb5a8fef390]
>>>>>  3: (gsignal()+0x38) [0x7fb5a7f7f428]
>>>>>  4: (abort()+0x16a) [0x7fb5a7f8102a]
>>>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
>>>>>  6: (()+0x8d6b6) [0x7fb5a88c06b6]
>>>>>  7: (()+0x8d701) [0x7fb5a88c0701]
>>>>>  8: (()+0x8d919) [0x7fb5a88c0919]
>>>>>  9: (()+0x2318ad) [0x55f663f348ad]
>>>>>  10: (()+0x3e91bd) [0x55f6640ec1bd]
>>>>>  11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
>>>>>  12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
>>>>>  13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
>>>>>  14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
>>>>>  15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
>>>>>  16: (()+0x76ba) [0x7fb5a8fe56ba]
>>>>>  17: (clone()+0x6d) [0x7fb5a80513dd]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
>>>>> to interpret this.
>>>>>
>>>>> --- begin dump of recent events ---
>>>>> [...]
>>>>>
>>>>>
>>>>> I tried to manually run ceph-mgr with
>>>>>> /usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph 
>>>>>> --setgroup ceph
>>>>> which starts but crashes again within a few seconds.
>>>>> stdout: http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txt
>>>>> logs: http://xor.meo.ws/jcMyjabCfFbTcfZ8GOangLdSfSSqJffr.txt
>>>>> objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt
>>>>>
>>>>> Has anyone seen an issue like this before, and does anyone know how to
>>>>> debug or even fix it?
>>>>>
>>>>>
>>>>> --
>>>>> Katie



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
