[ceph-users] Re: All MGR loop crash

2024-03-07 Thread Eugen Block

Thanks! That's very interesting to know!

Quoting "David C." :


Some monitors have existed for many years (weight 10); others were added
later (weight 0).

=> https://github.com/ceph/ceph/commit/2d113dedf851995e000d3cce136b69bfa94b6fe0

[ceph-users] Re: All MGR loop crash

2024-03-07 Thread Dieter Roels

We ran into this issue last week when upgrading to Quincy. We asked ourselves
the same question: how did the weight change? We did not even know that was
a thing.

We checked our other clusters and we have some where all the mons have a weight 
of 10, and there it is not an issue.  So only certain combinations of weights 
cause this crash.
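
A quick way to audit a cluster for that mixed-weight condition is to compare
the weights in the monmap (a sketch, assuming jq is available; the filter is
illustrative, not something from this thread):

# print each mon's name and weight, then flag mixed weights
ceph mon dump -f json | jq -r '.mons[] | "\(.name) \(.weight)"'
ceph mon dump -f json | jq -r '[.mons[].weight] | unique
  | if length > 1 then "WARNING: mixed mon weights" else "mon weights are uniform" end'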

regards,

Dieter Roels

-Original Message-
From: Eugen Block 
Sent: Thursday, 7 March 2024 19:13
To: ceph-users@ceph.io
Subject: [ceph-users] Re: All MGR loop crash



I’m curious how the weights might have been changed. I’ve never touched a mon 
weight myself, do you know how that happened?





[ceph-users] Re: All MGR loop crash

2024-03-07 Thread Eugen Block
I’m curious how the weights might have been changed. I’ve never  
touched a mon weight myself, do you know how that happened?




[ceph-users] Re: All MGR loop crash

2024-03-07 Thread David C.
Ok, got it:

[root@pprod-admin:/var/lib/ceph/]# ceph mon dump -f json-pretty | egrep "name|weigh"
dumped monmap epoch 14
"min_mon_release_name": "quincy",
"name": "pprod-mon2",
"weight": 10,
"name": "pprod-mon3",
"weight": 10,
"name": "pprod-osd2",
"weight": 0,
"name": "pprod-osd1",
"weight": 0,
"name": "pprod-osd3",
"weight": 0,

ceph mon set-weight pprod-mon2 0
ceph mon set-weight pprod-mon3 0

And restart ceph-mgr.
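
With more than a couple of mons, the same fix can be scripted (a sketch:
"ceph orch restart" assumes a cephadm deployment, otherwise restart the
ceph-mgr units manually, and jq is assumed available):

# set every mon's weight back to 0, then bounce the managers
for m in $(ceph mon dump -f json | jq -r '.mons[].name'); do
  ceph mon set-weight "$m" 0
done
ceph orch restart mgr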



[ceph-users] Re: All MGR loop crash

2024-03-07 Thread David C.
I took the wrong line =>
https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc#L822


On Thu, Mar 7, 2024 at 18:21, David C. wrote:

>
> Hello everybody,
>
> I'm encountering strange behavior on an infrastructure (it's
> pre-production, but it's very ugly). After a "drain" of a monitor (and a
> manager), the MGRs all crash on startup:
>
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 standby mgrmap(e 1310) v1
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map received map epoch 1310
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map active in map: 1 active is 99148504
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map Activating!
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map I am now activating
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 mgrmap(e 1310) v1
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Got map version 1310
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Active mgr is now
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc reconnect No active mgr available yet
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr init waiting for OSDMap...
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _renew_subs
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _send_mon_message to mon.idf-pprod-osd3 at v2:X.X.X.X:3300/0
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _reopen_session rank -1
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: *** Caught signal (Aborted) **
>   in thread 7f9a07a27640 thread_name:mgr-fin
>
>   ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
>   1: /lib64/libc.so.6(+0x54db0) [0x7f9a2364ddb0]
>   2: /lib64/libc.so.6(+0xa154c) [0x7f9a2369a54c]
>   3: raise()
>   4: abort()
>   5: /usr/lib64/ceph/libceph-common.so.2(+0x1c1fa8) [0x7f9a23ce2fa8]
>   6: /usr/lib64/ceph/libceph-common.so.2(+0x25) [0x7f9a23f65425]
>   7: /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>   8: /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>   9: (MonClient::_add_conns()+0x242) [0x7f9a23f5fa42]
>   10: (MonClient::_reopen_session(int)+0x428) [0x7f9a23f60518]
>   11: (Mgr::init()+0x384) [0x5604667a6434]
>   12: /usr/bin/ceph-mgr(+0x1af271) [0x5604667ae271]
>   13: /usr/bin/ceph-mgr(+0x11364d) [0x56046671264d]
>   14: (Finisher::finisher_thread_entry()+0x175) [0x7f9a23d10645]
>   15: /lib64/libc.so.6(+0x9f802) [0x7f9a23698802]
>   16: /lib64/libc.so.6(+0x3f450) [0x7f9a23638450]
>   NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
>
> I have the impression that the MGRs are being ejected by the monitors, but
> after debugging the monitors, I don't see anything abnormal on the monitor
> side (unless I've missed something).
>
> All we can see is that we get an exception in the "_add_conns" method
> (https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc#L775)
>
> Version: 17.2.6-170.el9cp (RHCS 6)
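
Since the monitor side looks clean, one way to get more context before the
abort is to raise the client-side (monclient) debug level on the crashing
mgr (a sketch; level 20 is very verbose, and the exact systemd unit name
depends on the deployment):

# capture the monclient's rank/weight handling during the start/abort loop
ceph config set mgr debug_monc 20
ceph config set mgr debug_ms 1
journalctl -u 'ceph-*@mgr.*' -f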
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io