[ceph-users] Re: Advice needed: stuck cluster halfway upgraded, comms issues and MON space usage

Dan van der Ster Tue, 23 Mar 2021 00:30:44 -0700

Hi Sam,

Yeah, somehow `lo` is not getting skipped, probably due to those
patches. (I guess it is because the 2nd patch looks for `lo:`, but in
fact the ifa_name is probably just `lo`, without the colon.)


    https://github.com/ceph/ceph/blob/master/src/common/ipaddr.cc#L110
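
To make that guess concrete, here is a minimal Python sketch of the two
prefix checks as they are described in the commit messages quoted below.
It only mimics the string comparison; the function names are made up for
illustration, `lo:0` is a hypothetical alias name, and none of this is the
actual code in ipaddr.cc.

```python
def skipped_by_first_patch(ifa_name: str) -> bool:
    # Mimics the check described in the first patch:
    # skip anything whose name starts with "lo".
    return ifa_name.startswith("lo")


def skipped_by_second_patch(ifa_name: str) -> bool:
    # Mimics the check described in the "Allow binding on lo" patch:
    # skip only names starting with "lo:" (e.g. an alias such as "lo:0").
    return ifa_name.startswith("lo:")


# Interface names from this thread, plus a hypothetical alias "lo:0".
names = ["lo", "lo:0", "p2p2", "ens785f0"]
results = {
    n: (skipped_by_first_patch(n), skipped_by_second_patch(n)) for n in names
}

for name, (first, second) in results.items():
    print(f"{name:10s} 'lo' check skips: {first!s:5s}  'lo:' check skips: {second}")
```

If the ifa_name really is just `lo`, the second check no longer skips it,
which would match what Sam is seeing.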

I don't know why this impacts you but not us -- we already upgraded
one of our clusters to 14.2.18 on CentOS 8, and Ceph is choosing the
correct interface without needing any network options. And `lo` is the
first interface [1] here too.
Could it be as simple as the iface names being sorted alphabetically?
Here we have ens785f0 which would come before lo, but your interface
`p2p2` would come after.
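
A quick Python check of the string-ordering part of that idea, using the
interface names from the two `ip a` listings in this thread. This is only
a sketch: whether anything in Ceph or the OS actually orders interfaces
this way is exactly the open question, and the helper name is made up.

```python
# Interface names taken from the two `ip a` listings in this thread.
ours = ["lo", "eno1", "ens785f0", "ens785f1", "eno2"]
yours = ["lo", "em1", "em2", "p2p1", "p2p2"]


def position_relative_to_lo(names, active):
    # Sort the names as plain strings and report whether the active
    # interface lands before or after "lo" in that ordering.
    ordered = sorted(names)
    return "before" if ordered.index(active) < ordered.index("lo") else "after"


ours_result = position_relative_to_lo(ours, "ens785f0")
yours_result = position_relative_to_lo(yours, "p2p2")
print("ens785f0 sorts", ours_result, "lo")
print("p2p2 sorts", yours_result, "lo")
```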

-- dan

[1]
 # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
DOWN group default qlen 1000
    link/ether a4:bf:01:60:67:a0 brd ff:ff:ff:ff:ff:ff
3: ens785f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
UP group default qlen 1000
    link/ether 0c:42:a1:ad:36:9a brd ff:ff:ff:ff:ff:ff
    inet 10.116.6.8/26 brd 10.116.6.63 scope global dynamic
noprefixroute ens785f0
       valid_lft 432177sec preferred_lft 432177sec
    inet6 fd01:1458:e00:1e::100:5/128 scope global dynamic noprefixroute
       valid_lft 513502sec preferred_lft 513502sec
    inet6 fe80::bdbd:76be:63fd:a4c2/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
4: ens785f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq
state DOWN group default qlen 1000
    link/ether 0c:42:a1:ad:36:9b brd ff:ff:ff:ff:ff:ff
5: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
DOWN group default qlen 1000
    link/ether a4:bf:01:60:67:a1 brd ff:ff:ff:ff:ff:ff
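
To see which OSDs have registered a loopback address, the address strings
can be screened with a few lines of Python. This is a rough sketch that
assumes the addresses look like the
`v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667` example quoted
further down; it handles IPv4 only, the helper name is made up, and the
second input string is a hypothetical healthy example.

```python
import ipaddress
import re


def loopback_entries(addrvec: str):
    # Pull "ip:port" pairs out of strings like "v2:127.0.0.1:6881/123,v1:..."
    # and keep those whose IP is a loopback address (IPv4 only in this sketch).
    found = []
    for ip, port in re.findall(r"v[12]:(\d+\.\d+\.\d+\.\d+):(\d+)/", addrvec):
        if ipaddress.ip_address(ip).is_loopback:
            found.append(f"{ip}:{port}")
    return found


# Address string as it appears in the OSD log quoted in this thread.
sample = "v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667"
# Hypothetical example of an OSD bound to the real interface address.
healthy = "v2:10.1.50.21:6800/1,v1:10.1.50.21:6801/1"

print(loopback_entries(sample))
print(loopback_entries(healthy))
```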

On Mon, Mar 22, 2021 at 8:35 PM Sam Skipsey <[email protected]> wrote:
>
> Hi Dan:
>
> Aha - I think the first commit is probably it - before that commit, the fact 
> that lo is highest in the interfaces enumeration didn't matter for us [since 
> it would always be skipped].
>
> This is almost certainly also related to that other site with a 
> similar problem (OSDs drop out until you restart the network interface), 
> since I imagine a restart would reorder the interface list.
>
> Playing with our public and cluster bind address explicitly does seem to 
> help, so we'll iterate on that and get to a suitable ceph.conf.
>
> Thanks for the help [and it was the network all along]!
>
>
> Sam
>
> On Mon, 22 Mar 2021 at 19:12, Dan van der Ster <[email protected]> wrote:
>>
>> There are two commits between 14.2.16 and 14.2.18 related to loopback 
>> network. Perhaps one of these is responsible for your issue [1].
>>
>> I'd try playing with the options like cluster/public bind addr and 
>> cluster/public bind interface until you can convince the osd to bind to the 
>> correct listening IP.
>>
>> (That said, I don't know which version you were running in the logs shared 
>> earlier. But I think you should try to get 14.2.18 working anyway.)
>>
>> .. dan
>>
>> [1]
>>
>> > git log v14.2.18...v14.2.16 ipaddr.cc
>> commit 89321762ad4cfdd1a68cae467181bdd1a501f14d
>> Author: Thomas Goirand <[email protected]>
>> Date:   Fri Jan 15 10:50:05 2021 +0100
>>
>>     common/ipaddr: Allow binding on lo
>>
>>     Commmit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b, solves the issue that
>>     the osd can't restart after seting a virtual local loopback IP. However,
>>     this commit also prevents a bgp-to-the-host over unumbered Ipv6
>>     local-link is setup, where OSD typically are bound to the lo interface.
>>
>>     To solve this, this single char patch simply checks against "lo:" to
>>     match only virtual interfaces instead of anything that starts with "lo".
>>
>>     Fixes: https://tracker.ceph.com/issues/48893
>>     Signed-off-by: Thomas Goirand <[email protected]>
>>     (cherry picked from commit 201b59204374ebdab91bb554b986577a97b19c36)
>>
>> commit b52cae90d67eb878b3ddfe547b8bf16e0d4d1a45
>> Author: lijaiwei1 <[email protected]>
>> Date:   Tue Dec 24 22:34:46 2019 +0800
>>
>>     common: skip interfaces starting with "lo" in find_ipv{4,6}_in_subnet()
>>
>>     This will solve the issue that the osd can't restart after seting a
>>     virtual local loopback IP.
>>     In find_ipv4_in_subnet() and find_ipv6_in_subnet(), I use
>>     boost::starts_with(addrs->ifa_name, "lo") to ship the interfaces
>>     starting with "lo".
>>
>>     Fixes: https://tracker.ceph.com/issues/43417
>>     Signed-off-by: Jiawei Li <[email protected]>
>>     (cherry picked from commit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b)
>>
>>
>>
>>
>>
>> On Mon, Mar 22, 2021, 7:42 PM Sam Skipsey <[email protected]> wrote:
>>>
>>> I don't think we explicitly set any ms settings in the OSD host ceph.conf 
>>> [all the OSDs ceph.confs are identical across the entire cluster].
>>>
>>> ip a gives:
>>>
>>>  ip a
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
>>> default qlen 1000
>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     inet 127.0.0.1/8 scope host lo
>>>        valid_lft forever preferred_lft forever
>>>     inet6 ::1/128 scope host
>>>        valid_lft forever preferred_lft forever
>>> 2: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN 
>>> group default qlen 1000
>>>     link/ether 4c:d9:8f:55:92:f6 brd ff:ff:ff:ff:ff:ff
>>> 3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN 
>>> group default qlen 1000
>>>     link/ether 4c:d9:8f:55:92:f7 brd ff:ff:ff:ff:ff:ff
>>> 4: p2p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN 
>>> group default qlen 1000
>>>     link/ether b4:96:91:3f:62:20 brd ff:ff:ff:ff:ff:ff
>>> 5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group 
>>> default qlen 1000
>>>     link/ether b4:96:91:3f:62:22 brd ff:ff:ff:ff:ff:ff
>>>     inet 10.1.50.21/8 brd 10.255.255.255 scope global noprefixroute p2p2
>>>        valid_lft forever preferred_lft forever
>>>     inet6 fe80::b696:91ff:fe3f:6222/64 scope link noprefixroute
>>>        valid_lft forever preferred_lft forever
>>>
>>> (where here p2p2 is the only active network link, and is also the private 
>>> and public network for the ceph cluster)
>>>
>>> The output is similar on other hosts - with p2p2 either at position 3 or 5 
>>> depending on the order the interfaces were enumerated.
>>>
>>> Sam
>>>
>>> On Mon, 22 Mar 2021 at 17:34, Dan van der Ster <[email protected]> wrote:
>>>>
>>>> Which `ms` settings do you have in the OSD host's ceph.conf or the ceph 
>>>> config dump?
>>>>
>>>> And how does `ip a` look on one of these hosts where the osd is 
>>>> registering itself as 127.0.0.1?
>>>>
>>>>
>>>> You might as well set nodown again now. This will make ops pile up, but 
>>>> that's the least of your concerns at the moment.
>>>> (With osds flapping the osdmaps churn and that inflates the mon store)
>>>>
>>>> .. Dan
>>>>
>>>> On Mon, Mar 22, 2021, 6:28 PM Sam Skipsey <[email protected]> wrote:
>>>>>
>>>>> Hm, yes it does [and I was wondering why loopbacks were showing up 
>>>>> suddenly in the logs]. This wasn't happening with 14.2.16 so what's 
>>>>> changed about how we specify stuff?
>>>>>
>>>>> This might correlate with the other person on the IRC list who has 
>>>>> problems with 14.2.18 and their OSDs deciding they don't work sometimes 
>>>>> until they forcibly restart their network links...
>>>>>
>>>>>
>>>>> Sam
>>>>>
>>>>> On Mon, 22 Mar 2021 at 17:20, Dan van der Ster <[email protected]> 
>>>>> wrote:
>>>>>>
>>>>>> What's with the OSDs having loopback addresses? E.g. 
>>>>>> v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667
>>>>>>
>>>>>> Does `ceph osd dump` show those same loopback addresses for each OSD?
>>>>>>
>>>>>> This sounds familiar... I'm trying to find the recent ticket.
>>>>>>
>>>>>> .. dan
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <[email protected]> wrote:
>>>>>>>
>>>>>>> hi Dan:
>>>>>>>
>>>>>>> So, unsetting nodown results in... almost all of the OSDs being marked 
>>>>>>> down. (231 down out of 328).
>>>>>>> Checking the actual OSD services, most of them were actually up and 
>>>>>>> active on the nodes, even when the mons had marked them down.
>>>>>>> (On a few nodes, the down services corresponded to OSDs that had been 
>>>>>>> flapping - but increasing osd_max_markdown locally to keep them up 
>>>>>>> despite the previous flapping, and restarting the services... didn't 
>>>>>>> help.)
>>>>>>>
>>>>>>> In fact, starting up the few OSD services which had actually stopped, 
>>>>>>> resulted in a different set of OSDs being marked down, and some others 
>>>>>>> coming up.
>>>>>>> We currently have a sort of "rolling OSD outness" passing through the 
>>>>>>> cluster - there's always ~230 OSDs marked down now, but which ones 
>>>>>>> those are changes (we've had everything from 1 HOST down to 4 HOSTS 
>>>>>>> down over the past 14 minutes as things fluctuate).
>>>>>>>
>>>>>>> A log from one of the "down" OSDs [which is actually running, and on 
>>>>>>> the same host as OSDs which are marked up] shows this worrying snippet
>>>>>>>
>>>>>>> 2021-03-22 17:01:45.298 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:45.298 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:46.340 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:46.340 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:47.376 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:47.376 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:48.395 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:48.395 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:49.407 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:49.407 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:50.400 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:50.400 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> 
>>>>>>> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] 
>>>>>>> conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING 
>>>>>>> pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer 
>>>>>>> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr 
>>>>>>> V1 protocol
>>>>>>> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> 
>>>>>>> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] 
>>>>>>> conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING 
>>>>>>> pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer 
>>>>>>> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr 
>>>>>>> V1 protocol
>>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> 
>>>>>>> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] 
>>>>>>> conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING 
>>>>>>> pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer 
>>>>>>> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr 
>>>>>>> V1 protocol
>>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> 
>>>>>>> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] 
>>>>>>> conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING 
>>>>>>> pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer 
>>>>>>> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 
>>>>>>> protocol
>>>>>>> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> 
>>>>>>> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] 
>>>>>>> conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING 
>>>>>>> pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer 
>>>>>>> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr 
>>>>>>> V1 protocol
>>>>>>> 2021-03-22 17:01:51.377 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:51.377 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:52.370 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:52.370 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:53.377 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:53.377 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:54.385 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:54.385 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:55.385 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:55.385 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:56.362 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:56.362 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>> 2021-03-22 17:01:57.324 7f6c9c883700  1 osd.127 253515 is_healthy false 
>>>>>>> -- only 0/10 up peers (less than 33%)
>>>>>>> 2021-03-22 17:01:57.324 7f6c9c883700  1 osd.127 253515 not healthy; 
>>>>>>> waiting to boot
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Sam
>>>>>>>
>>>>>>> P.S. an example ceph status as it is now [with everything now on 
>>>>>>> 14.2.18, since we had to restart osds anyway]:
>>>>>>>
>>>>>>>  cluster:
>>>>>>>     id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
>>>>>>>     health: HEALTH_WARN
>>>>>>>             pauserd,pausewr,noout,nobackfill,norebalance flag(s) set
>>>>>>>             230 osds down
>>>>>>>             4 hosts (80 osds) down
>>>>>>>             Reduced data availability: 2048 pgs inactive
>>>>>>>             8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has 
>>>>>>> slow ops
>>>>>>>
>>>>>>>   services:
>>>>>>>     mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h)
>>>>>>>     mgr: cephs01(active, since 77m)
>>>>>>>     osd: 329 osds: 98 up (since 4s), 328 in (since 4d)
>>>>>>>          flags pauserd,pausewr,noout,nobackfill,norebalance
>>>>>>>
>>>>>>>   data:
>>>>>>>     pools:   3 pools, 2048 pgs
>>>>>>>     objects: 0 objects, 0 B
>>>>>>>     usage:   0 B used, 0 B / 0 B avail
>>>>>>>     pgs:     100.000% pgs unknown
>>>>>>>              2048 unknown
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <[email protected]> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I would unset nodown (hiding osd failures) and norecover (blocking PGs
>>>>>>>> from recovering degraded objects), then begin starting OSDs.
>>>>>>>> As soon as you have some osd logs reporting some failures, then share 
>>>>>>>> those...
>>>>>>>>
>>>>>>>> - Dan
>>>>>>>>
>>>>>>>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > So, we started the mons and mgr up again, and here's the relevant 
>>>>>>>> > logs, including also ceph versions. We've also turned off all of the 
>>>>>>>> > firewalls on all of the nodes so we know that there can't be network 
>>>>>>>> > issues [and, indeed, all of our management of the OSDs happens via 
>>>>>>>> > logins from the service nodes or to each other]
>>>>>>>> >
>>>>>>>> > > ceph status
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >   cluster:
>>>>>>>> >     id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
>>>>>>>> >     health: HEALTH_WARN
>>>>>>>> >             
>>>>>>>> > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover 
>>>>>>>> > flag(s) set
>>>>>>>> >             1 nearfull osd(s)
>>>>>>>> >             3 pool(s) nearfull
>>>>>>>> >             Reduced data availability: 2048 pgs inactive
>>>>>>>> >             mons cephs01,cephs02,cephs03 are using a lot of disk 
>>>>>>>> > space
>>>>>>>> >
>>>>>>>> >   services:
>>>>>>>> >     mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s)
>>>>>>>> >     mgr: cephs01(active, since 76s)
>>>>>>>> >     osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 
>>>>>>>> > remapped pgs
>>>>>>>> >          flags 
>>>>>>>> > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
>>>>>>>> >
>>>>>>>> >   data:
>>>>>>>> >     pools:   3 pools, 2048 pgs
>>>>>>>> >     objects: 0 objects, 0 B
>>>>>>>> >     usage:   0 B used, 0 B / 0 B avail
>>>>>>>> >     pgs:     100.000% pgs unknown
>>>>>>>> >              2048 unknown
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > > ceph health detail
>>>>>>>> >
>>>>>>>> > HEALTH_WARN 
>>>>>>>> > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover 
>>>>>>>> > flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data 
>>>>>>>> > availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are 
>>>>>>>> > using a lot of disk space
>>>>>>>> > OSDMAP_FLAGS 
>>>>>>>> > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover 
>>>>>>>> > flag(s) set
>>>>>>>> > OSD_NEARFULL 1 nearfull osd(s)
>>>>>>>> >     osd.63 is near full
>>>>>>>> > POOL_NEARFULL 3 pool(s) nearfull
>>>>>>>> >     pool 'dteam' is nearfull
>>>>>>>> >     pool 'atlas' is nearfull
>>>>>>>> >     pool 'atlas-localgroup' is nearfull
>>>>>>>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive
>>>>>>>> >     pg 13.1ef is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f0 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f1 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f2 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f3 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f4 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f5 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f6 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f7 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f8 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1f9 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1fa is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1fb is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1fc is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1fd is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1fe is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 13.1ff is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1ec is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f0 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f1 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f2 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f3 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f4 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f5 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f6 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f7 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f8 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1f9 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1fa is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1fb is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1fc is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1fd is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1fe is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 14.1ff is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1ed is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f0 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f1 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f2 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f3 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f4 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f5 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f6 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f7 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f8 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1f9 is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1fa is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1fb is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1fc is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1fd is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1fe is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> >     pg 15.1ff is stuck inactive for 89.322981, current state 
>>>>>>>> > unknown, last acting []
>>>>>>>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk 
>>>>>>>> > space
>>>>>>>> >     mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> >     mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> >     mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB)
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > > ceph versions
>>>>>>>> >
>>>>>>>> > {
>>>>>>>> >     "mon": {
>>>>>>>> >         "ceph version 14.2.18 
>>>>>>>> > (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3
>>>>>>>> >     },
>>>>>>>> >     "mgr": {
>>>>>>>> >         "ceph version 14.2.18 
>>>>>>>> > (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1
>>>>>>>> >     },
>>>>>>>> >     "osd": {
>>>>>>>> >         "ceph version 14.2.10 
>>>>>>>> > (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1,
>>>>>>>> >         "ceph version 14.2.15 
>>>>>>>> > (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188,
>>>>>>>> >         "ceph version 14.2.16 
>>>>>>>> > (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18,
>>>>>>>> >         "ceph version 14.2.18 
>>>>>>>> > (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122
>>>>>>>> >     },
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > >>>>>>
>>>>>>>> >
>>>>>>>> > As a note, the log where the mgr explodes (which precipitated all of 
>>>>>>>> > this) definitely shows the problem occurring on the 12th [when 
>>>>>>>> > 14.2.17 dropped], but things didn't "break" until we tried upgrading 
>>>>>>>> > OSDs to 14.2.18...
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Sam
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> Hi Dan:
>>>>>>>> >>
>>>>>>>> >> Thanks for the reply - at present, our mons and mgrs are off 
>>>>>>>> >> [because of the unsustainable nature of the filesystem usage]. 
>>>>>>>> >> We'll try putting them on again for long enough to get "ceph 
>>>>>>>> >> status" out of them and reply at that point, but the mgr was 
>>>>>>>> >> unable to actually talk to anything.
>>>>>>>> >>
>>>>>>>> >> (And thanks for the link to the bug tracker - I guess this mismatch 
>>>>>>>> >> of expectations is why the devs are so keen to move to 
>>>>>>>> >> containerised deployments where there is no co-location of 
>>>>>>>> >> different types of server, as it means they don't need to worry as 
>>>>>>>> >> much about the assumptions about when it's okay to restart a 
>>>>>>>> >> service on package update. Disappointing that it seems stale after 
>>>>>>>> >> 2 years...)
>>>>>>>> >>
>>>>>>>> >> Sam
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster 
>>>>>>>> >> <[email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi Sam,
>>>>>>>> >>>
>>>>>>>> >>> The daemons restart (for *some* releases) because of this:
>>>>>>>> >>> https://tracker.ceph.com/issues/21672
>>>>>>>> >>> In short, if the selinux module changes, and if you have selinux
>>>>>>>> >>> enabled, then midway through yum update, there will be a systemctl
>>>>>>>> >>> restart ceph.target issued.
>>>>>>>> >>>
>>>>>>>> >>> For the rest -- I think you should focus on getting the PGs all
>>>>>>>> >>> active+clean as soon as possible, because the degraded and remapped
>>>>>>>> >>> states are what leads to mon / osdmap growth.
>>>>>>>> >>> This kind of scenario is why we wrote this tool:
>>>>>>>> >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
>>>>>>>> >>> It will use pg-upmap-items to force the PGs to the OSDs where they 
>>>>>>>> >>> are
>>>>>>>> >>> currently residing.
>>>>>>>> >>>
>>>>>>>> >>> But there is some clarification needed before you go ahead with 
>>>>>>>> >>> that.
>>>>>>>> >>> Could you share `ceph status`, `ceph health detail`?
>>>>>>>> >>>
>>>>>>>> >>> Cheers, Dan
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <[email protected]> 
>>>>>>>> >>> wrote:
>>>>>>>> >>> >
>>>>>>>> >>> > Hi everyone:
>>>>>>>> >>> >
>>>>>>>> >>> > I posted to the list on Friday morning (UK time), but apparently 
>>>>>>>> >>> > my email
>>>>>>>> >>> > is still in moderation (I have an email from the list bot 
>>>>>>>> >>> > telling me that
>>>>>>>> >>> > it's held for moderation but no updates).
>>>>>>>> >>> >
>>>>>>>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - 
>>>>>>>> >>> > I'm posting
>>>>>>>> >>> > again.
>>>>>>>> >>> >
>>>>>>>> >>> > To save retyping the whole thing, I will direct you to a copy of 
>>>>>>>> >>> > the email
>>>>>>>> >>> > I wrote on Friday:
>>>>>>>> >>> >
>>>>>>>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt
>>>>>>>> >>> >
>>>>>>>> >>> > (Since that was sent, we did successfully add big SSDs to the 
>>>>>>>> >>> > MON hosts so
>>>>>>>> >>> > they don't fill up their disks with store.db s).
>>>>>>>> >>> >
>>>>>>>> >>> > I would appreciate any advice - assuming this also doesn't get 
>>>>>>>> >>> > stuck in
>>>>>>>> >>> > moderation queues.
>>>>>>>> >>> >
>>>>>>>> >>> > --
>>>>>>>> >>> > Sam Skipsey (he/him, they/them)
>>>>>>>> >>> > _______________________________________________
>>>>>>>> >>> > ceph-users mailing list -- [email protected]
>>>>>>>> >>> > To unsubscribe send an email to [email protected]
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Sam Skipsey (he/him, they/them)
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Sam Skipsey (he/him, they/them)
>>>>>>>> >
>>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sam Skipsey (he/him, they/them)
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sam Skipsey (he/him, they/them)
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Sam Skipsey (he/him, they/them)
>>>
>>>
>
>
> --
> Sam Skipsey (he/him, they/them)
>
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
