The MTU is the same (9000) across all hosts:
--------- cn01.ceph.la1.clx.corp---------
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.30.11 netmask 255.255.255.0 broadcast 192.168.30.255
inet6 fe80::3e8c:f8ff:feed:728d prefixlen 64 scopeid 0x20<link>
ether 3c:8c:f8:ed:72:8d txqueuelen 1000 (Ethernet)
RX packets 3163785 bytes 2136258888 (1.9 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6890933 bytes 40233267272 (37.4 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
--------- cn02.ceph.la1.clx.corp---------
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.30.12 netmask 255.255.255.0 broadcast 192.168.30.255
inet6 fe80::3e8c:f8ff:feed:ff0c prefixlen 64 scopeid 0x20<link>
ether 3c:8c:f8:ed:ff:0c txqueuelen 1000 (Ethernet)
RX packets 3976256 bytes 2761764486 (2.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9270324 bytes 56984933585 (53.0 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
--------- cn03.ceph.la1.clx.corp---------
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.30.13 netmask 255.255.255.0 broadcast 192.168.30.255
inet6 fe80::3e8c:f8ff:feed:feba prefixlen 64 scopeid 0x20<link>
ether 3c:8c:f8:ed:fe:ba txqueuelen 1000 (Ethernet)
RX packets 13081847 bytes 93614795356 (87.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 4001854 bytes 2536322435 (2.3 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
--------- cn04.ceph.la1.clx.corp---------
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.30.14 netmask 255.255.255.0 broadcast 192.168.30.255
inet6 fe80::3e8c:f8ff:feed:6f89 prefixlen 64 scopeid 0x20<link>
ether 3c:8c:f8:ed:6f:89 txqueuelen 1000 (Ethernet)
RX packets 60018 bytes 5622542 (5.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 59889 bytes 17463794 (16.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
--------- cn05.ceph.la1.clx.corp---------
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.30.15 netmask 255.255.255.0 broadcast 192.168.30.255
inet6 fe80::3e8c:f8ff:feed:7245 prefixlen 64 scopeid 0x20<link>
ether 3c:8c:f8:ed:72:45 txqueuelen 1000 (Ethernet)
RX packets 69163 bytes 8085511 (7.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 73539 bytes 17069869 (16.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
--------- cn06.ceph.la1.clx.corp---------
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.30.16 netmask 255.255.255.0 broadcast 192.168.30.255
inet6 fe80::3e8c:f8ff:feed:feab prefixlen 64 scopeid 0x20<link>
ether 3c:8c:f8:ed:fe:ab txqueuelen 1000 (Ethernet)
RX packets 23570 bytes 2251531 (2.1 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 22268 bytes 16186794 (15.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The links are all 10G.
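
Matching interface MTUs alone don't prove that the path through the new
rack's switches passes jumbo frames, so a quick end-to-end check (a
sketch, using the addresses above) is to ping with the don't-fragment
bit set and the maximum 9000-byte payload, i.e. 8972 data bytes plus 28
bytes of IP/ICMP headers:

for h in 192.168.30.{11..16}; do
  ping -c 2 -M do -s 8972 $h >/dev/null && echo "$h passes 9000"
done

A host that fails at -s 8972 but answers at -s 1472 is behind a port
still set to 1500.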
On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond <[email protected]>
wrote:
> Is the MTU in the new rack set correctly?
>
> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, <[email protected]>
> wrote:
>
>> I transitioned some servers to a new rack and now I'm having major issues
>> with Ceph upon bringing things back up.
>>
>> I believe the issue may be related to the Ceph nodes coming back up with
>> different IPs before the VLANs were set. That's just a guess, because I
>> can't think of any other reason this would happen.
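>>
>> If the IPs did change, the addresses the cluster has on record should
>> show it. A quick sketch, assuming the ceph CLI still responds on a
>> surviving mon host:
>>
>> ceph mon dump | grep addr   # monmap addresses vs. the hosts' current IPs
>> ceph osd dump | grep down   # last registered address of each down OSD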
>>
>> Current state:
>>
>> Every 2.0s: ceph -s        cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
>>
>> cluster:
>> id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>> health: HEALTH_WARN
>> 1 filesystem is degraded
>> 2 MDSs report slow metadata IOs
>> 2/5 mons down, quorum cn02,cn03,cn01
>> 9 osds down
>> 3 hosts (17 osds) down
>> Reduced data availability: 97 pgs inactive, 9 pgs down
>> Degraded data redundancy: 13860144/30824413 objects degraded (44.965%), 411 pgs degraded, 482 pgs undersized
>>
>> services:
>> mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
>> mgr: cn02.arszct(active, since 5m)
>> mds: 2/2 daemons up, 2 standby
>> osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
>>
>> data:
>> volumes: 1/2 healthy, 1 recovering
>> pools: 8 pools, 545 pgs
>> objects: 7.71M objects, 6.7 TiB
>> usage: 15 TiB used, 39 TiB / 54 TiB avail
>> pgs: 0.367% pgs unknown
>> 17.431% pgs not active
>> 13860144/30824413 objects degraded (44.965%)
>> 1137693/30824413 objects misplaced (3.691%)
>> 280 active+undersized+degraded
>> 67 undersized+degraded+remapped+backfilling+peered
>> 57 active+undersized+remapped
>> 45 active+clean+remapped
>> 44 active+undersized+degraded+remapped+backfilling
>> 18 undersized+degraded+peered
>> 10 active+undersized
>> 9 down
>> 7 active+clean
>> 3 active+undersized+remapped+backfilling
>> 2 active+undersized+degraded+remapped+backfill_wait
>> 2 unknown
>> 1 undersized+peered
>>
>> io:
>> client: 170 B/s rd, 0 op/s rd, 0 op/s wr
>> recovery: 168 MiB/s, 158 keys/s, 166 objects/s
>>
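>> To see exactly which OSDs and hosts those warnings refer to (assuming
>> the CLI stays responsive), health detail enumerates them, and the OSD
>> tree can be filtered to just the down entries:
>>
>> ceph health detail
>> ceph osd tree down
>>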
>> I have to disable and re-enable the dashboard just to use it. It seems to
>> get bogged down after a few moments.
>>
>> Ceph has marked the three servers that were moved to the new rack as
>> "Down", but if I run a cephadm host-check, they all seem to pass (more
>> on this below the output):
>>
>> ************************ ceph ************************
>> --------- cn01.ceph.---------
>> podman (/usr/bin/podman) version 4.0.2 is present
>> systemctl is present
>> lvcreate is present
>> Unit chronyd.service is enabled and running
>> Host looks OK
>> --------- cn02.ceph.---------
>> podman (/usr/bin/podman) version 4.0.2 is present
>> systemctl is present
>> lvcreate is present
>> Unit chronyd.service is enabled and running
>> Host looks OK
>> --------- cn03.ceph.---------
>> podman (/usr/bin/podman) version 4.0.2 is present
>> systemctl is present
>> lvcreate is present
>> Unit chronyd.service is enabled and running
>> Host looks OK
>> --------- cn04.ceph.---------
>> podman (/usr/bin/podman) version 4.0.2 is present
>> systemctl is present
>> lvcreate is present
>> Unit chronyd.service is enabled and running
>> Host looks OK
>> --------- cn05.ceph.---------
>> podman|docker (/usr/bin/podman) is present
>> systemctl is present
>> lvcreate is present
>> Unit chronyd.service is enabled and running
>> Host looks OK
>> --------- cn06.ceph.---------
>> podman (/usr/bin/podman) version 4.0.2 is present
>> systemctl is present
>> lvcreate is present
>> Unit chronyd.service is enabled and running
>> Host looks OK
>>
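>> host-check only validates local prerequisites (podman, systemctl,
>> lvcreate, chronyd); it says nothing about whether the daemons can reach
>> each other. A raw reachability sketch from one of the moved hosts,
>> using bash's /dev/tcp against a surviving mon (3300/6789 are the
>> default mon ports, 6800 the first OSD port; 192.168.30.11 as the
>> example target):
>>
>> for p in 3300 6789 6800; do
>>   timeout 2 bash -c "</dev/tcp/192.168.30.11/$p" && echo "port $p open"
>> done
>>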
>> It seems to be recovering with what it has left, but a large number of
>> OSDs are down. When I try to restart one of the downed OSDs, I see a
>> huge dump.
>>
>> Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with init, starting boot process
>> Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot
>> Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot
>> Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot
>> Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot
>>
>> At this point it just keeps printing start_boot, but the dashboard has it
>> marked as "in" but "down".
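>>
>> An OSD that loops on start_boot has finished initializing but never
>> gets marked up by the monitors, which fits a network problem. A sketch
>> for comparing the address osd.34 registered against the host's current
>> IP:
>>
>> ceph osd find 34            # reported host, address, CRUSH location
>> ceph osd dump | grep -w osd.34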
>>
>> On the three hosts that moved, a number of OSDs were marked "out" and
>> "down", and some were "in" but "down".
>>
>> Not sure where to go next. I'm going to let the recovery continue and
>> hope that my 4x replication on these pools saves me.
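>>
>> One option to limit churn while the moved hosts come back is to set
>> noout (and possibly norebalance), then unset both once they rejoin:
>>
>> ceph osd set noout
>> ceph osd set norebalance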
>>
>> Any help is very much appreciated. This Ceph cluster holds all of our
>> Cloudstack images... it would be terrible to lose this data.
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]