I was upgrading a really old cluster from Infernalis (9.2.1) to Jewel
(10.2.3) and ran into some weird but interesting issues. This cluster
started its life on Bobtail: Bobtail -> Dumpling -> Emperor -> Firefly ->
Giant -> Hammer -> Infernalis, and now Jewel.
When I upgraded the first MON (out of 3) everything just worked as it
should. When I upgraded the second, both the first and second crashed.
I reverted the binaries on one of them to Infernalis, deleted the
store.db folder on the other one, started it as Jewel (so I now had
2x Infernalis and 1x Jewel) and let it sync the store. Then I upgraded
the remaining nodes and everything was fine.
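For reference, the recovery dance on the broken mon looked roughly like
this (a sketch from memory; the mon id "mon2", the default cluster name
"ceph", and systemd units are assumptions, not exact values from my
cluster):

```shell
# hypothetical reconstruction of the recovery steps; assumes the default
# cluster name "ceph", a mon id of "mon2", and systemd units
systemctl stop ceph-mon@mon2

# move the store aside rather than deleting it outright, in case the
# resync goes sideways
mv /var/lib/ceph/mon/ceph-mon2/store.db \
   /var/lib/ceph/mon/ceph-mon2/store.db.bak

# start the Jewel mon again and let it sync a fresh store from the two
# remaining (Infernalis) mons
systemctl start ceph-mon@mon2
ceph -s   # watch until the mon rejoins quorum
```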
Or so it mostly seemed, apart from the usual "failed to encode map xxx
with expected crc" warnings.
I had some weird size graphs in Calamari, and looking closer (ceph df) I got:

GLOBAL:
    SIZE    AVAIL    RAW USED    %RAW USED
    10E     932P     5E          52.46
oooh, I got a really big cluster! It's usually a lot smaller (the actual size is 655T).
A snippet cut from "ceph -s":

     health HEALTH_ERR
            1 full osd(s)
      flags full
      pgmap v77393779: 6384 pgs, 26 pools, 66584 GB data, 52605 kobjects
            5502 PB used, 17316 PB / 10488 PB avail
"ceph health detail" shows: osd.89 is full at 266%.
That is one of the OSDs that was being upgraded...
The cluster ended up recovering on its own and is showing the regular
sane values again... but this does seem to indicate some sort of
underlying issue.
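Pure speculation on my part, but the shape of those numbers (used
greater than total, an OSD "266% full", 10E of raw space) smells like a
negative statistics delta getting reinterpreted as an unsigned 64-bit
value somewhere in the pg stats during the mixed-version window. A toy
illustration of what that wraparound looks like (the numbers here are
made up, not pulled from ceph's code):

```shell
# toy illustration only: a small negative difference, if ever read back
# as an unsigned 64-bit value, becomes an absurdly huge number -- the
# same flavour of figure ceph df was showing (10E, 932P, 266% full)
delta=-1014                              # hypothetical negative stat delta
printf 'as signed:   %d\n' "$delta"      # -1014
printf 'as unsigned: %u\n' "$delta"      # wraps to just under 2^64
```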
Has anyone seen such an issue?
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com