Hi all,

Yesterday we performed a Ceph upgrade on a 3-node Proxmox 4.4 cluster, from Hammer to Jewel, following the procedure in the wiki:
https://pve.proxmox.com/wiki/Ceph_Hammer_to_Jewel

It went smoothly for the first two nodes, but we hit a serious problem on the 3rd: when shutting down the OSDs on that node, only one of them registered as "down" with the Ceph monitors, although all three OSDs on that node were effectively down (no processes running!).

The OSDs couldn't be started again right away, because we first had to chown the data files/directories, which took quite a long time (about 1 hour), so VMs trying to read from or write to those 2 phantom OSDs just froze.

We had downtime, but no data was lost, and we got everything back to working condition as soon as the chown command finished :)

* Lessons learned:

I think the procedure described in the wiki could be improved: it should instruct you to stop the Ceph mons and OSDs first, and only after that run apt-get update && apt-get upgrade. That way, if this bug(?) or any other problem surfaces, the OSDs can still be restarted without first doing the long chown work.
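For clarity, the suggested per-node ordering might look roughly like this. This is only a sketch, not the official wiki procedure: it assumes systemd-style Ceph units (on a Hammer install the sysvinit `service ceph stop` commands may apply instead), and unit names should be adjusted to your setup.

```shell
# Proposed order: stop daemons while packages are still at the old
# (Hammer) version, so they can simply be restarted if anything goes
# wrong -- no chown needed yet.
systemctl stop ceph-mon.target
systemctl stop ceph-osd.target

# Verify the cluster really sees this node's OSDs as down before
# proceeding (this is where our phantom "up" OSDs would show):
ceph osd tree

# Only now upgrade the packages:
apt-get update && apt-get dist-upgrade

# Jewel runs the daemons as the 'ceph' user, so ownership must be
# changed before they can start again -- this is the slow step
# (around an hour per node in our case):
chown -R ceph:ceph /var/lib/ceph

# Finally start the upgraded daemons:
systemctl start ceph-mon.target
systemctl start ceph-osd.target
```

The key point is simply that the stop/verify steps come before the package upgrade, so a rollback is a restart rather than a long chown.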

Cheers
Eneko


--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

_______________________________________________
pve-user mailing list
[email protected]
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
