On Tue, Dec 5, 2017 at 1:20 PM, Wido den Hollander <w...@42on.com> wrote:
> Hi,
>
> I haven't tried this before, but I expect it to work; I wanted to check 
> before proceeding.
>
> I have a Ceph cluster which is running with manually formatted FileStore XFS 
> disks, Jewel, sysvinit and Ubuntu 14.04.
>
> I would like to upgrade this system to Luminous, but since I have to 
> re-install all servers and re-format all disks I'd like to move it to 
> BlueStore at the same time.
>
> This system however has 768 3TB disks and a utilization of about 60%. As 
> you can guess, it will take a long time before all the backfills complete.
>
> The idea is to take a machine down, wipe all disks, re-install it with Ubuntu 
> 16.04 and Luminous and re-format the disks with BlueStore.
>
> The OSDs come back, start to backfill and we wait.
Are you OUT'ing the OSDs or removing them altogether (ceph osd crush
remove + ceph osd rm)?

I've noticed that when you remove them completely the data movement is
much bigger.
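
Roughly what I mean, just as a sketch (osd.12 is a made-up example id):

  # mark out only - the OSD keeps its CRUSH entry, so only its own PGs move
  ceph osd out 12

  # remove completely - the host's CRUSH weight shrinks, so other PGs get
  # remapped as well, which is where the bigger data movement comes from
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12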

>
> My estimation is that we can do one machine per day, but we have 48 machines 
> to do. Realistically this will take ~60 days to complete.

That seems a bit optimistic to me, but it depends on how aggressive you are
and how busy those spindles are.
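
"Aggressive" mostly comes down to the backfill throttles. Just as a sketch of
the knobs I mean (the values are examples, not a recommendation):

  # gentle while the cluster is busy
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  # more aggressive during quiet hours
  ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 4'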

>
> Afaik running Jewel (10.2.10) mixed with Luminous (12.2.2) should work just 
> fine, but I wanted to check if there are any caveats I don't know about.
>
> I'll upgrade the MONs to Luminous first before starting to upgrade the OSDs. 
> Between each machine I'll wait for a HEALTH_OK before proceeding allowing the 
> MONs to trim their datastore.

You have to: as far as I've seen, upgrading just one of the MONs to Luminous
is not enough - the new OSDs running Luminous refuse to start until *ALL*
MONs are running Luminous.
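
One way to double-check that before you start reformatting OSDs (the first
command only exists once the MONs are on Luminous; mon.a is a placeholder,
run it per MON):

  ceph versions              # Luminous+: version breakdown per daemon type
  ceph tell mon.a version    # works on Jewel too

And since you're waiting for HEALTH_OK between machines anyway, a simple
loop like

  while ! ceph health | grep -q HEALTH_OK; do sleep 60; done

saves staring at the screen.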

>
> The question is: Does it hurt to run Jewel and Luminous mixed for ~60 days?
>
> I think it won't, but I wanted to double-check.

I thought the same. I was running 10.2.3 and doing pretty much the same
thing to upgrade to 10.2.7, so staying on Jewel. The process was much like
yours, but I had to pause for a month half way through (because of unrelated
issues), and every so often the cluster would just stop: at least one of the
OSDs would stop responding and pile up slow requests, even though it was
idle. It hit random OSDs, happened both on HDD and SSD (this is a
cache-tiered S3 storage cluster), and on either version. I tried injectargs,
but it gave me no useful output - it just printed as if the OSD was idle.
Restarting the OSD made it spring back to life...
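
If you do run into the same thing, the OSD admin socket may tell you more
than injectargs did for me - something along these lines, run on the host of
whichever OSD is blocking (osd.NN is a placeholder):

  ceph daemon osd.NN dump_ops_in_flight
  ceph daemon osd.NN dump_historic_ops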

So I'm not sure whether you'll hit similar issues, but I'm now avoiding
mixed versions as much as I can.

>
> Wido