> What are the main reasons for not upgrading to the latest and greatest?

Because more often than not it isn't.

I assume that when you write "latest and greatest" you are talking about 
features. When we admins talk about "latest and greatest", we talk about 
stability. The days when one could jump a production system onto a "stable" 
release ending in .2 are long gone. Anyone who becomes an early adopter is 
more and more likely to experience serious issues. That leads to more admins 
holding back on upgrades, which in turn leads to more bugs being discovered 
only in late point releases, which again makes more admins postpone an 
upgrade. A vicious cycle.

A long time ago there was a discussion about exactly this problem, and the 
admins were pretty much in favor of lengthening the release cycle to at least 
4 years, if not longer. There are simply too many releases with too many 
serious bugs that don't get fixed, lately not even during a release's official 
lifetime. Octopus still has serious bugs but is EOL.

I'm not surprised that admins give up on upgrading entirely and stay on a 
version until their system dies.

To give you an example from my own experience: upgrading from latest Mimic to 
latest Octopus. This experience almost certainly applies to every upgrade that 
involves an OSD format change (the infamous "quick fix" that could take 
several days per OSD and crush entire clusters).

There is an OSD conversion involved in this upgrade, and we found out that of 
the two possible upgrade paths, one leads to a heavily performance-degraded 
cluster with no way to recover other than redeploying all OSDs step by step. 
Funnily enough, the problematic procedure is the one described in the 
documentation - it has not been updated to this day, despite users still 
getting caught in this trap.
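For readers who don't know the conversion I mean: a controlled, per-OSD 
variant looks roughly like the sketch below. This is an illustration only, 
not the exact procedure we validated; the option and tool names are from 
the Octopus-era releases, and OSD id 12 is made up. Check your own version's 
documentation before touching anything.

```shell
# Keep a restart on the new version from silently triggering the
# "quick fix" omap conversion on every OSD at once:
ceph config set osd bluestore_fsck_quick_fix_on_mount false

# Later, convert one OSD at a time, offline and at a moment of our choosing:
systemctl stop ceph-osd@12
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-12 --command quick-fix
systemctl start ceph-osd@12
```

The point of doing it this way is that a single slow or failing conversion 
affects one OSD, not the whole cluster at once.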

To give you an idea of the amount of work now involved in trying to avoid 
such pitfalls, here is our path:

We set up a test cluster with a script producing a realistic workload and 
started testing the upgrade under load. It took about a month (with the 
cluster redeployed on Mimic and repopulated from scratch for every run) to 
confirm that we had found a robust path around a number of pitfalls - mainly 
the serious performance degradation due to the OSD conversion, but also an 
issue with stray entries, plus noise. A month! Once we were convinced it 
would work - meaning we had run it a couple of times without discovering any 
further issues - we started upgrading our production cluster.
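Our actual workload script is site-specific, but a minimal sketch of the 
idea - sustained mixed I/O against a throwaway pool while the upgrade 
proceeds - could look like this. The pool name, PG count, and durations are 
assumptions for illustration:

```shell
# Create a scratch pool for the load generator (64 PGs is arbitrary here).
ceph osd pool create upgradetest 64

# Write 4 MiB objects for an hour, keeping them for the read phase:
rados bench -p upgradetest 3600 write --no-cleanup

# Then hammer the same pool with random reads while the upgrade runs:
rados bench -p upgradetest 3600 rand
```

Anything that keeps the cluster genuinely busy during every step of the 
rehearsal will do; the important part is repeating the whole upgrade, under 
that load, from a freshly deployed cluster each time.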

It went smoothly until we started the OSD conversion of our file system 
metadata OSDs. They had a special performance-optimized deployment resulting 
in a large number of 100G OSDs at about 30-40% utilization. These OSDs 
started crashing with some weird corruption. It turns out - thanks Igor! - 
that while spill-over from the fast to the slow drive was handled, the other 
direction was not. Our OSDs crashed because Octopus apparently required 
substantially more space on the slow device and could not use the plentiful 
fast space that was actually available.
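For anyone wanting to check how BlueFS is distributing data across devices 
on their own OSDs before attempting the conversion, something along these 
lines should show it (OSD id 7 is arbitrary; exact admin-socket command 
availability varies by release, so treat this as a pointer, not a recipe):

```shell
# Cluster-wide spill-over warnings, if any:
ceph health detail | grep -i SPILLOVER

# Per-OSD view of BlueFS usage on the fast (DB/WAL) and slow devices:
ceph daemon osd.7 bluefs stats
ceph daemon osd.7 perf dump bluefs
```

Had the reverse direction been visible in those numbers ahead of time, our 
metadata OSDs would not have been a surprise.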

The whole thing ended in 3 days of complete downtime and me working 12-hour 
days through the weekend. We managed to recover only because we had a larger 
hardware delivery already on-site and I could scavenge parts from it.

So the upshot was that, after a month of testing, we still ran into 3 days of 
downtime, because there was yet another unannounced change that broke a 
configuration that had worked fine for years on Mimic.

To say the same thing with different words: major version upgrades have become 
very disruptive and require a lot of effort to get halfway right. And I'm not 
talking about the deployment system here.

Add to this the still-open cases discussed on this list - MDS dentry 
corruption, snapshots disappearing or corrupting, a lack of good built-in 
tools for detection and repair, performance degradation, and so on - none of 
them even addressed in Pacific. In this state the devs are pushing for 
Pacific to go EOL, while at the same time the admins become ever more 
reluctant to upgrade.

In my specific case, I had planned to upgrade at least to Pacific this year, 
but my time budget simply does not allow for verifying the procedure and 
checking that all bugs relevant to us have been addressed. I gave up. Maybe 
next year. Maybe by then it's even a bit closer to rock solid.

So, to get back to my starting point: we admins actually value rock solid 
over features. I know that this is boring for devs, but nothing is worse than 
nobody using your latest and greatest - which was probably the motivation for 
your question. If the upgrade paths were more solid, and questions like "why 
does an OSD conversion not lead to an OSD identical to a freshly deployed 
one?" or "where does the performance go?" were actually tracked down, we 
would be much less reluctant to upgrade.

And then, but only then, would the latest and greatest features be of interest.

I will bring it up here again: with the complexity the code base has reached 
by now, the 2-year release cadence is far too fast; releases do not mature 
quickly enough for admins to upgrade at the same pace. More and more admins 
will be several cycles behind, and we are reaching the point where major bugs 
in so-called EOL versions are only discovered before large clusters have even 
reached that version. That might become a fundamental blocker to upgrades 
entirely.

An alternative to lengthening the release cycle would be to keep more 
releases in the support window instead of only the last two majors. Four 
years really is nothing when it comes to storage.

Hope this is helpful and sheds some light on the mystery of why admins don't 
want to move.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Konstantin Shalygin <[email protected]>
Sent: Monday, May 15, 2023 10:43 AM
To: Tino Todino
Cc: [email protected]
Subject: [ceph-users] Re: CEPH Version choice

Hi,

> On 15 May 2023, at 11:37, Tino Todino <[email protected]> wrote:
>
> What are the main reasons for not upgrading to the latest and greatest?

One of the main reasons - "just can't", because your Ceph-based products will 
get worse at real (not benchmark) performance, see [1]


[1] 
https://lists.ceph.io/hyperkitty/list/[email protected]/thread/2E67NW6BEAVITL4WTAAU3DFLW7LJX477/


k
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]