Re: [ceph-users] upgrade procedure to Luminous
On 2017-07-14T15:18:54, Sage Weil wrote:

> Yes, but how many of those clusters can only upgrade by updating the
> packages and rebooting? Our documented procedures have always recommended
> upgrading the packages, then restarting either mons or osds first, and to
> my recollection nobody has complained. TBH my first encounter with the
> "reboot on upgrade" procedure in the Linux world was with Fedora (which I
> just recently switched to for my desktop)--and FWIW it felt very
> anachronistic.

Admittedly, it is. This is my main reason for hoping for containers.

My main issue is not that they must be rebooted. In most cases, ceph-mon can be restarted. My fear is that they *might* be rebooted by a failure during that time; it had been my expectation that normal operation does not expose Ceph to such degraded scenarios. Ceph is, after all, supposed to be tolerant of at least one fault at a time. And I'd obviously have considered upgrades a normal operation, not a critical phase.

If one considers upgrades an operation that degrades redundancy, sure, the current behaviour is in line with that.

> won't see something we haven't. It also means, in this case, that we can
> rip out a ton of legacy code in luminous without having to keep
> compatibility workarounds in place for another whole LTS cycle (a year!).

Seriously, welcome to the world of enterprise software and customer expectations ;-) 1 year! I wish! ;-)

> True, but this is rare, and even so the worst that can happen in this
> case is the OSDs don't come up until the other mons are upgraded. If the
> admin plans to upgrade the mons in succession without lingering with
> mixed-version mons, the worst-case downtime window is very small--and only
> kicks in if *more than one* of the mon nodes fails (taking out OSDs in
> more than one failure domain).

This is an interesting design philosophy in a fault-tolerant distributed system.
> > And customers don't always upgrade all nodes at once in a short period
> > (the benefit of a supposed rolling upgrade cycle), increasing the risk.
>
> I think they should plan to do this for the mons. We can make a note
> stating as much in the upgrade procedure docs?

Yes, we'll have to orchestrate this accordingly. Upgrade all MONs; restart all MONs (while warning users that this is a critical time period); then start rebooting for the kernel/glibc updates.

> Anyway, does that make sense? Yes, it means that you can't just reboot in
> succession if your mons are mixed with OSDs. But this time adding that
> restriction let us do the SnapSet and snapdir conversion in a single
> release, which is a *huge* win and will let us rip out a bunch of ugly OSD
> code. We might not have a need for it next time around (and can try to
> avoid it), but I'm guessing something will come up and it will again be a
> hard call to make, balancing sloppy/easy upgrades against simpler
> code...

The next major transition will probably be from non-containerized L to fully-containerized N(autilus?). That'll be a fascinating can of worms anyway. But it would *really* benefit if nodes could be more easily redeployed rather than just restarting daemon processes.

Thanks, at least now we know this is intentional. That was helpful, at least!

--
Architect SDS
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
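[Editor's note: the ordering discussed above -- upgrade packages everywhere, restart every mon before any node reboots, then reboot hyperconverged nodes -- can be sketched as a small planning helper. This is a hypothetical sketch; the host names, role sets, and action labels are made-up examples, not anything from the thread or an existing tool.]

```python
# Sketch of the restart ordering discussed above: every mon daemon is
# restarted before any node is rebooted, so the quorum can reach
# 'luminous' before OSDs have to re-register. Hosts/roles are examples.

def restart_plan(nodes):
    """nodes: dict mapping hostname -> set of roles ('mon', 'osd').

    Returns an ordered list of (host, action) steps: restart every mon
    first (the critical window), then reboot nodes for the kernel/glibc
    updates, hyperconverged mon hosts included."""
    mons = [h for h, roles in sorted(nodes.items()) if "mon" in roles]
    others = [h for h, roles in sorted(nodes.items()) if "mon" not in roles]
    plan = [(h, "restart ceph-mon") for h in mons]
    # mon hosts that also carry OSDs are rebooted only after every mon
    # is back on the new version
    plan += [(h, "reboot") for h in mons if "osd" in nodes[h]]
    plan += [(h, "reboot") for h in others]
    return plan

# Example: three hyperconverged nodes plus one OSD-only node.
cluster = {
    "node1": {"mon", "osd"},
    "node2": {"mon", "osd"},
    "node3": {"mon", "osd"},
    "node4": {"osd"},
}
steps = restart_plan(cluster)
```

With this layout the plan front-loads the three mon restarts, which is exactly the short critical window being debated in this thread.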
Re: [ceph-users] upgrade procedure to Luminous
On Fri, 14 Jul 2017, Lars Marowsky-Bree wrote:

> On 2017-07-14T14:12:08, Sage Weil wrote:
>
> > > Any thoughts on how to mitigate this, or on whether I got this all wrong
> > > and am missing a crucial detail that blows this wall of text away, please
> > > let me know.
> >
> > I don't know; the requirement that mons be upgraded before OSDs doesn't
> > seem that unreasonable to me. That might be slightly more painful in a
> > hyperconverged scenario (osds and mons on the same host), but it should
> > just require some admin TLC (restart mon daemons instead of
> > rebooting).
>
> I think it's quite unreasonable, to be quite honest. Collocated MONs
> with OSDs is very typical for smaller cluster environments.

Yes, but how many of those clusters can only upgrade by updating the packages and rebooting? Our documented procedures have always recommended upgrading the packages, then restarting either mons or osds first, and to my recollection nobody has complained. TBH my first encounter with the "reboot on upgrade" procedure in the Linux world was with Fedora (which I just recently switched to for my desktop)--and FWIW it felt very anachronistic.

But regardless, the real issue is that this is a trade-off between the testing and software complexity burden vs user flexibility. Enforcing an upgrade order means we have less to test and greater confidence the user won't see something we haven't. It also means, in this case, that we can rip out a ton of legacy code in luminous without having to keep compatibility workarounds in place for another whole LTS cycle (a year!). That reduces code complexity, improves quality, and improves velocity. The downside is that the upgrade procedure has to be done in a particular order.

Honestly, though, I think it is a good idea for operators to be careful with their upgrades anyway.
They should upgrade just the mons, let the cluster stabilize, and make sure things are okay (e.g., no new health warnings saying they have to 'ceph osd set sortbitwise') before continuing. Also, although I think it's a good idea to do the mon upgrade relatively quickly (one after the other until they are all upgraded), the OSD upgrade can be stretched out longer. (We do pretty thorough thrashing tests with mixed-version OSD clusters, but go through the mon upgrades pretty quickly.)

> > Is there something in some distros that *requires* a reboot in order to
> > upgrade packages?
>
> Not necessarily.
>
> *But* once we've upgraded the packages, a failure or reboot might
> trigger this.

True, but this is rare, and even so the worst that can happen in this case is the OSDs don't come up until the other mons are upgraded. If the admin plans to upgrade the mons in succession without lingering with mixed-version mons, the worst-case downtime window is very small--and only kicks in if *more than one* of the mon nodes fails (taking out OSDs in more than one failure domain).

> And customers don't always upgrade all nodes at once in a short period
> (the benefit of a supposed rolling upgrade cycle), increasing the risk.

I think they should plan to do this for the mons. We can make a note stating as much in the upgrade procedure docs?

> I wish we'd already be fully containerized so indeed the MONs were truly
> independent of everything else going on on the cluster, but ...

Indeed! Next time around...

> > Also, this only seems like it will affect users that are getting their
> > ceph packages from the distro itself and not from a ceph.com channel or a
> > special subscription/product channel (this is how the RHEL stuff works, I
> > think).
>
> Even there, upgrading only the MON daemons and not the OSDs is tricky?

I mean you would upgrade all of the packages, but only restart the mon daemons. The deb packages have skipped the auto-restart in the postinst (or whatever) stage for years.
I'm pretty sure the rpms do the same?

Anyway, does that make sense? Yes, it means that you can't just reboot in succession if your mons are mixed with OSDs. But this time adding that restriction let us do the SnapSet and snapdir conversion in a single release, which is a *huge* win and will let us rip out a bunch of ugly OSD code. We might not have a need for it next time around (and can try to avoid it), but I'm guessing something will come up and it will again be a hard call to make, balancing sloppy/easy upgrades against simpler code...

sage
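[Editor's note: the gate Sage describes -- upgrade the mons, then verify the cluster is healthy before touching OSDs -- could be scripted along these lines. The health strings below are illustrative examples of `ceph health` output, and `safe_to_upgrade_osds` is a hypothetical helper, not an existing tool; the blocking keywords are assumptions for the sketch.]

```python
# Hypothetical gate between the mon and OSD upgrade phases: inspect the
# text of `ceph health` (e.g. obtained via subprocess) and refuse to
# continue while blocking warnings are present. Keyword list is an
# illustrative assumption, not an exhaustive or official set.

BLOCKING_KEYWORDS = (
    "sortbitwise",   # e.g. a warning asking for 'ceph osd set sortbitwise'
    "require",       # generic "you must set X before proceeding" warnings
    "HEALTH_ERR",    # never proceed through an error state
)

def safe_to_upgrade_osds(health_output):
    """Return True only when the cluster reports HEALTH_OK, or warnings
    containing none of the blocking keywords."""
    text = health_output.strip()
    if text.startswith("HEALTH_OK"):
        return True
    return not any(k in text for k in BLOCKING_KEYWORDS)
```

A rolling-upgrade wrapper would call this after the last mon restart and before the first OSD restart, pausing the run rather than pressing on through a degraded state.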
Re: [ceph-users] upgrade procedure to Luminous
It was required for Bobtail to Cuttlefish and Cuttlefish to Dumpling.

Exactly how many mons do you have such that you are concerned about failure? If you have, let's say, 3 mons, you update all the bits, and then it shouldn't take you more than 2 minutes to restart the mons one by one. You can take your time updating/restarting the osds.

I generally consider it bad practice to save your system updates for a major ceph upgrade. How exactly can you parse the difference between a ceph bug and a kernel regression if you do them all at once? You have a resilient system; why wouldn't you take advantage of that property and change one thing at a time?

So what we are really talking about here is a hardware failure in the short period it takes to restart mon services, because you shouldn't be rebooting. If the ceph mon doesn't come back from a restart then you have a bug, which in all likelihood will happen on the first mon, and at that point you have the option to roll back or run with degraded mons until Sage et al. put out a fix. My only significant downtime was due to a bug in a new release having to do with pg splitting; 8 hours later I had my fix.

> On Jul 14, 2017, at 10:39 AM, Lars Marowsky-Bree wrote:
>
> On 2017-07-14T10:34:35, Mike Lowe wrote:
>
>> Having run ceph clusters in production for the past six years and upgrading
>> from every stable release starting with argonaut to the next, I can honestly
>> say being careful about order of operations has not been a problem.
>
> This requirement did not exist as a mandatory one for previous releases.
>
> The problem is not the sunshine-all-is-good path. It's about what to do
> in case of failures during the upgrade process.
Re: [ceph-users] upgrade procedure to Luminous
On 07/14/2017 03:12 PM, Sage Weil wrote:
> On Fri, 14 Jul 2017, Joao Eduardo Luis wrote:
>> On top of this all, I found during my tests that any OSD, running luminous
>> prior to the luminous quorum, will need to be restarted before it can
>> properly boot into the cluster. I'm guessing this is a bug rather than a
>> feature though.
>
> That sounds like a bug.. probably didn't subscribe to map updates from
> _start_boot() or something. Can you open an immediate ticket?

http://tracker.ceph.com/issues/20631

-Joao
Re: [ceph-users] upgrade procedure to Luminous
On 2017-07-14T10:34:35, Mike Lowe wrote:

> Having run ceph clusters in production for the past six years and upgrading
> from every stable release starting with argonaut to the next, I can honestly
> say being careful about order of operations has not been a problem.

This requirement did not exist as a mandatory one for previous releases.

The problem is not the sunshine-all-is-good path. It's about what to do in case of failures during the upgrade process.
Re: [ceph-users] upgrade procedure to Luminous
Having run ceph clusters in production for the past six years and upgrading from every stable release starting with argonaut to the next, I can honestly say being careful about order of operations has not been a problem.

> On Jul 14, 2017, at 10:27 AM, Lars Marowsky-Bree wrote:
>
> On 2017-07-14T14:12:08, Sage Weil wrote:
>
>>> Any thoughts on how to mitigate this, or on whether I got this all wrong
>>> and am missing a crucial detail that blows this wall of text away, please
>>> let me know.
>>
>> I don't know; the requirement that mons be upgraded before OSDs doesn't
>> seem that unreasonable to me. That might be slightly more painful in a
>> hyperconverged scenario (osds and mons on the same host), but it should
>> just require some admin TLC (restart mon daemons instead of
>> rebooting).
>
> I think it's quite unreasonable, to be quite honest. Collocated MONs
> with OSDs is very typical for smaller cluster environments.
>
>> Is there something in some distros that *requires* a reboot in order to
>> upgrade packages?
>
> Not necessarily.
>
> *But* once we've upgraded the packages, a failure or reboot might
> trigger this.
>
> And customers don't always upgrade all nodes at once in a short period
> (the benefit of a supposed rolling upgrade cycle), increasing the risk.
>
> I wish we'd already be fully containerized so indeed the MONs were truly
> independent of everything else going on on the cluster, but ...
>
>> Also, this only seems like it will affect users that are getting their
>> ceph packages from the distro itself and not from a ceph.com channel or a
>> special subscription/product channel (this is how the RHEL stuff works, I
>> think).
>
> Even there, upgrading only the MON daemons and not the OSDs is tricky?
Re: [ceph-users] upgrade procedure to Luminous
On 2017-07-14T14:12:08, Sage Weil wrote:

> > Any thoughts on how to mitigate this, or on whether I got this all wrong
> > and am missing a crucial detail that blows this wall of text away, please
> > let me know.
>
> I don't know; the requirement that mons be upgraded before OSDs doesn't
> seem that unreasonable to me. That might be slightly more painful in a
> hyperconverged scenario (osds and mons on the same host), but it should
> just require some admin TLC (restart mon daemons instead of
> rebooting).

I think it's quite unreasonable, to be quite honest. Collocated MONs with OSDs is very typical for smaller cluster environments.

> Is there something in some distros that *requires* a reboot in order to
> upgrade packages?

Not necessarily.

*But* once we've upgraded the packages, a failure or reboot might trigger this.

And customers don't always upgrade all nodes at once in a short period (the benefit of a supposed rolling upgrade cycle), increasing the risk.

I wish we'd already be fully containerized so indeed the MONs were truly independent of everything else going on on the cluster, but ...

> Also, this only seems like it will affect users that are getting their
> ceph packages from the distro itself and not from a ceph.com channel or a
> special subscription/product channel (this is how the RHEL stuff works, I
> think).

Even there, upgrading only the MON daemons and not the OSDs is tricky?
Re: [ceph-users] upgrade procedure to Luminous
On 07/14/2017 03:12 PM, Sage Weil wrote:
> On Fri, 14 Jul 2017, Joao Eduardo Luis wrote:
>> Dear all,
>>
>> The current upgrade procedure to jewel, as stated by the RC's release notes,
>
> You mean (jewel or kraken) -> luminous, I assume...

Yeah. *sigh*

-Joao
Re: [ceph-users] upgrade procedure to Luminous
On Fri, 14 Jul 2017, Joao Eduardo Luis wrote:
> Dear all,
>
> The current upgrade procedure to jewel, as stated by the RC's release notes,

You mean (jewel or kraken) -> luminous, I assume...

> can be boiled down to
>
> - upgrade all monitors first
> - upgrade osds only after we have a **full** quorum, comprised of all the
>   monitors in the monmap, of luminous monitors (i.e., once we have the
>   'luminous' feature enabled in the monmap).
>
> While this is a reasonable idea in principle, reducing a lot of the possible
> upgrade testing combinations, and a simple enough procedure from Ceph's
> point of view, it seems it's not a widespread upgrade procedure.
>
> As far as I can tell, it's not uncommon for users to take this maintenance
> window to perform system-wide upgrades, including kernel and glibc for
> instance, finishing the upgrade with a reboot.
>
> The problem with our current upgrade procedure is that once the first server
> reboots, the osds in that server will be unable to boot, as the monitor
> quorum is not yet 'luminous'.
>
> The only way to minimize potential downtime is to upgrade and restart all
> the nodes at the same time, which can be daunting and basically defeats the
> purpose of a rolling upgrade. And in this scenario, there is an expectation
> of downtime, something Ceph is built to prevent.
>
> Additionally, requiring the `luminous` feature to be enabled in the quorum
> becomes even less realistic in the face of possible failures. God forbid
> that in the middle of upgrading, the last remaining monitor server dies a
> horrible death - e.g., power, network. We'll be left with a still
> 'not-luminous' quorum, and a bunch of OSDs waiting for this flag to be
> flipped. And now it's a race to either get that monitor up, or remove it
> from the monmap.
>
> Even if one were to make the decision of only upgrading system packages,
> rebooting, and then upgrading Ceph packages, there is the unfortunate
> possibility that library interdependencies would require Ceph's binaries
> to be updated, so this may be a show-stopper as well.
>
> Alternatively, if one is to simply upgrade the system and not reboot, and
> then proceed to perform the upgrade procedure, one would still be in a
> fragile position: if, for some reason, one of the nodes reboots, we're in
> the same precarious situation as before.
>
> Personally, I can see two ways out of this, at different positions in the
> reasonability spectrum:
>
> 1. add temporary monitor nodes to the cluster, be they on VMs or bare
>    hardware, already running Luminous, and then remove the same number of
>    monitors from the cluster. This leaves us with a single monitor node to
>    upgrade. This has the drawback of folks not having spare nodes to run
>    the monitors on, or running monitors on VMs -- which may affect their
>    performance during the upgrade window, and increase complexity in terms
>    of firewall and routing rules.
>
> 2. migrate/upgrade all nodes on which Monitors are located first, then only
>    restart them after we've gotten all nodes upgraded. If anything goes
>    wrong, one can hurry through this step or fall back to 3.
>
> 3. Reducing the monitor quorum to 1. This pains me to even think about, and
>    it bothers me to bits that I'm finding myself even considering this as a
>    reasonable possibility. It shouldn't be, because it isn't. But it's a
>    lot more realistic than expecting OSD downtime during an upgrade
>    procedure.
>
> On top of this all, I found during my tests that any OSD, running luminous
> prior to the luminous quorum, will need to be restarted before it can
> properly boot into the cluster. I'm guessing this is a bug rather than a
> feature though.

That sounds like a bug.. probably didn't subscribe to map updates from _start_boot() or something. Can you open an immediate ticket?

> Any thoughts on how to mitigate this, or on whether I got this all wrong
> and am missing a crucial detail that blows this wall of text away, please
> let me know.

I don't know; the requirement that mons be upgraded before OSDs doesn't seem that unreasonable to me. That might be slightly more painful in a hyperconverged scenario (osds and mons on the same host), but it should just require some admin TLC (restart mon daemons instead of rebooting). Also, for large clusters, users often have mons on dedicated hosts. And for small clusters even the sloppy "just reboot" approach will have a smaller impact.

Is there something in some distros that *requires* a reboot in order to upgrade packages?

Also, this only seems like it will affect users that are getting their ceph packages from the distro itself and not from a ceph.com channel or a special subscription/product channel (this is how the RHEL stuff works, I think).

sage
[ceph-users] upgrade procedure to Luminous
Dear all,

The current upgrade procedure to jewel, as stated by the RC's release notes, can be boiled down to

- upgrade all monitors first
- upgrade osds only after we have a **full** quorum, comprised of all the monitors in the monmap, of luminous monitors (i.e., once we have the 'luminous' feature enabled in the monmap).

While this is a reasonable idea in principle, reducing a lot of the possible upgrade testing combinations, and a simple enough procedure from Ceph's point of view, it seems it's not a widespread upgrade procedure.

As far as I can tell, it's not uncommon for users to take this maintenance window to perform system-wide upgrades, including kernel and glibc for instance, finishing the upgrade with a reboot.

The problem with our current upgrade procedure is that once the first server reboots, the osds in that server will be unable to boot, as the monitor quorum is not yet 'luminous'.

The only way to minimize potential downtime is to upgrade and restart all the nodes at the same time, which can be daunting and basically defeats the purpose of a rolling upgrade. And in this scenario, there is an expectation of downtime, something Ceph is built to prevent.

Additionally, requiring the `luminous` feature to be enabled in the quorum becomes even less realistic in the face of possible failures. God forbid that in the middle of upgrading, the last remaining monitor server dies a horrible death - e.g., power, network. We'll be left with a still 'not-luminous' quorum, and a bunch of OSDs waiting for this flag to be flipped. And now it's a race to either get that monitor up, or remove it from the monmap.

Even if one were to make the decision of only upgrading system packages, rebooting, and then upgrading Ceph packages, there is the unfortunate possibility that library interdependencies would require Ceph's binaries to be updated, so this may be a show-stopper as well.
Alternatively, if one is to simply upgrade the system and not reboot, and then proceed to perform the upgrade procedure, one would still be in a fragile position: if, for some reason, one of the nodes reboots, we're in the same precarious situation as before.

Personally, I can see two ways out of this, at different positions in the reasonability spectrum:

1. Add temporary monitor nodes to the cluster, be they on VMs or bare hardware, already running Luminous, and then remove the same number of monitors from the cluster. This leaves us with a single monitor node to upgrade. This has the drawback of folks not having spare nodes to run the monitors on, or running monitors on VMs -- which may affect their performance during the upgrade window, and increase complexity in terms of firewall and routing rules.

2. Migrate/upgrade all nodes on which Monitors are located first, then only restart them after we've gotten all nodes upgraded. If anything goes wrong, one can hurry through this step or fall back to 3.

3. Reducing the monitor quorum to 1. This pains me to even think about, and it bothers me to bits that I'm finding myself even considering this as a reasonable possibility. It shouldn't be, because it isn't. But it's a lot more realistic than expecting OSD downtime during an upgrade procedure.

On top of this all, I found during my tests that any OSD running luminous prior to the luminous quorum will need to be restarted before it can properly boot into the cluster. I'm guessing this is a bug rather than a feature, though.

Any thoughts on how to mitigate this, or on whether I got this all wrong and am missing a crucial detail that blows this wall of text away, please let me know.

-Joao
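[Editor's note: the gating condition in the boiled-down procedure above -- OSDs only after the **full** quorum has the 'luminous' feature in the monmap -- could be checked mechanically along these lines. This is a sketch under assumptions: the dict mirrors the general shape of JSON output from a mon feature query, but the exact key names used here are illustrative, not a verified schema.]

```python
import json

# Sketch of the "is the quorum luminous yet?" gate described above.
# `feature_dump` stands in for parsed JSON from the monitors; the key
# names ("monmap", "persistent") are illustrative assumptions about the
# output's shape, not a verified schema.

def quorum_is_luminous(feature_dump):
    """True once 'luminous' appears among the monmap's persistent
    features, i.e. every monitor in the map has been upgraded."""
    persistent = feature_dump.get("monmap", {}).get("persistent", [])
    return "luminous" in persistent

# Example inputs: mid-upgrade (no luminous feature yet) vs. complete.
mid_upgrade = json.loads('{"monmap": {"persistent": ["kraken"]}}')
done = json.loads('{"monmap": {"persistent": ["kraken", "luminous"]}}')
```

An OSD-upgrade script could poll such a check and hold off restarting OSDs until it returns True, which is exactly the window where a stray reboot causes the trouble described in this thread.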