I don't understand a single use case where I would want updating my packages
with yum, apt, etc. to restart a ceph daemon.  ESPECIALLY when there are so
many clusters out there with multiple types of daemons running on the same
server.

My home setup is 3 nodes, each running 3 OSDs, a MON, and an MDS.  If
upgrading the packages restarts all of those daemons at once, then I'm
mixing MON versions, OSD versions, and MDS versions every time I upgrade my
cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
then clients.
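
For reference, the order I follow looks roughly like this (a sketch,
assuming the stock systemd targets the packages ship with):

    # one MON node at a time, waiting for quorum to re-form after each:
    sudo systemctl restart ceph-mon.target
    # then one OSD node at a time, waiting for HEALTH_OK in between:
    sudo systemctl restart ceph-osd.target
    # then the MDS daemons, then clients:
    sudo systemctl restart ceph-mds.target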

Now let's take the Luminous upgrade, which REQUIRES you to upgrade all of
your MONs before anything else... I'm screwed.  I literally can't perform
the upgrade if it's going to restart all of my daemons, because it is
impossible for me to achieve a Paxos quorum of MONs running the Luminous
binaries BEFORE I upgrade any other daemon in the cluster.  The only way to
achieve that is to stop the entire cluster and every daemon, upgrade all of
the packages, then start the MONs, then start the rest of the cluster
again... There is no way that is a desired behavior.
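
Spelled out, that forced full-outage sequence would be something like this
(a sketch; ceph.target is the stock unit that covers every daemon on a
node):

    # on every node -- the entire cluster goes down:
    sudo systemctl stop ceph.target
    sudo yum update ceph          # or apt, etc.
    # start the MONs first, on every MON node:
    sudo systemctl start ceph-mon.target
    # only then bring up everything else:
    sudo systemctl start ceph.target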

All of this is ignoring large clusters using something like Puppet to
manage their package versions.  I want to be able to just update the ceph
version and push that out to the cluster.  That installs the new packages
across the entire cluster, and then my automated scripts can perform a
rolling restart of the cluster, upgrading all of the daemons while ensuring
that the cluster is healthy every step of the way.  I don't want to add the
time of installing the packages on every node DURING the upgrade.  I want
that done before I initiate my script, so that I'm in a mixed-version state
for as little time as possible.
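
The rolling-restart part of those scripts boils down to something like this
(a minimal sketch; the node names are made up):

    for node in ceph1 ceph2 ceph3; do
        ssh "$node" sudo systemctl restart ceph-osd.target
        # don't touch the next node until the cluster is healthy again
        until ceph health | grep -q HEALTH_OK; do sleep 10; done
    done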

Claiming that a Ceph daemon restart triggered by anything other than a
command issued specifically to restart it is anything but a bug and
undesirable sounds crazy to me.  I don't ever want anything restarting my
Ceph daemons that was not explicitly told to do so.  That is just begging
to put my entire cluster into a world of hurt by accidentally restarting
too many daemons at the same time and making the data in my cluster
inaccessible.

I'm used to the Ubuntu side of things.  I've never seen upgrading the Ceph
packages affect a running daemon before.  If that's actually a thing that
is done on purpose in RHEL and CentOS... good grief! That's ridiculous!

On Fri, Sep 15, 2017 at 6:06 PM Vasu Kulkarni <[email protected]> wrote:

> On Fri, Sep 15, 2017 at 2:10 PM, David Turner <[email protected]>
> wrote:
> > I'm glad that worked for you to finish the upgrade.
> >
> > He has multiple MONs, but all of them are on nodes with OSDs as well.
> > When he updated the packages on the first node, it restarted the MON
> > and all of the OSDs.  This is strictly not supported in the Luminous
> > upgrade, as the OSDs can't be running Luminous code until all of the
> > MONs are running Luminous.  I have never seen updating Ceph packages
> > cause a restart of the daemons, because you need to schedule the
> > restarts and wait until the cluster is back to healthy before
> > restarting the next node to upgrade the daemons.  If upgrading the
> > packages is causing a restart of the Ceph daemons, it is most
> > definitely a bug and needs to be fixed.
>
> The current spec file says that unless CEPH_AUTO_RESTART_ON_UPGRADE is
> set to "yes" it shouldn't restart, but I remember it restarting in my
> own testing as well.  That said, I see no harm, since the underlying
> binaries have changed, and for a cluster running with redundancy a
> service restart shouldn't cause any issue.  But maybe it's still useful
> for some use cases.
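>
> For anyone who wants to check their own nodes, something like this
> should show it (assuming the default sysconfig location the spec file
> sources):
>
>     # "no" (or unset) should prevent the automatic restart on upgrade
>     grep CEPH_AUTO_RESTART_ON_UPGRADE /etc/sysconfig/ceph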
>
>
> >
> > On Fri, Sep 15, 2017 at 4:48 PM David <[email protected]> wrote:
> >>
> >> Happy to report I got everything up to Luminous; I used your tip to
> >> keep the OSDs running, David, thanks again for that.
> >>
> >> I'd say this is a potential gotcha for people co-locating MONs.  It
> >> appears that if you're running selinux, even in permissive mode,
> >> upgrading the ceph-selinux packages forces a restart of all the OSDs.
> >> You're left with a load of OSDs down that you can't start, as you
> >> don't have a Luminous mon quorum yet.
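> >>
> >> Worth confirming your selinux mode before you start, e.g.:
> >>
> >>     getenforce    # prints Enforcing, Permissive, or Disabled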
> >>
> >>
> >> On 15 Sep 2017 4:54 p.m., "David" <[email protected]> wrote:
> >>
> >> Hi David
> >>
> >> I like your thinking! Thanks for the suggestion. I've got a maintenance
> >> window later to finish the update so will give it a try.
> >>
> >>
> >> On Thu, Sep 14, 2017 at 6:24 PM, David Turner <[email protected]>
> >> wrote:
> >>>
> >>> This isn't a great solution, but something you could try.  If you
> >>> stop all of the daemons via systemd and start each one manually in
> >>> the foreground of its own screen session, I don't think that yum
> >>> updating the packages can stop or start the daemons.  You can copy
> >>> and paste the running command (viewable in ps) to know exactly what
> >>> to run in each screen to start the daemons this way.
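> >>>
> >>> For a single OSD it would look roughly like this (a sketch; the
> >>> exact command line is whatever ps shows for your daemons, and the
> >>> osd id here is made up):
> >>>
> >>>     sudo systemctl stop ceph-osd@3
> >>>     screen -dmS osd3 /usr/bin/ceph-osd -f --cluster ceph --id 3 \
> >>>         --setuser ceph --setgroup ceph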
> >>>
> >>> On Wed, Sep 13, 2017 at 6:53 PM David <[email protected]> wrote:
> >>>>
> >>>> Hi All
> >>>>
> >>>> I did a Jewel -> Luminous upgrade on my dev cluster and it went very
> >>>> smoothly.
> >>>>
> >>>> I've attempted to upgrade on a small production cluster but I've hit a
> >>>> snag.
> >>>>
> >>>> After installing the ceph 12.2.0 packages with "yum install ceph"
> >>>> on the first node and accepting all the dependencies, I found that
> >>>> all the OSD daemons, the MON, and the MDS running on that node were
> >>>> terminated.  Systemd appears to have attempted to restart them all,
> >>>> but the daemons didn't start successfully (not surprising, as the
> >>>> first stage of upgrading all mons in the cluster was not completed).
> >>>> I was able to start the MON and it's running.  The OSDs are all down
> >>>> and I'm reluctant to attempt to start them without upgrading the
> >>>> other MONs in the cluster.  I'm also reluctant to attempt upgrading
> >>>> the remaining 2 MONs without understanding what happened.
> >>>>
> >>>> The cluster is on Jewel 10.2.5 (as was the dev cluster).
> >>>> Both clusters are running on CentOS 7.3.
> >>>>
> >>>> The only obvious difference I can see between dev and production is
> >>>> that production has selinux running in permissive mode, while dev
> >>>> had it disabled.
> >>>>
> >>>> Any advice on how to proceed at this point would be much
> >>>> appreciated.  The cluster is currently functional, but I have 1 node
> >>>> out of 4 with all OSDs down.  I had noout set before the upgrade and
> >>>> I've left it set for now.
> >>>>
> >>>> Here's the journalctl right after the packages were installed
> >>>> (hostname changed):
> >>>>
> >>>> https://pastebin.com/fa6NMyjG
> >>>>