Hi,

You can't form quorum with your monitors on cuttlefish if you're mixing
< 0.61.5 with any 0.61.5+ (see the section about 0.61.5 in
https://ceph.com/docs/master/release-notes/ ).
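For reference, a quick way to confirm what each monitor is actually
running (the admin socket path below is the default one - adjust it if
your setup differs):

    # Package/binary version on each mon host:
    ceph-mon --version

    # Version of the running daemon, via its admin socket:
    ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok version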
I'd advise installing pre-0.61.5, forming quorum, and then upgrading to
0.61.9 (if need be) - and then the latest dumpling on top. A rough
sketch of that sequence is at the end of this mail.

Cheers,
Martin

On Fri, Feb 28, 2014 at 2:09 AM, Marc <[email protected]> wrote:
> Hi,
>
> thanks for the reply. I updated one of the new mons, and after a
> reasonably long init phase (inconsistent state) I am now seeing these:
>
> 2014-02-28 01:05:12.344648 7fe9d05cb700 0 cephx: verify_reply coudln't
> decrypt with error: error decoding block for decryption
> 2014-02-28 01:05:12.345599 7fe9d05cb700 0 -- X.Y.Z.207:6789/0 >>
> X.Y.Z.201:6789/0 pipe(0x14e1400 sd=21 :49082 s=1 pgs=5421935 cs=12
> l=0).failed verifying authorize reply
>
> with .207 being the updated mon and .201 being one of the "old" alive
> mons. I guess they don't understand each other? I would rather not try
> to update the mons running on servers that also host OSDs, especially
> since there seem to be communication issues between those versions...
> or am I reading this wrong?
>
> KR,
> Marc
>
> On 28/02/2014 01:32, Gregory Farnum wrote:
> > On Thu, Feb 27, 2014 at 4:25 PM, Marc <[email protected]> wrote:
> >> Hi,
> >>
> >> I was handed a Ceph cluster that had just lost quorum due to 2/3
> >> mons (b, c) running out of disk space (using up 15GB each). We were
> >> trying to rescue this cluster without service downtime. As such we
> >> freed up some space to keep mon b running a while longer, which
> >> succeeded; quorum was restored (a, b), while mon c remained offline.
> >> Even though we have freed up some space on mon c's disk as well,
> >> that mon just won't start. Its log file does say
> >>
> >> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60),
> >> process ceph-mon, pid 27846
> >>
> >> and that's all she wrote - even when starting ceph-mon with -d, mind
> >> you.
> >>
> >> So we had a cluster with 2/3 mons up and wanted to add another mon,
> >> since it was only a matter of time until mon b failed again due to
> >> disk space.
> >>
> >> As such I added mon.g to the cluster, which took a long while to
> >> sync, but now reports running.
> >>
> >> Then mon.h got added for the same reason. mon.h fails to start much
> >> the same as mon.c does.
> >>
> >> Still, that should leave us with 3/5 mons up. However, running "ceph
> >> daemon mon.{g,h} mon_status" on the respective node also blocks; the
> >> only output we get from those are fault messages.
> >>
> >> Ok, so now mon.g apparently crashed:
> >>
> >> 2014-02-28 00:11:48.861263 7f4728042700 -1 mon/Monitor.cc: In
> >> function 'void Monitor::sync_timeout(entity_inst_t&)' thread
> >> 7f4728042700 time 2014-02-28 00:11:48.782305 mon/Monitor.cc: 1099:
> >> FAILED assert(sync_state == SYNC_STATE_CHUNKS)
> >>
> >> ... and now blocks trying to start, much like c and h.
> >>
> >> Long story short: is it possible to add .61.9 mons to a cluster
> >> running .61.2 on the 2 alive mons and all the osds? I'm guessing
> >> this is the last shot at trying to rescue the cluster without
> >> downtime.
> > That should be fine, and is likely (though not guaranteed) to resolve
> > your sync issues -- although it's pretty unfortunate that you're that
> > far behind on the point releases; they fixed a whole lot of sync
> > issues and related things, and you might need to upgrade the existing
> > monitors too in order to get the fixes you need... :/
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
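P.S.: A minimal sketch of that pin-then-upgrade sequence on one mon
host, assuming Debian/Ubuntu packages from the ceph.com repos and a
sysvinit-style init script; the exact version string, package name and
service invocation depend on your distro and init system:

    # Match the surviving mons first - the version string below is only
    # an example; use whatever "ceph-mon --version" reports on the old
    # mons:
    apt-get update
    apt-get install ceph=0.61.2-1precise

    # Once all mons are in quorum, upgrade one mon host at a time to
    # the latest cuttlefish point release:
    apt-get install ceph
    service ceph restart mon.g

    # Verify the mon rejoined quorum before touching the next host:
    ceph mon stat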
