If there were an obvious problem with the replicated log, the open() call would
fail and cause Aurora to exit or restart itself. I am looking at a
different issue. If there are 3 Aurora instances that need an update, it's
hard to tell right now at which point it's safe to move from one instance to
the next. Let's say a rolling update is in progress, applying the update to one
Aurora instance at a time. One instance is down and out of rotation. Once
it's started and can open the log, it won't crash, and it starts mesos-log
recovery. But if you start upgrading the 2nd instance before the mesos-log
has been replicated to the first one, it's easy to lose quorum and data. I'd
like to have a deterministic check that makes it possible to confirm that it's
safe to consider the log replicated.
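
To make the gating step concrete, here is a rough sketch of what we do today,
written as a reusable check. It only automates the log grep mentioned in my
first mail; the function names and the polling interval are my own, and the
log path is whatever your deployment uses. It is a workaround, not the
deterministic check I am asking for: a stale line from an earlier start would
also match unless the log is rotated on restart.

```python
import time
from typing import Callable, Iterable

# Marker taken from the scheduler's Java process log, as quoted in this thread.
VOTING_MARKER = "Persisted replica status to VOTING"


def replica_reached_voting(log_lines: Iterable[str]) -> bool:
    """Return True if any line reports that the replica persisted VOTING status."""
    return any(VOTING_MARKER in line for line in log_lines)


def log_file_check(path: str) -> Callable[[], bool]:
    """Build a zero-argument check that scans the log file at `path`."""
    def check() -> bool:
        try:
            with open(path, errors="replace") as f:
                return replica_reached_voting(f)
        except FileNotFoundError:
            # Scheduler has not written a log yet; treat as "not ready".
            return False
    return check


def wait_until(check: Callable[[], bool],
               timeout_s: float,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    Returns True if the check passed, False on timeout. `sleep` and `clock`
    are injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A rolling-update script would call something like
`wait_until(log_file_check(path_to_scheduler_log), timeout_s=600)` after
restarting one instance, and move on to the next instance only if it returns
True.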

2016-06-17 16:05 GMT+02:00 Bill Farner <wfar...@apache.org>:

> If I recall correctly, the current implementation of the mesos log requires
> that the callers handle mutually-exclusive access for reads and writes.
> This means that non-leading schedulers may not read or write to perform the
> check you describe.
>
> What's the behavior of the scheduler when it starts and the log replica is
> non-VOTING?  I thought the log open() call would fail, and the scheduler
> process would exit (giving a strong signal that the scheduler is not
> healthy).
>
> On Fri, Jun 17, 2016 at 2:44 AM, Martin Hrabovčin <
> martin.hrabov...@gmail.com> wrote:
>
> > Hello,
> >
> > I was asking the same question in the #aurora channel and I still haven't
> > found an answer, so I am bringing this to the mailing list with a proposal.
> >
> > Is there a way to check the state of mesos-log (whether it's writable, in
> > VOTING state) through some HTTP check outside of the Aurora process on a
> > non-leading Aurora instance? We are trying to create an external check that
> > would determine whether the mesos-log is ready during an Aurora rolling
> > update. When adding a new instance to an existing Aurora cluster, we want to
> > make sure that mesos-log is replicated and the replica is ready to serve
> > reads and writes. Currently we’re grep-ing the Java process log and looking
> > for “Persisted replica status to VOTING”.
> >
> > I was pointed to the /vars endpoint but I haven't found an obvious answer there.
> >
> > I'd like to propose creating a new HTTP endpoint "/loghealth" that would,
> > similarly to "/leaderhealth", return 200 when mesos-log is ready and 503
> > when mesos-log throws an exception. As for implementation, I was thinking
> > about doing a simple read from the log or writing a noop to the log directly.
> >
> > Thanks!
> >
>
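
For reference, this is roughly how an external probe would consume the
"/loghealth" endpoint proposed in the quoted mail. The endpoint does not exist
today; the path and the 200/503 contract are only the proposal, and the
function name and the injectable `urlopen` parameter are my own. Any non-200
answer or connection failure is treated as "not ready".

```python
import urllib.error
import urllib.request


def log_is_healthy(base_url: str,
                   timeout_s: float = 5.0,
                   urlopen=urllib.request.urlopen) -> bool:
    """Return True only if GET <base_url>/loghealth answers with HTTP 200.

    "/loghealth" is the PROPOSED endpoint from this thread, not an existing
    Aurora API. `urlopen` is injectable so the probe can be tested offline.
    """
    url = base_url.rstrip("/") + "/loghealth"
    try:
        with urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # urllib raises for 4xx/5xx, which covers the proposed 503 case.
        return False
    except OSError:
        # Connection refused, DNS failure, timeout: scheduler not reachable.
        return False
```

A probe like this could be called as `log_is_healthy("http://scheduler-host:8081")`
(host and port are placeholders) and plugged into a rolling-update script as
the readiness condition in place of the log grep.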
