Re: support error->drop transition in helix

Santiago Perez Fri, 22 Mar 2013 05:37:35 -0700

I personally prefer the FATAL state approach. What do you think, Jason?


On Fri, Mar 22, 2013 at 4:50 AM, kishore g <[email protected]> wrote:

> Hi Terence/Jason/Santi,
>
> Did we come to a conclusion on this? Terence's proposal looks good to me. If
> adding a FATAL state is more invasive, I suggest simply disabling the
> partition on that node and recording a reason for disabling it, for
> auditing/diagnosis. The advantage of this is that once the underlying error
> is rectified, one can enable the partition and the ERROR->DROP transition
> will be invoked. Disabling ensures that even if the node restarts, it will
> not host that partition again.
>
> thanks,
> Kishore G
>
>
> On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]> wrote:
>
> > I proposed the FATAL state to Kishore before. Let me write it down again
> > for discussion.
> >
> > 1. An extra state, "FATAL", is introduced. It is a system state, just like
> > the existing ERROR state, which doesn't need to be explicitly defined in
> > the state model.
> > 2. Just like the current implementation, whenever there is any error during
> > a participant state transition, transition the participant into the ERROR
> > state and stay there.
> > 3. Also just like the current implementation, when a given resource is
> > deleted, trigger a state transition from CURRENT_STATE -> DROPPED (going
> > through the necessary state transitions based on the state model).
> > 4. For participants that have current state = ERROR, trigger an
> > ERROR->DROPPED transition (there can be a default callback in the
> > StateModel that does nothing in this transition, but that's up for further
> > discussion).
> > 5. If and only if an exception is thrown during the ERROR->DROPPED
> > transition, transition the participant to the FATAL state.
> > 6. When a participant gets into the FATAL state, there is no way for it to
> > get out without human intervention, meaning a human needs to inspect and
> > reset it manually (or through some tools).
> >
> > With this, there would be changes in the Controller, but no change in the
> > participant if there is nothing to handle specially during the
> > ERROR->DROPPED transition. Also, all error handling would be done through
> > state transitions, which gives the participant a more consistent way of
> > handling different scenarios. This also guarantees that all calls are
> > synchronous and thread-safe.
> >
> > Terence
> >
> > On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:
> >
> > > In my proposal FATAL would be a final state; manual intervention would
> > > be required.
> > >
> > > 1) In our use case, the problem is that a regular transition (say
> > > offline->online) fails and goes to the ERROR state. If the resource then
> > > gets removed, the participant remains in the "ERROR" state, so we can't
> > > reuse it, because in order to reuse it we need to transition to DROPPED
> > > first.
> > > 2) The thing is, in our use case the drop comes from an API call which is
> > > not synchronized with the cluster management code which could issue the
> > > reset. Also, if we reset it, wouldn't the controller push the transitions
> > > to try to reach the ideal state again (likely triggering the same
> > > issue that led to ERROR)?
> > >
> > > Thanks
> > > Santi
> > >
> > >
> > > On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
> > >
> > > > If we are going to add a new FATAL state, we might potentially have to
> > > > add FATAL to all state models, and all applications might have to
> > > > implement ERROR->FATAL and FATAL->initial_state transitions.
> > > >
> > > > On the other hand, I have a couple of questions:
> > > > 1) Why is the ERROR state inevitable in your use case?
> > > > 2) If a partition goes to the ERROR state, could we reset it, so that
> > > > only error partitions get an ERROR->initial_state transition, and then
> > > > drop it? If no error happens during ERROR->initial_state, the error is
> > > > recoverable and the resource will be dropped. Otherwise, if something
> > > > goes wrong with ERROR->initial_state, the participant remains in the
> > > > ERROR state, the drop fails, and the resource is not reusable?
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
> > > >
> > > > > For our use case that's somewhat problematic. It's still better than
> > > > > the current inability to go from error to dropped, but the problem
> > > > > now is that if something goes wrong when dropping, there's no way to
> > > > > know that from the participant states. And that's actually the only
> > > > > unrecoverable situation for our use case. Basically it means that the
> > > > > participant cannot be reused for another purpose. An alternative
> > > > > solution would be to have a FATAL state that is reached when a failure
> > > > > occurs while transitioning out of the ERROR state.
> > > > >
> > > > > Cheers,
> > > > > Santi
> > > > >
> > > > >
> > > > > On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am going to add support for the error->drop transition in Helix.
> > > > > > The basic idea is to remove the DROPPED state from the state model;
> > > > > > instead we add a drop() (or cleanup()) abstract method to
> > > > > > StateModel. Applications need to implement this abstract method to
> > > > > > take care of the drop logic. This requires no change on the
> > > > > > controller side. On the participant side, when the participant
> > > > > > receives a state-transition message with ToState=DROPPED, it will
> > > > > > invoke the drop() method in the state model. When drop() gets
> > > > > > executed, the partition will be removed from the current state
> > > > > > regardless of any errors/exceptions during the execution of drop().
> > > > > > This will prevent an infinite loop of calling drop() in case of an
> > > > > > error/exception in the execution of drop(). The advantage of this
> > > > > > design is that we can remove the DROPPED state entirely from all
> > > > > > state model definitions, which keeps the state model simple. The
> > > > > > disadvantage is that in drop() the application needs to apply
> > > > > > different drop logic based on the current state (e.g. MASTER, SLAVE,
> > > > > > or ERROR, which will be the FromState in the message). Any
> > > > > > suggestions?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jason
> > > > > >
> > > > >
> > > >
> > >
> >
>
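
For reference, the drop() design from the original message at the bottom of this thread could look roughly like the sketch below. This is plain Java rather than the Helix API: DroppableStateModel, handleMessage, and the map standing in for the participant's current state are hypothetical names made up for illustration. The one behavior it is meant to show is that the partition is removed from the current state whether or not drop() throws.

```java
import java.util.HashMap;
import java.util.Map;

public class DropMethodSketch {
    /** Hypothetical stand-in for a StateModel with the proposed abstract drop() method. */
    interface DroppableStateModel {
        // fromState is the FromState of the message (e.g. MASTER, SLAVE, or ERROR).
        void drop(String fromState) throws Exception;
    }

    /**
     * Hypothetical participant-side handler. currentState maps partition name to state
     * and stands in for the participant's current-state record (an assumption).
     */
    static void handleMessage(Map<String, String> currentState, String partition,
                              String toState, DroppableStateModel model) {
        if (!"DROPPED".equals(toState)) {
            return; // regular transitions are out of scope for this sketch
        }
        String fromState = currentState.get(partition);
        try {
            model.drop(fromState);
        } catch (Exception e) {
            // The failure is only logged; it does not keep the partition around.
            System.err.println("drop() failed for " + partition + ": " + e.getMessage());
        } finally {
            // Removed regardless of errors, so drop() is never retried in a loop.
            currentState.remove(partition);
        }
    }

    public static void main(String[] args) {
        Map<String, String> currentState = new HashMap<>();
        currentState.put("db_0", "ERROR");
        handleMessage(currentState, "db_0", "DROPPED", fromState -> {
            throw new IllegalStateException("cleanup failed from " + fromState);
        });
        System.out.println(currentState.containsKey("db_0")); // prints "false"
    }
}
```

Note that a failed drop() leaves no trace in the participant states here, which is the concern Santi raises about this design.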

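Similarly, the FATAL state flow from Terence's numbered list could be sketched as below. Again this is self-contained Java, not Helix code: the State enum, the Transition interface, and the apply and manualReset helpers are hypothetical. The OFFLINE and ONLINE states are placeholders for an application's state model, and resetting FATAL back to OFFLINE is an assumption, since the thread does not say which state a manual reset returns to.

```java
public class FatalStateSketch {
    // ERROR and FATAL are system states; OFFLINE/ONLINE are placeholder application states.
    enum State { OFFLINE, ONLINE, ERROR, DROPPED, FATAL }

    /** Hypothetical transition callback that may fail. */
    interface Transition {
        void run() throws Exception;
    }

    /** Runs one transition and returns the resulting state (points 2, 4, 5, and 6). */
    static State apply(State from, State to, Transition callback) {
        if (from == State.FATAL) {
            return State.FATAL; // point 6: no automatic way out of FATAL
        }
        try {
            callback.run();
            return to;
        } catch (Exception e) {
            // Point 5: only a failed ERROR->DROPPED transition leads to FATAL.
            // Point 2: any other failed transition parks the participant in ERROR.
            boolean errorToDropped = from == State.ERROR && to == State.DROPPED;
            return errorToDropped ? State.FATAL : State.ERROR;
        }
    }

    /** Point 6: manual reset by an operator; the OFFLINE target is an assumption. */
    static State manualReset(State current) {
        return current == State.FATAL ? State.OFFLINE : current;
    }

    public static void main(String[] args) {
        State s = apply(State.OFFLINE, State.ONLINE, () -> {
            throw new IllegalStateException("transition failed");
        });
        System.out.println(s); // prints "ERROR"

        s = apply(s, State.DROPPED, () -> {
            throw new IllegalStateException("cleanup failed");
        });
        System.out.println(s); // prints "FATAL"

        s = apply(s, State.DROPPED, () -> { });
        System.out.println(s); // prints "FATAL": still stuck until a human steps in

        System.out.println(manualReset(s)); // prints "OFFLINE"
    }
}
```

With a no-op callback for ERROR->DROPPED (the default suggested in point 4), apply(State.ERROR, State.DROPPED, () -> { }) simply returns DROPPED, so participants with nothing to clean up never reach FATAL.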