Re: support error->drop transition in helix

Terence Yim Fri, 22 Mar 2013 11:53:57 -0700

Sounds good to me too

Terence


On Mar 22, 2013, at 11:46 AM, kishore g <[email protected]> wrote:

> Looks good. Let's make sure we add this to the javadoc of StateModel so that
> users know when those methods are invoked.
> 
> 
> On Fri, Mar 22, 2013 at 11:29 AM, Zhen Zhang <[email protected]> wrote:
> 
>> Hi, I am fine with the FATAL state, but I think we should clearly separate
>> Helix-defined states from user-defined states. Helix-defined states (i.e.
>> ERROR, DROPPED, FATAL) need not be defined in the state model, and the
>> transition logic involving Helix-defined states should be common to all
>> state models. In addition, Helix should provide default implementations for
>> transitions involving Helix-defined states, so that applications that don't
>> care about them need not implement these transitions. Here is what I am
>> thinking of:
>> 
>> - Helix will invoke StateModel.onError() if the current state is any
>> user-defined state and an error occurs in the transition.
>> 
>> - Helix will invoke StateModel.drop() if the current state is ERROR and the
>> target state is DROPPED. If drop() succeeds, ERROR will transition to the
>> initial state and then to DROPPED; otherwise to the FATAL state.
>> 
>> - Helix will invoke StateModel.reset() if the current state is FATAL and we
>> issue a reset command. If reset() succeeds, FATAL will transition to the
>> initial state; otherwise it remains in the FATAL state. Also, reset() should
>> be invoked only by admin commands, so if reset() fails we don't call it
>> indefinitely.
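
A minimal sketch of the three callbacks proposed above. The names `SketchStateModel` and `SystemTransitions` are hypothetical and are not the Helix StateModel API; the no-op default bodies are an assumption based on the "default implementation" point. The intermediate hop through the initial state on a successful drop is left out for brevity.

```java
// Hypothetical sketch only: these types are illustrative, not Helix classes.
abstract class SketchStateModel {
    // No-op defaults, so applications that don't care need not override them.
    void onError(String fromState, String toState, Exception cause) {}
    void drop() throws Exception {}
    void reset() throws Exception {}
}

final class SystemTransitions {
    // ERROR -> DROPPED requested: DROPPED if drop() succeeds, FATAL otherwise.
    static String dropFromError(SketchStateModel model) {
        try {
            model.drop();
            return "DROPPED";
        } catch (Exception e) {
            return "FATAL";
        }
    }

    // Called once per admin reset command, never in a retry loop.
    // FATAL -> initial state if reset() succeeds; otherwise stays FATAL.
    static String resetFromFatal(SketchStateModel model, String initialState) {
        try {
            model.reset();
            return initialState;
        } catch (Exception e) {
            return "FATAL";
        }
    }
}
```

With the defaults in place, `SystemTransitions.dropFromError(new SketchStateModel() {})` yields DROPPED, while a model whose drop() throws ends up in FATAL.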
>> 
>> Thanks,
>> Jason
>> 
>> 
>> On Fri, Mar 22, 2013 at 5:36 AM, Santiago Perez <[email protected]> wrote:
>> 
>>> I personally prefer the FATAL state approach. What do you think Jason?
>>> 
>>> 
>>> On Fri, Mar 22, 2013 at 4:50 AM, kishore g <[email protected]> wrote:
>>> 
>>>> Hi Terence/Jason/Santi,
>>>> 
>>>> Did we come to a conclusion on this? Terence's proposal looks good to me.
>>>> If adding a FATAL state is too invasive, I suggest simply disabling the
>>>> partition on that node and setting a reason for the disabling, for
>>>> auditing/diagnosis. The advantage of this is that once the underlying
>>>> error is rectified, one can enable the partition and the ERROR->DROP
>>>> transition will be invoked. Disabling ensures that even if the node
>>>> restarts, it will not host that partition again.
>>>> 
>>>> thanks,
>>>> Kishore G
>>>> 
>>>> 
>>>> On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]> wrote:
>>>> 
>>>>> I proposed the FATAL state to Kishore before. Let me write it down
>>>>> again for discussion.
>>>>> 
>>>>> 1. An extra state, "FATAL", is introduced. It is a system state, just
>>>>> like the existing ERROR state, which doesn't need to be explicitly
>>>>> defined in the state model.
>>>>> 2. Just like the current implementation, whenever there is any error
>>>>> during a participant state transition, transition the participant into
>>>>> the ERROR state and stay there.
>>>>> 3. Also just like the current implementation, when a given resource is
>>>>> deleted, trigger the state transition CURRENT_STATE -> DROPPED (going
>>>>> through the necessary state transitions based on the state model).
>>>>> 4. For participants whose current state is ERROR, trigger the
>>>>> ERROR->DROPPED transition (there can be a default callback in the
>>>>> StateModel that does nothing in this transition, but that is up for
>>>>> further discussion).
>>>>> 5. If and only if an exception is thrown during the ERROR->DROPPED
>>>>> transition, transition the participant to the FATAL state.
>>>>> 6. When a participant gets into the FATAL state, there is no way for it
>>>>> to get out without human intervention, meaning a human needs to inspect
>>>>> and reset it manually (or through some tools).
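
The six rules above can be read as a small transition function. The following is only a sketch under assumptions: `FatalStateRules` and its methods are hypothetical names, not Helix code, and the state a manual reset returns to (here a caller-supplied initial state) is not specified in the proposal.

```java
// Hypothetical sketch of the proposed rules; not part of Helix.
final class FatalStateRules {
    static final String ERROR = "ERROR";
    static final String DROPPED = "DROPPED";
    static final String FATAL = "FATAL";

    // State after attempting the transition from -> to.
    // `failed` means the participant's transition callback threw.
    static String next(String from, String to, boolean failed) {
        if (FATAL.equals(from)) {
            return FATAL;              // rule 6: no automatic way out of FATAL
        }
        if (!failed) {
            return to;                 // rules 3 and 4: transition succeeded
        }
        if (ERROR.equals(from) && DROPPED.equals(to)) {
            return FATAL;              // rule 5: failure during ERROR->DROPPED
        }
        return ERROR;                  // rule 2: any other failure lands in ERROR
    }

    // Rule 6: only an explicit, human-triggered reset leaves FATAL.
    // The target state is an assumption of this sketch.
    static String manualReset(String current, String initialState) {
        return FATAL.equals(current) ? initialState : current;
    }
}
```

For example, a failed OFFLINE->ONLINE transition gives ERROR, and a subsequent failed ERROR->DROPPED gives FATAL, which is the only path into FATAL.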
>>>>> 
>>>>> With this, there would be changes in the Controller, but no change in
>>>>> the participant if there is nothing to handle specially during the
>>>>> ERROR->DROPPED transition. Also, all error handling would be done through
>>>>> state transitions, which gives the participant a more consistent way of
>>>>> handling different scenarios. This also guarantees that every call is
>>>>> sync and thread safe.
>>>>> 
>>>>> Terence
>>>>> 
>>>>> On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:
>>>>> 
>>>>>> In my proposal FATAL would be a final state, with manual intervention
>>>>>> required.
>>>>>> 
>>>>>> 1) In our use case, the problem arises when a regular transition (say
>>>>>> offline->online) fails and goes to the ERROR state. If the resource then
>>>>>> gets removed, the participant remains in the ERROR state, so we can't
>>>>>> reuse it, because in order to reuse it we need to transition to DROPPED
>>>>>> first.
>>>>>> 2) The thing is, in our use case the drop comes from an API call which
>>>>>> is not synchronized with the cluster management code which could issue
>>>>>> the reset. Also, if we reset it, wouldn't the controller push the
>>>>>> transitions to try to reach the ideal state again (likely triggering the
>>>>>> same issue that led to ERROR)?
>>>>>> 
>>>>>> Thanks
>>>>>> Santi
>>>>>> 
>>>>>> 
>>>>>> On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
>>>>>> 
>>>>>>> If we are going to add a new FATAL state, we might potentially have to
>>>>>>> add FATAL to all state models, and all applications might have to
>>>>>>> implement ERROR->FATAL and FATAL->initial_state transitions.
>>>>>>> 
>>>>>>> On the other hand, I have a couple of questions:
>>>>>>> 1) Why is the ERROR state inevitable in your use case?
>>>>>>> 2) If a partition goes to the ERROR state, could we reset it, so that
>>>>>>> only error partitions get an ERROR->initial_state transition, and then
>>>>>>> drop it? If no error happens during ERROR->initial_state, the error is
>>>>>>> recoverable and the resource will be dropped. Otherwise, if something
>>>>>>> goes wrong with ERROR->initial_state, the participant remains in the
>>>>>>> ERROR state, the drop fails, and the resource is not reusable?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Jason
>>>>>>> 
>>>>>>> On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
>>>>>>> 
>>>>>>>> For our use case that's somewhat problematic. It's still better than
>>>>>>>> the current inability to go from error to dropped, but the problem now
>>>>>>>> is that if something goes wrong when dropping, there's no way to know
>>>>>>>> that from the participant states. And that's actually the only
>>>>>>>> unrecoverable situation for our use case. Basically it means that the
>>>>>>>> participant cannot be reused for another purpose. An alternative
>>>>>>>> solution would be to have a FATAL state that is reached when a failure
>>>>>>>> occurs when transitioning out of the ERROR state.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Santi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I am going to add support for the error->drop transition in Helix.
>>>>>>>>> The basic idea is to remove the DROPPED state from the state model;
>>>>>>>>> instead we add a drop() (or cleanup()) abstract method to StateModel.
>>>>>>>>> Applications need to implement this abstract method to take care of
>>>>>>>>> the drop logic. This requires no change on the controller side. On
>>>>>>>>> the participant side, when the participant receives a
>>>>>>>>> state-transition message with ToState=DROPPED, it will invoke the
>>>>>>>>> drop() method in the state model. When drop() is executed, the
>>>>>>>>> partition will be removed from the current state regardless of any
>>>>>>>>> errors/exceptions during the execution of drop(). This will prevent
>>>>>>>>> an infinite loop of calling drop() in case of an error/exception in
>>>>>>>>> its execution. The advantage of this design is that we can remove the
>>>>>>>>> DROPPED state entirely from all state model definitions, which keeps
>>>>>>>>> the state model simple. The disadvantage is that in drop() the
>>>>>>>>> application needs to apply different drop logic based on the current
>>>>>>>>> state (e.g. MASTER, SLAVE, or ERROR, which will be the FromState in
>>>>>>>>> the message). Any suggestions?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jason
>>
