Sounds good to me too Terence
On Mar 22, 2013, at 11:46 AM, kishore g <[email protected]> wrote:

> Looks good, lets make sure we add this to the javadoc of statemodel so that users know when those methods are invoked.
>
> On Fri, Mar 22, 2013 at 11:29 AM, Zhen Zhang <[email protected]> wrote:
>
>> Hi, I am fine with FATAL state, but I think we should clearly separate helix defined states from user defined states. Helix define states (i.e. ERROR, DROPPED, FATAL) need not to be defined in state model and state transitions logic involving helix defined states should be common to all state models. In addition, helix should provide default implementation for transitions involving helix defined states. In case applications don't care about them, they don't implement these transitions. Here are what I am thinking of:
>>
>> - Helix will invoke StateModel.onError() if current state is any user defined state and error occurs in the transition.
>>
>> - Helix will invoke StateModel.drop() if current state is ERROR and target state is DROPPED. If drop() succeeds, ERROR will transit to initial state and then to DROPPED; otherwise to FATAL state.
>>
>> - Helix will invoke StateModel.reset() if current state is FATAL and we issue a reset command. If reset() succeeds, FATAL will transit to initial state; otherwise remain in FATAL state. Also reset() should be invoked only by admin commands, so in case reset() fails, we don't call it infinitely.
>>
>> Thanks,
>> Jason
>>
>> On Fri, Mar 22, 2013 at 5:36 AM, Santiago Perez <[email protected]> wrote:
>>
>>> I personally prefer the FATAL state approach. What do you think Jason?
>>>
>>> On Fri, Mar 22, 2013 at 4:50 AM, kishore g <[email protected]> wrote:
>>>
>>>> Hi Terence/Jason/Santi,
>>>>
>>>> Did we come to a conclusion on this. Terence proposal looks good to me. If adding FATAL state is more invasive, I suggest simply disabling the partition on that node and set some reason for disabling for auditing/diagnosis. The advantage of this is if the underlying error is rectified then one can enable the partition and transition ERROR->DROP will be invoked. Disabling ensures that even if node restarts it will not host that partition again.
>>>>
>>>> thanks,
>>>> Kishore G
>>>>
>>>> On Mon, Feb 11, 2013 at 8:58 PM, Terence Yim <[email protected]> wrote:
>>>>
>>>>> I proposed the FATAL state to Kishore before. Let me write it down again for discussion.
>>>>>
>>>>> 1. An extra state, "FATAL", is introduced. It is a system state, just like the existing ERROR state, which doesn't need to be explicitly defined in state model.
>>>>> 2. Just like the current implementation, whenever there is any error during participant state transition, transit the participant into ERROR state and stay there.
>>>>> 3. Also just like current implementation, when a given resource is deleted, trigger state transition from CURRENT_STATE -> DROPPED (and goes through necessary state transition based on the state model).
>>>>> 4. For participants that have current state = ERROR, trigger ERROR->DROPPED transition (can have a default callback in the StateModel that do nothing in this transition, but it's up to further discussion).
>>>>> 5. If and only if there is exception thrown during the ERROR->DROPPED transition, transit the participant to FATAL state.
>>>>> 6. When a participant gets into FATAL state, there is no way for it to get out of it without human intervention, meaning a human need to inspect and reset it manually (or through some tools).
>>>>>
>>>>> With this, there would be changes in Controller, but no change in participant if there nothing to specially handled during ERROR->DROPPED transition. Also, all error handling would be done with state transition, which gives the participant more consistent way on handling different scenarios. This also guarantees that every calls are sync and thread safe.
>>>>>
>>>>> Terence
>>>>>
>>>>> On Mon, Feb 11, 2013 at 7:23 PM, Santiago Perez <[email protected]> wrote:
>>>>>
>>>>>> In my proposal FATAL would be a final state, manual intervention required.
>>>>>>
>>>>>> 1) In our use case, the problem is that when a regular transition (say offline->online) fails and goes to error state. if then the resource gets removed, the participant remains in "ERROR" state so we can't reuse it because in order to reuse it we need to transit to dropped first.
>>>>>> 2) The thing is, in our use case the drop comes from an api call which is not synchronized with the cluster management code which could issue the reset. Also, if we reset it, wouldn't the controller push the transitions trying to have reach the ideal state again (likely triggering the same issue that led to ERROR?)
>>>>>>
>>>>>> Thanks
>>>>>> Santi
>>>>>>
>>>>>> On Mon, Feb 11, 2013 at 5:25 PM, Zhen Zhang <[email protected]> wrote:
>>>>>>
>>>>>>> If we are going to add a new FATAL state, we might potentially add FATAL to all state models and all applications might have to implement ERROR->FATAL and FATAL->initial_state transitions.
>>>>>>>
>>>>>>> On the other hand, I have a couple of questions:
>>>>>>> 1) why in your use case, ERROR state is inevitable?
>>>>>>> 2) if a partition goes to ERROR state, could we reset it, so only error partitions will get an ERROR->initial_state transition and then drop it? If no error happens during ERROR->initial_state, the error is recoverable, and the resource will be dropped. otherwise, if something goes wrong with ERROR->initial_state, participant remains in ERROR state, drop failed, and the resource is not reusable?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jason
>>>>>>>
>>>>>>> On Mon, Feb 11, 2013 at 1:47 PM, Santiago Perez <[email protected]> wrote:
>>>>>>>
>>>>>>>> For our use case that's somewhat problematic. It's still better than the current inability to go from error to dropped but the problem is now if something goes wrong when dropping there's no way to know that from the participant states. And that's actually the only unrecoverable situation for our use case. Basically it means that the participant cannot be reused for another purpose. An alternative solution would be to have a FATAL state that is reached when a failure occurs when transitioning out of the ERROR state.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Santi
>>>>>>>>
>>>>>>>> On Wed, Feb 6, 2013 at 1:57 PM, Zhen Zhang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am going to add the support of error->drop transition in Helix. The basic idea is to remove DROPPED state from state model; instead we add a drop() (or cleanup()) abstract method in StateModel. Applications need to implement this abstract method to take care of the drop logic. This requires no change on the controller side. On the participant side, when the participant receives a state-transition message with ToState=DROPPED, it will invoke the drop() method in the state model. When the drop() gets executed, the partition will be removed from the current state regardless of any errors/exceptions during the execution of drop(). This will prevent the infinite loop of calling drop() in case of error/exception in the execution of drop(). The advantage of this design is that we can remove DROPPED state totally from all state model definitions, which keeps the state model simple. The disadvantage is, in drop() the application need to take different drop logics based on the current state (e.g. MASTER, SLAVE, or ERROR, which will be the FromState in the message). Any suggestions?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jason
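
For reference, here is a minimal sketch of the lifecycle Jason proposed on Mar 22 (onError(), drop(), reset() around the helix defined states ERROR, DROPPED and FATAL). It is not the Helix API: the class SystemStatesSketch, the Callbacks interface, the Partition type and its method names are all hypothetical, made up only to show which callback fires in which state and where the partition ends up. It assumes the callbacks default to no-ops, as suggested in the thread, and that a failure is signalled by throwing an exception.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these types exist in Helix. */
final class SystemStatesSketch {

    static final String ERROR = "ERROR";
    static final String DROPPED = "DROPPED";
    static final String FATAL = "FATAL";

    /** Callbacks named after the thread's proposal; defaults do nothing. */
    interface Callbacks {
        default void onError(String fromState, Exception cause) {}
        default void drop() throws Exception {}
        default void reset() throws Exception {}
    }

    /** Tracks one partition's state and the states it has passed through. */
    static final class Partition {
        private final String initialState;
        private final Callbacks callbacks;
        private final List<String> history = new ArrayList<>();
        private String state;

        Partition(String initialState, Callbacks callbacks) {
            this.initialState = initialState;
            this.callbacks = callbacks;
            moveTo(initialState);
        }

        String state() { return state; }
        List<String> history() { return history; }

        private void moveTo(String next) {
            state = next;
            history.add(next);
        }

        private boolean inSystemState() {
            return ERROR.equals(state) || DROPPED.equals(state) || FATAL.equals(state);
        }

        /** A transition out of a user defined state failed: onError(), then ERROR. */
        void transitionFailed(Exception cause) {
            if (inSystemState()) {
                throw new IllegalStateException("onError() applies to user defined states only");
            }
            callbacks.onError(state, cause);
            moveTo(ERROR);
        }

        /** ERROR -> DROPPED was requested: drop(), then initial -> DROPPED, or FATAL on failure. */
        void dropRequested() {
            if (!ERROR.equals(state)) {
                throw new IllegalStateException("this sketch only covers drop from ERROR");
            }
            try {
                callbacks.drop();
                moveTo(initialState);
                moveTo(DROPPED);
            } catch (Exception e) {
                moveTo(FATAL);
            }
        }

        /** Admin reset from FATAL: reset(), then initial state; on failure stay FATAL, no retry. */
        void adminReset() {
            if (!FATAL.equals(state)) {
                throw new IllegalStateException("reset applies to FATAL only");
            }
            try {
                callbacks.reset();
                moveTo(initialState);
            } catch (Exception e) {
                // Stay in FATAL; only another admin command triggers reset() again.
            }
        }
    }

    public static void main(String[] args) {
        Partition p = new Partition("OFFLINE", new Callbacks() {});
        p.transitionFailed(new RuntimeException("OFFLINE->ONLINE failed"));
        p.dropRequested();
        System.out.println(p.history());
    }
}
```

The point of the sketch is that an application that does not care about these transitions implements nothing, and a failing drop() is the only path into FATAL, which matches point 5 of Terence's list.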
