Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata

Marlon Pierce Wed, 02 Apr 2014 04:42:37 -0700

Nice thread.  This would also contribute to Airavata's elasticity in
cloud deployments.


Marlon

On 4/1/14 10:24 PM, Lahiru Gunathilake wrote:
> Actually, I am planning to draw a state diagram and a sequence diagram for
> the Airavata backend. I will post them soon.
>
>
> On Tue, Apr 1, 2014 at 8:55 PM, Saminda Wijeratne <[email protected]> wrote:
>
>> Thanks Amila and Terri for your valuable insights.
>>
>> Combining Terri's and Amila's input, do you think the actions carried out
>> should be managed by internal action states or through states relating to
>> the various stages of an experiment? Do you have any thoughts on which design
>> would be more flexible to follow?
>>
>> One other thing I saw in CIPRES is that you have reduced the risk of the whole
>> system going down because of a failure in one part of the system
>> by separating the main activities into different processes, i.e. the CIPRES
>> portal handles only user requests and 3 independent daemons handle
>> different aspects of job management. Terri, are there any other advantages
>> you expected from this design?
>>
>> Thanks,
>> Saminda
>>
>> On Tue, Apr 1, 2014 at 4:59 PM, Schwartz, Terri <[email protected]> wrote:
>>
>>> I struggled with this in CIPRES and looked at it much like Amila is
>>> saying.  Anywhere I was storing state, I would ask myself, "What happens
>>> if CIPRES (or its database) crashes right before this or right after this?"
>>> What will happen when CIPRES starts up again?  Will it assume the
>>> operation didn't run and retry it, and is that safe to do?  I generally
>>> update state after initiating operations, not before, so I don't have to deal
>>> with the possibility that we said we did something we didn't actually do;
>>> I just have to deal with the possibility that we kicked something off and
>>> didn't manage to record it.
>>>
>>> I tried to make operations idempotent as much as possible, sometimes by
>>> wrapping them in code that looks for signs of a prior attempt and cleans
>>> things up before proceeding.
>>>
>>> Terri
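
Editorial sketch of the pattern Terri describes above (record state only after
initiating the operation, and make a retry safe by cleaning up traces of a
prior attempt first). This is a minimal illustration, not code from CIPRES or
Airavata: FakeScheduler, submit_idempotently, and the dict used as the state
store are all hypothetical names, and "cancel the orphan and resubmit" is just
one possible cleanup policy.

```python
class FakeScheduler:
    """Hypothetical in-memory stand-in for a remote job scheduler."""

    def __init__(self):
        self.running = {}   # job name -> scheduler handle
        self._counter = 0

    def find(self, job_name):
        return self.running.get(job_name)

    def cancel(self, job_name):
        self.running.pop(job_name, None)

    def submit(self, job_name):
        self._counter += 1
        handle = f"h{self._counter}"
        self.running[job_name] = handle
        return handle


def submit_idempotently(job_name, scheduler, state):
    """Submit job_name in a way that is safe to retry after a crash."""
    if state.get(job_name) is not None:
        # The submission was recorded, so it definitely happened.
        return state[job_name]
    # Nothing recorded: either we never submitted, or we crashed after
    # submitting but before recording. Look for signs of a prior attempt.
    if scheduler.find(job_name) is not None:
        scheduler.cancel(job_name)      # clean up the unrecorded attempt
    handle = scheduler.submit(job_name)
    state[job_name] = handle            # record state AFTER initiating
    return handle


scheduler, state = FakeScheduler(), {}
scheduler.submit("job-42")              # simulate: submitted, then crashed
first = submit_idempotently("job-42", scheduler, state)
second = submit_idempotently("job-42", scheduler, state)
```

After the simulated crash, the first call replaces the orphaned submission and
the second call is a no-op, so exactly one job is left running and recorded.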
>>> ________________________________________
>>> From: Amila Jayasekara [[email protected]]
>>> Sent: Tuesday, April 01, 2014 1:29 PM
>>> To: [email protected]
>>> Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in
>>> Airavata
>>>
>>> Hmm... If I explain this in PL concepts, a state basically refers to an
>>> environment (a mapping of variables to their values) :-).
>>>
>>> But in general applications (like Airavata), the state is represented by
>>> what you persist (provided you persist the right information).
>>>
>>> E.g., consider the getExperiments() API call. No matter how many times we
>>> call it, it doesn't change the persisted data in the system. Therefore the
>>> function getExperiments() doesn't change the state, and we can safely
>>> exclude this method call when analyzing FT. Now consider addExperiment().
>>> This adds an experiment to persistent storage, so it changes the state. If
>>> you are doing multiple transactions within addExperiment(), you need to
>>> consider the resulting state if the program fails between any two of those
>>> transactions. If that state is inconsistent, then you need to come up with
>>> a solution.
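
Editorial sketch of Amila's example above. It assumes, purely for
illustration, that adding an experiment writes to two tables; the schema, the
fail_midway flag, and the Python function names are hypothetical and are not
Airavata's API. One possible "solution" to the inconsistent intermediate state
is shown: grouping the writes into a single database transaction, here using
the sqlite3 module from the Python standard library.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE job (experiment_id TEXT, name TEXT)")
    conn.commit()
    return conn


def get_experiments(conn):
    """Read-only call: it never changes the persisted state."""
    rows = conn.execute("SELECT id FROM experiment ORDER BY id")
    return [row[0] for row in rows]


def add_experiment(conn, exp_id, job_names, fail_midway=False):
    """State-changing call: both writes succeed together or not at all."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO experiment VALUES (?)", (exp_id,))
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.executemany(
            "INSERT INTO job VALUES (?, ?)",
            [(exp_id, name) for name in job_names],
        )


conn = make_db()
add_experiment(conn, "exp-1", ["a", "b"])
try:
    add_experiment(conn, "exp-2", ["c"], fail_midway=True)
except RuntimeError:
    pass  # the half-finished exp-2 insert has been rolled back

experiments = get_experiments(conn)
job_count = conn.execute("SELECT COUNT(*) FROM job").fetchone()[0]
```

Because the failed call is rolled back, no experiment is left without its
jobs. If the writes were separate transactions, the failure would leave
"exp-2" persisted with no jobs, which is the inconsistent state to analyze.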
>>>
>>>
>>>
>>>
>>> On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <[email protected]> wrote:
>>>> Are you talking about modeling it similar to a state machine? If not, can
>>>> you elaborate on what you meant by states in the system?
>>>>
>>>>
>>>> On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <[email protected]> wrote:
>>>>> One suggestion is to first identify the states in the system. Then identify
>>>>> the actions (operations / method invocations) which change the state of the
>>>>> system. Then model FT cases by analyzing the system state before and after
>>>>> a failure (during those operation invocations).
>>>>>
>>>>> Thanks
>>>>> Amila
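
Editorial sketch of the analysis Amila suggests above: list the states, list
the actions that change state, then enumerate the persisted state to inspect
when a failure hits each action. The states and actions below are invented
for illustration and are not Airavata's actual job states.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    SUBMITTED = "submitted"
    RUNNING = "running"
    DONE = "done"


# Hypothetical state-changing actions: name -> (state before, state after).
ACTIONS = {
    "submit": (State.CREATED, State.SUBMITTED),
    "start": (State.SUBMITTED, State.RUNNING),
    "finish": (State.RUNNING, State.DONE),
}


def failure_cases(actions):
    """List the persisted state to examine for each possible failure point."""
    cases = []
    for name, (before, after) in actions.items():
        cases.append((name, "failed before the new state was persisted", before))
        cases.append((name, "failed after the new state was persisted", after))
    return cases


cases = failure_cases(ACTIONS)
for action, when, persisted in cases:
    print(f"{action}: {when} -> recovery sees {persisted.name}")
```

Each row is one fault-tolerance case to reason about: given the persisted
state that recovery will see, is it safe to retry the action?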
>>>>>
>>>>>
>>>>> On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <[email protected]> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> We are trying to identify scenarios in job management for which it is
>>>>>> critical to provide fault-tolerant solutions. The spreadsheet [1] contains
>>>>>> a list of such use cases that I have compiled to the best of my knowledge
>>>>>> (it is in no way complete). Thoughts are welcome (reply, comment, or edit
>>>>>> the spreadsheet).
>>>>>>
>>>>>> I think it is particularly useful to learn how gateways like
>>>>>> CIPRES/NSG/Ultrascan (which have large user bases) already handle these
>>>>>> situations. The spreadsheet has been updated to record those as well.
>>>>>>
>>>>>> (if you don't have edit privileges just drop me a mail/reply)
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Saminda
>>>>>>
>>>>>> [1] https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
>
>
