RE: Fault Tolerant Use cases & Solutions for Job Management in Airavata

Schwartz, Terri Tue, 01 Apr 2014 14:00:32 -0700

I struggled with this in CIPRES and looked at it much the way Amila describes.  
Anywhere I was storing state, I would ask myself, "What happens if CIPRES (or 
its database) crashes right before this or right after this?  What will happen 
when CIPRES starts up again?  Will it assume the operation didn't run and retry 
it, and is that safe to do?"  I generally update state after initiating 
operations, not before, so I don't have to deal with the possibility that we 
said we did something we didn't actually do; I only have to deal with the 
possibility that we kicked something off and didn't manage to record it.

I tried to make operations idempotent as much as possible, sometimes by 
wrapping them in code that looks for signs of a prior attempt and cleans things 
up before proceeding. 
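
A minimal sketch of that pattern in Python, for illustration only. It is not
CIPRES code: the function name submit_idempotent, the in-memory dict standing
in for the database, and the per-job working directory are all assumptions
made for the example.

```python
import os
import shutil


def submit_idempotent(job_id, root, state, launch):
    """Submit job_id in a way that is safe to retry after a crash.

    state  -- dict standing in for the persistent database (assumption)
    launch -- callable(job_id, workdir) that kicks off the real work
    """
    # Already recorded as submitted: a retry has nothing to do.
    if state.get(job_id) == "SUBMITTED":
        return False

    workdir = os.path.join(root, job_id)
    # A leftover directory with no recorded state is the sign of a prior
    # attempt that was kicked off but never recorded; clean it up first.
    if os.path.isdir(workdir):
        shutil.rmtree(workdir)
    os.makedirs(workdir)

    launch(job_id, workdir)        # initiate the operation first...
    state[job_id] = "SUBMITTED"    # ...and record it only afterwards
    return True
```

If the process dies anywhere before the last assignment, the state still says
the job was never submitted, so the next startup retries it and the cleanup
step removes whatever the failed attempt left behind.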

Terri
________________________________________
From: Amila Jayasekara [[email protected]]
Sent: Tuesday, April 01, 2014 1:29 PM
To: [email protected]
Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata

Hmm... If I explain this in PL concepts, a state basically refers to an
environment (a mapping of variables to their values) :-).

But in general applications (like Airavata), the state is represented by
what you persist (provided you persist the right information).

E.g., consider the getExperiments() API call. No matter how many times we
call it, it doesn't change the persisted data in the system. Therefore
getExperiments() doesn't change the state, and we can safely exclude this
method call when analyzing FT. Now consider addExperiment(). This adds an
experiment to persistent storage, so it changes the state. If you are doing
multiple transactions within addExperiment(), you need to consider the
resulting state if the program fails between any two of those transactions.
If the state is inconsistent, then you need to come up with a solution.
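
A rough sketch of that analysis in Python. This is not Airavata's actual
implementation: the two-step add_experiment, the "experiments" and "index"
stores, the crash_after parameter, and the consistency rule are all
hypothetical, chosen only to show how one might enumerate the failure points.

```python
class Crash(Exception):
    """Simulated process failure."""


def add_experiment(store, exp_id, crash_after=None):
    """Two separate writes; crash_after=N simulates a failure after write N."""
    store["experiments"].add(exp_id)       # transaction 1
    if crash_after == 1:
        raise Crash
    store["index"][exp_id] = "CREATED"     # transaction 2
    if crash_after == 2:
        raise Crash


def get_experiments(store):
    """Read-only: never changes persisted data, so it is excluded from FT."""
    return sorted(store["experiments"])


def is_consistent(store):
    """Assumed invariant: every experiment has an index entry and vice versa."""
    return store["experiments"] == set(store["index"])


def analyze():
    """Check the persisted state for each possible failure point."""
    results = {}
    for crash_after in (None, 1, 2):
        store = {"experiments": set(), "index": {}}
        try:
            add_experiment(store, "exp-1", crash_after)
        except Crash:
            pass
        results[crash_after] = is_consistent(store)
    return results
```

Under these assumptions, analyze() reports an inconsistent state only for a
failure between the two writes, which is the case that needs a solution
(e.g. a single transaction, or a cleanup step on restart).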

On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <[email protected]> wrote:

> Are you talking about modeling it similar to a state machine? if not can
> you elaborate what you meant by states in the system?
>
>
> On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <[email protected]> wrote:
>
> > One suggestion is to first identify states in the system. Then identify
> > actions (operation / method invocations) which change the state of the
> > system. Then model FT cases by analyzing system state after and before a
> > failure (during those operation invocations).
> >
> > Thanks
> > Amila
> >
> >
> > On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > We are trying to identify scenarios in job management where it is
> > > critical to provide fault tolerant solutions. The spreadsheet[1]
> > > contains a list of such use cases I have compiled to the best of my
> > > knowledge (which is in no way complete). Thoughts are welcome
> > > (reply/comment or edit the spreadsheet).
> > >
> > > I think it is particularly useful to learn how gateways like
> > > CIPRES/NSG/Ultrascan (which have large user bases) already handle these
> > > situations. The spreadsheet has been updated to record those as well.
> > >
> > > (if you don't have edit privileges just drop me a mail/reply)
> > >
> > > Thanks and Regards,
> > > Saminda
> > >
> > > 1. https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
> > >
> >
>
