RE: Fault Tolerant Use cases & Solutions for Job Management in Airavata

Schwartz, Terri Wed, 02 Apr 2014 07:35:49 -0700

Hi Saminda,

I'm not sure I understand your question, but regarding the 2nd paragraph: as you
said, I wanted to keep problems like memory leaks or remote operations not
timing out promptly from impacting anything else.  Also, the separate
processes can easily be run on different machines if we need to scale that way.


Terri
________________________________________
From: Saminda Wijeratne [[email protected]]
Sent: Tuesday, April 01, 2014 5:55 PM
To: architecture
Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata

Thanks Amila and Terri for your valuable insights.

Combining Terri's and Amila's input, do you think the actions carried out
should be managed by internal action states or through states relating to
various stages of an experiment? Do you have any thoughts on which design
would be more flexible to follow?

One other thing I saw in CIPRES is that you have reduced the risk of the whole
system going down because of a failed operation in one part of the system
by separating the main activities into different processes, i.e. the CIPRES
portal handles only user requests and 3 independent daemons handle
different aspects of job management. Terri, are there any other advantages
you expected from this design?

Thanks,
Saminda

On Tue, Apr 1, 2014 at 4:59 PM, Schwartz, Terri <[email protected]> wrote:

> I struggled with this in CIPRES and looked at it much like Amila is
> saying.  Anywhere I was storing state, I would ask myself, "what happens
> if CIPRES (or its database) crashes right before this or right after this?"
> What will happen when CIPRES starts up again?  Will it assume the
> operation didn't run and retry it, and is that safe to do?  I generally
> update state after initiating operations, not before, so I don't have to
> deal with the possibility that we said we did something we didn't actually
> do; I just have to deal with the possibility that we kicked something off
> and didn't manage to record it.
>
> I tried to make operations idempotent as much as possible, sometimes by
> wrapping them in code that looks for signs of a prior attempt and cleans
> things up before proceeding.
>
> Terri
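
As an illustration of the two habits described above (record state only
after the work has actually been done, and make the operation safe to retry
by cleaning up after a prior attempt), here is a minimal Python sketch. The
function name stage_job_dir, the ".staged" marker file, and the directory
layout are hypothetical; they are not taken from CIPRES or Airavata.

```python
import os
import shutil
import tempfile


def stage_job_dir(work_root, job_id, files):
    """Create a job's working directory in a way that is safe to retry.

    Hypothetical example: the ".staged" marker is the recorded state, and it
    is written only after all files are in place.
    """
    job_dir = os.path.join(work_root, job_id)
    marker = os.path.join(job_dir, ".staged")

    # A marker means an earlier attempt completed; repeating is a no-op.
    if os.path.exists(marker):
        return job_dir

    # A directory without a marker is a sign of a prior attempt that
    # crashed part-way through. Clean it up before proceeding.
    if os.path.isdir(job_dir):
        shutil.rmtree(job_dir)

    os.makedirs(job_dir)
    for name, content in files.items():
        with open(os.path.join(job_dir, name), "w") as fh:
            fh.write(content)

    # Record completion last. A crash before this line leaves no marker,
    # so the next start-up simply retries the whole operation.
    with open(marker, "w") as fh:
        fh.write("ok")
    return job_dir
```

With this ordering, the worst case after a crash is work that was done but
not recorded, and the clean-up step is what makes the retry safe.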
> ________________________________________
> From: Amila Jayasekara [[email protected]]
> Sent: Tuesday, April 01, 2014 1:29 PM
> To: [email protected]
> Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in
> Airavata
>
> Hmm... If I explain this in PL (programming language) concepts, a state
> basically refers to an environment (a mapping of variables to their
> values) :-).
>
> But in general applications (like Airavata) the state is represented by
> what you persist (provided you persist the right information).
>
> E.g. consider the getExperiments() API call. No matter how many times we
> call it, it doesn't change the persisted data in the system. Therefore the
> function getExperiments() doesn't change the state, and we can safely
> exclude this method call when analyzing FT. Now consider addExperiment().
> This adds an experiment to persistent storage, so it changes the state. If
> you are doing multiple transactions within addExperiment(), you need to
> consider the resulting state if the program fails between any two of those
> transactions. If the state is inconsistent, then you need to come up with
> a solution.
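
To make the distinction concrete, here is a minimal Python sketch using the
standard sqlite3 module. The table names, function signatures, and the
fail_between switch are assumptions made for illustration; they do not
reflect Airavata's actual API or schema. Grouping both writes into one
database transaction is only one possible solution to the inconsistency
described above.

```python
import sqlite3


def open_store():
    """Create an in-memory store with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE experiment_input (experiment_id TEXT, value TEXT)")
    conn.commit()
    return conn


def get_experiments(conn):
    """Read-only call: it never changes persisted state."""
    rows = conn.execute("SELECT id FROM experiment ORDER BY id")
    return [row[0] for row in rows]


def add_experiment(conn, exp_id, name, inputs, fail_between=False):
    """State-changing call made of two writes.

    The 'with conn' block commits if both writes succeed and rolls back if
    an exception escapes, so a failure between the writes leaves no
    half-written experiment behind.
    """
    with conn:
        conn.execute("INSERT INTO experiment VALUES (?, ?)", (exp_id, name))
        if fail_between:
            # Simulated failure between the two writes (illustration only).
            raise RuntimeError("simulated crash between writes")
        conn.executemany(
            "INSERT INTO experiment_input VALUES (?, ?)",
            [(exp_id, value) for value in inputs],
        )
```

When the two writes cannot share a transaction (for example, one of them is
a remote call), the state after a failure between them has to be analyzed
and handled explicitly instead.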
>
>
>
>
> On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <[email protected]> wrote:
>
> > Are you talking about modeling it similar to a state machine? If not, can
> > you elaborate on what you meant by states in the system?
> >
> >
> > On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <[email protected]> wrote:
> >
> > > One suggestion is to first identify states in the system. Then identify
> > > actions (operations / method invocations) which change the state of the
> > > system. Then model FT cases by analyzing the system state before and
> > > after a failure (during those operation invocations).
> > >
> > > Thanks
> > > Amila
> > >
> > >
> > > On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > We are trying to identify scenarios in job management where it is
> > > > critical to provide fault-tolerant solutions. The spreadsheet [1]
> > > > contains a list of such use cases that I have compiled to the best of
> > > > my knowledge (it is in no way complete). Thoughts are welcome
> > > > (reply/comment or edit the spreadsheet).
> > > >
> > > > I think it is particularly useful to learn how gateways like
> > > > CIPRES/NSG/Ultrascan (which have a large user base) already handle
> > > > these situations. The spreadsheet has been updated to record those as
> > > > well.
> > > >
> > > > (If you don't have edit privileges, just drop me a mail/reply.)
> > > >
> > > > Thanks and Regards,
> > > > Saminda
> > > >
> > > > 1. https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
