Hi Raman,
On Fri, Dec 6, 2013 at 12:34 PM, Raminder Singh <[email protected]> wrote:

> Lahiru: Can you please start a document to record this conversation? There
> are very valuable points to record and I don't want to lose anything in
> email threads.
>
> My comments are inline with prefix RS>>:
>
> On Dec 5, 2013, at 10:12 PM, Lahiru Gunathilake <[email protected]> wrote:
>
> Hi Amila,
>
> I have answered the questions you raised except some of the how-to
> questions (for those we need to figure out solutions, and before that we
> need to come up with a good design).
>
> On Thu, Dec 5, 2013 at 7:58 PM, Amila Jayasekara <[email protected]> wrote:
>
>> On Thu, Dec 5, 2013 at 2:34 PM, Lahiru Gunathilake <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> We are thinking of implementing an Airavata Orchestrator component to
>>> replace the WorkflowInterpreter, so that gateway developers do not have
>>> to deal with workflows when they simply have a single independent job to
>>> run in their gateways. This component mainly focuses on how to invoke
>>> GFAC and accept requests from the client API.
>>>
>>> I have the following features in mind for this component.
>>>
>>> 1. It provides a web service or REST interface so that we can implement
>>> a client to invoke it and submit jobs.
>>>
> RS>> We need an API method to handle this, and the protocol interfacing of
> the API can be handled separately using Thrift or web services.
>
>>> 2. It accepts a job request and parses the input types; if the input
>>> types are correct, it creates an Airavata experiment ID.
>>>
> RS>> According to me, we need to save every request to the registry before
> verification and record an input configuration error if the inputs are not
> correct. That will help us find any API invocation errors.

+1, we need to save the request to the registry right away.

>>> 3. The Orchestrator then stores the job information in the registry
>>> against the generated experiment ID (all the other components identify
>>> the job using this experiment ID).
>>>
>>> 4. After that the Orchestrator pulls up all the descriptors related to
>>> this request, does some scheduling to decide where to run the job, and
>>> submits the job to a GFAC node (handling multiple GFAC nodes is going to
>>> be a future improvement in the Orchestrator).
>>>
>>> If we are doing pull-based job submission, it might be a good approach
>>> for error handling: if we store jobs in the registry and GFAC pulls jobs
>>> and executes them, the Orchestrator component really doesn't have to
>>> worry about error handling.
>>>
>> I did not quite understand what you meant by "pull based job submission".
>> I believe it is saving the job in the registry and GFAC periodically
>> looking up new jobs and submitting them.
>>
> Yes.
>
> RS>> I think the Orchestrator should call GFAC to invoke the job rather
> than GFAC polling for jobs. The Orchestrator should decide which GFAC
> instance it submits the job to, and if there is a system error then bring
> up or communicate with another instance. I think a pull-based model for
> GFAC will add overhead. We will add another point of failure.

Can you please explain a bit more what you meant by "another point of
failure" and "add an overhead"?

>> Further, why are you saying you don't need to worry about error handling?
>> What sort of errors are you considering?
>>
> I am considering GFAC failures, or the connection between the Orchestrator
> and GFAC going down.
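Just to make points 1 and 2 (and Raminder's comment about saving the request
before validation) a bit more concrete, below is a very rough sketch of the
submit path I have in mind. All of the names here (Registry, Orchestrator,
submitExperiment, and so on) are made up purely for illustration; this is not
a proposed API, only the shape of the flow: persist first, validate second,
always hand back an experiment ID.

    // Hypothetical sketch only -- names and signatures are illustrative,
    // not an agreed Airavata API.
    import java.util.Map;
    import java.util.UUID;

    interface Registry {
        void saveRequest(String experimentId, Map<String, String> inputs);
        void saveValidationError(String experimentId, String message);
    }

    class Orchestrator {
        private final Registry registry;

        Orchestrator(Registry registry) {
            this.registry = registry;
        }

        String submitExperiment(Map<String, String> inputs) {
            // Generate the experiment ID every other component will use
            // to identify this job.
            String experimentId = "EXP-" + UUID.randomUUID();

            // Persist the raw request first, so even invalid invocations
            // leave a traceable record in the registry.
            registry.saveRequest(experimentId, inputs);

            // Naive validation placeholder; real input-type checking would
            // live here. Errors are recorded, not thrown away.
            if (inputs == null || inputs.isEmpty()) {
                registry.saveValidationError(experimentId, "No inputs supplied");
            }
            return experimentId;
        }
    }

The Thrift or REST binding Raminder mentioned would sit in front of something
like this, so the protocol choice stays independent of the submission logic.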
>>> Because we can implement logic in GFAC so that if a particular job is
>>> not updating its status for a given time, it assumes the job is hung or
>>> that the GFAC node handling that job has failed; another GFAC instance
>>> then pulls that job (we definitely need a locking mechanism here, to
>>> avoid two instances executing the hung job) and starts executing it. (If
>>> GFAC is handling a long-running job, it still has to update the job
>>> status frequently with the same status, to show the GFAC node is alive.)
>>>
>> I have some comments/questions in this regard:
>>
>> 1. How are you going to detect that a job is hung?
>>
>> 2. We clearly need to distinguish between faulty jobs and faulty GFAC
>> instances, because GFAC replication should not pick up a job whose own
>> logic is what leads to the hang.
>>
> I haven't seen a hang caused by the job's own logic, but maybe there are
> such cases.
>
>> GFAC replication should pick up the job only if the primary GFAC instance
>> is down. I believe you proposed the locking mechanism to handle this
>> scenario, but I don't see how a locking mechanism is going to resolve
>> this situation. Can you explain more?
>>
> For example, if GFAC has logic to pick up a job which didn't respond
> within a given time, there could be a scenario where two GFAC instances
> try to pick up the same job. E.g., there are 3 GFAC nodes working and one
> goes down with a given job; the two other nodes recognize this at the same
> time and try to launch the same job. I was talking about locks to fix this
> issue.
>
> RS>> One way to handle this is to look at the job walltime. If the
> walltime for a running job has expired and we still don't have the status
> of the job, then we can go ahead and check the status and start cleaning
> up the job.
>
>> 2. According to your description, it seems there is no communication
>> between the GFAC instances and the Orchestrator, so GFAC and the
>> Orchestrator exchange data through the registry (database). Performance
>> might drop since we are going through persistent mediums.
>>
> Yes, you are correct. I am assuming we are mostly focusing on implementing
> a more reliable system; most of these jobs run for hours, and we don't
> need a high-performance system for long-running jobs.
>
> RS>> We need to discuss this. I think the Orchestrator should only
> maintain the state of the request, not GFAC.
>
>> 3. What is the strategy to divide jobs among GFAC instances?
>>
> Not sure, we have to discuss it.
>
>> 4. How do we identify that a GFAC instance has failed?
>>
>> 5. How should GFAC instances be registered with the Orchestrator?
>>
> RS>> We need to have a mechanism which records how many GFAC instances are
> running and how many jobs per instance.

If we are going to do the pull-based model this is going to be a hassle;
otherwise the Orchestrator can keep track of that.

>> 6. How are job cancellations handled?
>>
> RS>> Cancelling a single job is simple; there should be an API function to
> cancel based on the experiment ID and/or local job ID.
>
>> 7. What happens if the Orchestrator goes down?
>>
> This is under the assumption that the Orchestrator doesn't go down (e.g.,
> like a head node in MapReduce).
>
> RS>> I think registration of the job happens outside the Orchestrator, and
> the Orchestrator/GFAC progress the states.

Regards,
Lahiru

--
System Analyst Programmer
PTI Lab
Indiana University
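P.S. To make the walltime/locking discussion a bit more concrete, here is a
rough, purely illustrative sketch of how a conditional update in the registry
could act as the lock when a GFAC instance tries to take over a job whose
heartbeat has gone stale. The table and column names (job_instance,
claimed_by, last_heartbeat) are invented for illustration; the real registry
schema would look different.

    // Hypothetical sketch of the "claim an orphaned job" idea.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.time.Instant;

    class JobClaimer {

        // Tries to take over a job whose heartbeat (or walltime) has
        // expired. The conditional UPDATE is the lock: only the instance
        // whose UPDATE actually matches the stale row sees one affected
        // row and wins the claim.
        boolean tryClaim(Connection db, String experimentId, String myGfacId,
                         Instant staleBefore) throws SQLException {
            String sql =
                "UPDATE job_instance SET claimed_by = ?, last_heartbeat = ? " +
                "WHERE experiment_id = ? AND last_heartbeat < ?";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setString(1, myGfacId);
                ps.setTimestamp(2, Timestamp.from(Instant.now()));
                ps.setString(3, experimentId);
                ps.setTimestamp(4, Timestamp.from(staleBefore));
                return ps.executeUpdate() == 1;  // exactly one instance wins
            }
        }
    }

Because the UPDATE only succeeds while the heartbeat is still stale, even if
two or three GFAC instances notice the hung job at the same time, only one of
them gets "1 row updated" and claims it; the others simply move on.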
