Lahiru,

I added my comments to the Google Doc:
https://docs.google.com/document/d/11fjql09tOiC0NLBaqdhZ9WAiMoBhkBJl7WC1N7DigcU/edit

About the pull model: we don't want to create locking issues at the database 
level, since the Orchestrator and GFAC would be monitoring the same tables. 
Another problem I can see is the delay the pull model adds to user job 
submission: GFAC would have to look for submitted jobs at some polling 
frequency, and that interval delays the handling of every submission. That's 
why I prefer asynchronous submission of the job by the Orchestrator using the 
GFAC SPI.
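
A rough sketch of what I mean (all class and method names here are made up, 
just to illustrate the async hand-off):

// Hypothetical sketch: the Orchestrator hands the job off asynchronously
// through the SPI instead of GFAC polling the registry on a timer.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface GFacService {                   // assumed SPI shape, not the real one
    void submitJob(String experimentId);  // blocking submission on a GFAC node
}

class AsyncSubmitter {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final GFacService gfac;

    AsyncSubmitter(GFacService gfac) { this.gfac = gfac; }

    // Returns immediately; the job starts with no polling-interval delay.
    void submitAsync(String experimentId) {
        pool.submit(() -> gfac.submitJob(experimentId));
    }
}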

Thanks
Raminder

On Dec 9, 2013, at 9:12 AM, Lahiru Gunathilake <[email protected]> wrote:

> Hi Raman,
> 
> 
> On Fri, Dec 6, 2013 at 12:34 PM, Raminder Singh <[email protected]> wrote:
> Lahiru: Can you please start a document to record this conversation? There 
> are very valuable points to record, and I don't want to lose anything in 
> email threads. 
> 
> My comments are inline with prefix RS>>: 
> 
> On Dec 5, 2013, at 10:12 PM, Lahiru Gunathilake <[email protected]> wrote:
> 
>> Hi Amila,
>> 
>> I have answered the questions you raised, except for some of the "how to" 
>> questions (for those we still need to figure out solutions, and before that 
>> we need to come up with a good design).
>> 
>> 
>> On Thu, Dec 5, 2013 at 7:58 PM, Amila Jayasekara <[email protected]> 
>> wrote:
>> 
>> 
>> 
>> On Thu, Dec 5, 2013 at 2:34 PM, Lahiru Gunathilake <[email protected]> wrote:
>> Hi All,
>> 
>> We are thinking of implementing an Airavata Orchestrator component to 
>> replace WorkflowInterpreter, so that gateway developers don't have to deal 
>> with workflows when they simply have a single independent job to run in 
>> their gateways. This component mainly focuses on how to invoke GFAC and 
>> accept requests from the client API.
>> 
>> I have the following features in mind for this component.
>> 
>> 1. It exposes a web service or REST interface, against which we can 
>> implement a client that submits jobs.
> RS >> We need an API method to handle this; the protocol interfacing of the 
> API can be handled separately using Thrift or web services. 
> 
>> 
>> 2. It accepts a job request and parses the input types; if the input types 
>> are correct, it creates an Airavata experiment ID.
> RS >> In my view, we need to save every request to the registry before 
> verification, and record an input-configuration error if the inputs are not 
> correct. That will help us find any API invocation errors. 
> +1, we need to save the request to the registry right away. 
> 
>> 
>> 3. The Orchestrator then stores the job information in the registry against 
>> the generated experiment ID (all the other components identify the job 
>> using this experiment ID).
>> 
>> 4. After that the Orchestrator pulls up all the descriptors related to this 
>> request, does some scheduling to decide where to run the job, and submits 
>> the job to a GFAC node, as in the sketch below (handling multiple GFAC 
>> nodes is going to be a future improvement in the Orchestrator).
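>> 
>> A rough sketch of steps 2-4 (every name below is invented, just to 
>> illustrate the flow I have in mind):
>> 
>> // Illustrative only: validate, store against the experiment ID, then
>> // schedule and submit to a GFAC node.
>> interface Registry {
>>     boolean inputsValid(JobRequest r);      // step 2: parse/check the inputs
>>     String storeExperiment(JobRequest r);   // step 3: save, get experiment ID
>> }
>> interface Scheduler { GFacService pickNode(String experimentId); }
>> interface GFacService { void submitJob(String experimentId); }
>> class JobRequest { String applicationId; java.util.Map<String, String> inputs; }
>> 
>> class OrchestratorFlow {
>>     String handle(JobRequest request, Registry registry, Scheduler scheduler) {
>>         // Step 2: reject the request if the input types are wrong.
>>         if (!registry.inputsValid(request)) {
>>             throw new IllegalArgumentException("input configuration error");
>>         }
>>         // Step 3: store the job information against the experiment ID.
>>         String experimentId = registry.storeExperiment(request);
>>         // Step 4: decide where to run the job and hand it to a GFAC node.
>>         scheduler.pickNode(experimentId).submitJob(experimentId);
>>         return experimentId;
>>     }
>> }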
>> 
>> Pull-based job submission might also be a good way to handle errors: if we 
>> store jobs in the Registry and GFAC pulls jobs and executes them, the 
>> Orchestrator component really doesn't have to worry about error handling.
>> 
>> I did not quite understand what you meant by "pull based job submission". I 
>> believe it means saving the job in the registry, with GFAC periodically 
>> looking for new jobs and submitting them.
>> Yes. 
> RS >> I think the Orchestrator should call GFAC to invoke the job rather 
> than GFAC polling for jobs. The Orchestrator should decide which GFAC 
> instance it submits the job to, and if there is a system error it should 
> bring up or communicate with another instance. I think a pull-based model 
> for GFAC will add overhead, and we would add another point of failure.  
> Can you please explain a bit more what you meant by "another point of 
> failure" and "add an overhead"? 
> 
>> Further, why are you saying you don't need to worry about error handling? 
>> What sort of errors are you considering?
>> I am considering GFAC failures, or the connection between the Orchestrator 
>> and GFAC going down. 
>>  
>> 
>> Because we can implement logic in GFAC so that if a particular job has not 
>> updated its status for a given time, it assumes the job is hung or that the 
>> GFAC node handling that job has failed; another GFAC then pulls that job 
>> (we definitely need a locking mechanism here, to make sure two instances 
>> are not going to execute the hung job) and starts executing it. (Even if a 
>> GFAC is handling a long-running job, it still has to update the job status 
>> frequently, with the same status, to show that the GFAC node is running.)
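>> 
>> Roughly what I have in mind for the status updates (names invented, just a 
>> sketch):
>> 
>> // Sketch: each GFAC instance re-writes the job status on a timer, even if
>> // unchanged, so a dead node shows up as a stale last-update timestamp.
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.ScheduledExecutorService;
>> import java.util.concurrent.TimeUnit;
>> 
>> class StatusHeartbeat {
>>     private final ScheduledExecutorService timer =
>>             Executors.newSingleThreadScheduledExecutor();
>> 
>>     // touchStatus is assumed to refresh a last-update timestamp in the registry.
>>     void start(StatusStore store, String experimentId) {
>>         timer.scheduleAtFixedRate(
>>                 () -> store.touchStatus(experimentId),
>>                 0, 60, TimeUnit.SECONDS);
>>     }
>> }
>> interface StatusStore { void touchStatus(String experimentId); }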
>> 
>> I have some comments/questions on this regard;
>> 
>> 1. How are you going to detect that a job is hung?
>> 
>> 2. We clearly need to distinguish between faulty jobs and faulty GFAC 
>> instances, because a GFAC replica should not pick up a job whose own logic 
>> is what is leading to the hang.
>> I haven't seen a hung-logic situation; maybe there are some. 
>> A GFAC replica should pick up the job only if the primary GFAC instance is 
>> down. 
>> I believe you proposed the locking mechanism to handle this scenario, but I 
>> don't see how a locking mechanism is going to resolve it. Can you explain 
>> more?
>> For example, if GFAC has logic to pick up a job which hasn't responded 
>> within a given time, there could be a scenario where two GFAC instances try 
>> to pick up the same job. Say there are 3 GFAC nodes working and one goes 
>> down while holding a job; the two other nodes recognize this at the same 
>> time and both try to launch the same job. I was talking about locks to fix 
>> this issue.
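>> 
>> The lock could be as simple as an atomic conditional update in the registry 
>> database, so that only one instance wins the claim (sketch only; the table 
>> and column names are invented, and the interval syntax is MySQL-style):
>> 
>> // Sketch: a compare-and-set claim. The database evaluates the WHERE clause
>> // atomically, so when two GFAC nodes race, exactly one UPDATE matches.
>> import java.sql.Connection;
>> import java.sql.PreparedStatement;
>> import java.sql.SQLException;
>> 
>> class StaleJobClaimer {
>>     // Returns true only for the single instance that wins the claim.
>>     boolean tryClaim(Connection db, String experimentId, String myNodeId)
>>             throws SQLException {
>>         String sql = "UPDATE job_status SET owner_node = ? "
>>                    + "WHERE experiment_id = ? "
>>                    + "AND last_update < NOW() - INTERVAL 5 MINUTE";
>>         try (PreparedStatement ps = db.prepareStatement(sql)) {
>>             ps.setString(1, myNodeId);
>>             ps.setString(2, experimentId);
>>             return ps.executeUpdate() == 1;  // 0 means another node claimed it
>>         }
>>     }
>> }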
> RS >> One way to handle this is to look at the job walltime. If the walltime 
> for a running job has expired and we still don't have the status of the job, 
> then we can go ahead and check the status and start cleaning up the job. 
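> 
> Roughly (invented names; assumes the requested walltime is recorded at 
> submission time):
> 
> // Sketch: a job becomes a cleanup candidate only after its requested
> // walltime has elapsed with no status update.
> class WalltimeCheck {
>     static boolean expired(long submittedAtMillis, long walltimeMillis) {
>         return System.currentTimeMillis() > submittedAtMillis + walltimeMillis;
>     }
> }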
> 
>>  
>> 3. According to your description, it seems there is no communication 
>> between a GFAC instance and the Orchestrator, so GFAC and the Orchestrator 
>> exchange data through the registry (database). Performance might drop since 
>> we are going through a persistent medium.
>> Yes, you are correct. I am assuming we are mostly focused on implementing a 
>> more reliable system; most of these jobs run for hours, and we don't need a 
>> high-performance design for a system with long-running jobs. 
> RS >> We need to discuss this. I think the Orchestrator should only maintain 
> the state of the request, not GFAC's state.
> 
>> 
>> 4. What is the strategy to divide jobs among GFAC instances?
>> Not sure; we have to discuss it. 
>> 
>> 5. How do we identify that a GFAC instance has failed?
>> 
>> 6. How should GFAC instances be registered with the Orchestrator?
> RS >> We need to have a mechanism which records how many GFAC instances are 
> running and how many jobs each instance has.  
> If we are going to do the pull-based model, that is going to be a hassle; 
> otherwise the Orchestrator can keep track of it. 
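> 
> Something as simple as this, kept inside the Orchestrator, might be enough 
> (sketch, hypothetical names):
> 
> // Sketch: Orchestrator-side bookkeeping of live GFAC instances and the
> // number of jobs each one is currently running.
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
> import java.util.concurrent.atomic.AtomicInteger;
> 
> class GFacInstanceTracker {
>     private final ConcurrentMap<String, AtomicInteger> jobsPerInstance =
>             new ConcurrentHashMap<>();
> 
>     void register(String instanceId) {
>         jobsPerInstance.putIfAbsent(instanceId, new AtomicInteger(0));
>     }
>     void jobStarted(String instanceId)  { jobsPerInstance.get(instanceId).incrementAndGet(); }
>     void jobFinished(String instanceId) { jobsPerInstance.get(instanceId).decrementAndGet(); }
>     int runningInstances()              { return jobsPerInstance.size(); }
> }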
> 
>> 
>> 7. How are job cancellations handled?
> RS >> Cancelling a single job is simple; we should have an API function to 
> cancel based on the experiment ID and/or the local job ID. 
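> 
> For example (hypothetical signature, not an existing API):
> 
> // Sketch of the cancellation entry point; either identifier should be
> // enough to locate the job in the registry.
> interface JobManagementAPI {
>     // Cancel by Airavata experiment ID and/or the local scheduler job ID.
>     void cancelJob(String experimentId, String localJobId);
> }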
> 
>> 
>> 8. What happens if the Orchestrator goes down?
>> This is under the assumption that the Orchestrator doesn't go down (e.g., 
>> like a head node in MapReduce). 
> RS >> I think registration of the job happens outside the Orchestrator, and 
> the Orchestrator/GFAC progress the states.  
> 
> 
> 
> Regards
> Lahiru
> 
> -- 
> System Analyst Programmer
> PTI Lab
> Indiana University
