Lahiru, I added my comments to the Google Doc. https://docs.google.com/document/d/11fjql09tOiC0NLBaqdhZ9WAiMoBhkBJl7WC1N7DigcU/edit
About the pull model: we don't want to create locking issues at the
database level, since the Orchestrator and GFAC would be monitoring the
same tables. Another problem I can see is the delay the pull model adds
to user job submission: GFAC needs to look for submitted jobs, and a
polling frequency has to be set for checking the database, which delays
handling of user submissions. That's why I like async submission of the
job by the Orchestrator using the GFAC SPI.

Thanks
Raminder

On Dec 9, 2013, at 9:12 AM, Lahiru Gunathilake <[email protected]> wrote:

> Hi Raman,
>
> On Fri, Dec 6, 2013 at 12:34 PM, Raminder Singh <[email protected]> wrote:
> Lahiru: Can you please start a document to record this conversation?
> There are very valuable points to record, and I don't want to lose
> anything in email threads.
>
> My comments are inline with the prefix RS>>:
>
> On Dec 5, 2013, at 10:12 PM, Lahiru Gunathilake <[email protected]> wrote:
>
>> Hi Amila,
>>
>> I have answered the questions you raised, except some of the how-to
>> questions (for those we need to figure out solutions, and before that
>> we need to come up with a good design).
>>
>> On Thu, Dec 5, 2013 at 7:58 PM, Amila Jayasekara <[email protected]>
>> wrote:
>>
>> On Thu, Dec 5, 2013 at 2:34 PM, Lahiru Gunathilake <[email protected]> wrote:
>> Hi All,
>>
>> We are thinking of implementing an Airavata Orchestrator component to
>> replace the WorkflowInterpreter, so that gateway developers don't have
>> to deal with workflows when they simply have a single independent job
>> to run in their gateways. This component mainly focuses on how to
>> invoke GFAC and accept requests from the client API.
>>
>> I have the following features in mind for this component:
>>
>> 1. It provides a web service or REST interface for which we can
>> implement a client to submit jobs.
> RS>> We need an API method to handle this; protocol interfacing of the
> API can be handled separately using Thrift or web services.
>
>> 2. It accepts a job request and parses the input types, and if the
>> input types are correct, it creates an Airavata experiment ID.
> RS>> In my view, we need to save every request to the registry before
> verification, and record an input configuration error if the inputs are
> not correct. That will help us find any API invocation errors.
> +1, we need to save the request to the registry right away.
>
>> 3. The Orchestrator then stores the job information in the registry
>> against the generated experiment ID (all the other components identify
>> the job using this experiment ID).
>>
>> 4. After that, the Orchestrator pulls up all the descriptors related
>> to the request, does some scheduling to decide where to run the job,
>> and submits the job to a GFAC node (handling multiple GFAC nodes is
>> going to be a future improvement in the Orchestrator).
>>
>> If we are trying to do pull-based job submission, it might be a good
>> way to handle errors: if we store jobs in the registry and GFAC pulls
>> jobs and executes them, the Orchestrator component really doesn't have
>> to worry about error handling.
>>
>> I did not quite understand what you meant by "pull based job
>> submission". I believe it is saving the job in the registry and GFAC
>> periodically looking for new jobs and submitting them.
>> Yes.
> RS>> I think the Orchestrator should call GFAC to invoke the job rather
> than GFAC polling for jobs. The Orchestrator should decide which
> instance of GFAC it submits the job to, and if there is a system error,
> bring up or communicate with another instance. I think a pull-based
> model for GFAC will add overhead, and we will be adding another point
> of failure.
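To make the push model above concrete, here is a minimal sketch of async
submission through a GFAC SPI. All names below (GFacService, submitJob,
GFacException) are hypothetical stand-ins, not the actual Airavata
interfaces:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical SPI the Orchestrator would program against.
    interface GFacService {
        void submitJob(String experimentId) throws GFacException;
    }

    class GFacException extends Exception {
        GFacException(String message) { super(message); }
    }

    class Orchestrator {
        private final GFacService gfac;
        // Submissions run on a worker pool so the client API call that
        // created the experiment can return immediately.
        private final ExecutorService pool = Executors.newFixedThreadPool(10);

        Orchestrator(GFacService gfac) { this.gfac = gfac; }

        // Called after the request is validated and saved to the registry.
        void submitAsync(String experimentId) {
            pool.submit(() -> {
                try {
                    gfac.submitJob(experimentId);
                } catch (GFacException e) {
                    // Sketch only: a real implementation would record the
                    // failure against the experiment ID in the registry.
                    System.err.println("Submission failed for " + experimentId);
                }
            });
        }
    }

The executor is what makes the submission asynchronous: no polling
frequency is involved, and failures surface against the experiment ID
rather than as a delay visible to the caller.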
> Can you please explain a bit more what you meant by "another point of
> failure" and "add an overhead"?
>
>> Further, why are you saying you don't need to worry about error
>> handling? What sort of errors are you considering?
>> I am considering GFAC failures, or the connection between the
>> Orchestrator and GFAC going down.
>>
>> Because we can implement logic in GFAC so that if a particular job has
>> not updated its status for a given time, GFAC assumes the job is hung
>> or the GFAC node handling it has failed, and another instance pulls
>> that job (we definitely need a locking mechanism here, to make sure
>> two instances do not both execute the hung job) and starts executing
>> it. (If GFAC is handling a long-running job, it still has to update
>> the job status frequently, even with the same value, to show that the
>> GFAC node is alive.)
>>
>> I have some comments/questions in this regard:
>>
>> 1. How are you going to detect that a job is hung?
>>
>> 2. We clearly need to distinguish between faulty jobs and faulty GFAC
>> instances, because a GFAC replica should not pick up a job whose own
>> logic is leading to the hang.
>> I haven't seen a hung-logic situation; maybe they exist.
>> A GFAC replica should pick up the job only if the primary GFAC
>> instance is down. I believe you proposed the locking mechanism to
>> handle this scenario, but I don't see how a locking mechanism is going
>> to resolve it. Can you explain more?
>> For example, if GFAC has logic for picking up a job that didn't
>> respond within a given time, there could be a scenario where two GFAC
>> instances try to pick up the same job. Ex: three GFAC nodes are
>> working and one goes down with a given job, and the two other nodes
>> recognize this at the same time and try to launch the same job. I was
>> talking about locks to fix this issue.
> RS>> One way to handle this is to look at the job walltime. If the
> walltime for a running job has expired and we still don't have the
> status of the job, then we can go ahead, check the status, and start
> cleaning up the job.
>
>> 3. According to your description, it seems there is no communication
>> between a GFAC instance and the Orchestrator, so GFAC and the
>> Orchestrator exchange data through the registry (database).
>> Performance might drop since we are going through a persistent medium.
>> Yes, you are correct. I am assuming we are mostly focusing on
>> implementing a more reliable system; most of these jobs run for hours,
>> and we don't need a high-performance system for long-running jobs.
> RS>> We need to discuss this. I think the Orchestrator should only
> maintain the state of the request, not GFAC.
>
>> 4. What is the strategy to divide jobs among GFAC instances?
>> Not sure; we have to discuss it.
>>
>> 5. How do we identify that a GFAC instance has failed?
>>
>> 6. How should GFAC instances be registered with the Orchestrator?
> RS>> We need a mechanism that records how many GFAC instances are
> running and how many jobs each instance has.
> If we are going to do the pull-based model that is going to be a
> hassle; otherwise the Orchestrator can keep track of it.
>
>> 7. How are job cancellations handled?
> RS>> Cancelling a single job is simple; we should have an API function
> to cancel based on the experiment ID and/or the local job ID.
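On the locking question and the walltime check above: a registry-level
compare-and-set is one way to guarantee that only one GFAC instance can
claim a hung job, even if several notice it at the same time. A rough
sketch, assuming a hypothetical job table with experiment_id,
owner_gfac, and last_heartbeat columns (not the real Airavata schema):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.time.Duration;
    import java.time.Instant;

    class StaleJobClaimer {
        // Claim a job whose owner has not heartbeated within the
        // walltime. The WHERE clause makes the claim a compare-and-set:
        // if another instance updated owner_gfac first, zero rows match
        // and this instance backs off.
        static boolean tryClaim(Connection db, String experimentId,
                                String previousOwner, String myGfacId,
                                Duration walltime) throws SQLException {
            Timestamp cutoff = Timestamp.from(Instant.now().minus(walltime));
            String sql = "UPDATE job SET owner_gfac = ? "
                       + "WHERE experiment_id = ? AND owner_gfac = ? "
                       + "AND last_heartbeat < ?";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setString(1, myGfacId);
                ps.setString(2, experimentId);
                ps.setString(3, previousOwner);
                ps.setTimestamp(4, cutoff);
                return ps.executeUpdate() == 1; // exactly one claimer wins
            }
        }
    }

Because the ownership check and the ownership change happen in a single
UPDATE statement, two instances racing on the same job cannot both see
executeUpdate() return 1.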
>> 8. What happens if the Orchestrator goes down?
>> This is under the assumption that the Orchestrator doesn't go down
>> (e.g., like a head node in MapReduce).
> RS>> I think registration of the job happens outside the Orchestrator,
> and the Orchestrator/GFAC progress the states.
>
> Regards
> Lahiru
>
> --
> System Analyst Programmer
> PTI Lab
> Indiana University
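A rough sketch of the "Orchestrator/GFAC progress the states" idea
discussed above: the registry holds one authoritative status per
experiment, and each component only moves it forward. The state names
are illustrative, not the actual Airavata state model:

    enum ExperimentState {
        CREATED,    // request saved to the registry, experiment ID issued
        VALIDATED,  // inputs checked by the Orchestrator
        LAUNCHED,   // handed to a GFAC instance
        EXECUTING,  // GFAC is heartbeating status updates
        COMPLETED,
        FAILED,
        CANCELED;

        // Simplified guard: states only move forward, so a replayed or
        // stale update can never drag an experiment backwards. A real
        // implementation would enforce this inside the registry update.
        boolean canMoveTo(ExperimentState next) {
            return next.ordinal() > this.ordinal();
        }
    }

Keeping the authoritative state in the registry is what lets a job
survive an Orchestrator restart: registration happened outside the
Orchestrator, and any component can resume progressing the state from
wherever it stopped.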
