Thanks Lahiru. 

I will give this a try and test for different cases. 

Raminder

On Aug 19, 2014, at 5:42 AM, Lahiru Gunathilake <[email protected]> wrote:

> Hi All,
> 
> I have committed the initial version of Experiment cancellation.
> 
> Experiment cancel is an Airavata API method which can be invoked by the 
> Airavata client. The request reaches GFac Provider-level cancellation only 
> if the job has already been submitted to the computing resource; otherwise 
> it is handled by the orchestrator.
> 
> If a cancel request arrives for an Experiment that is already completed, 
> failed or cancelling, the cancel operation fails and an error is thrown to 
> the client.
> 
> If the job is marked cancelled successfully, the experiment launch execution 
> (the launchExperiment operation, which runs in a separate thread) will be 
> stopped before the next plugin invocation. Ex: if GFac is running Handler1 
> when the cancel arrives, the launch stops before the next plugin is invoked. 
> Limitation: if an Input Handler has 500 files to transfer (and is currently 
> transferring file number 100) when the user cancels the experiment, the rest 
> of the files will still be transferred, and the original execution is 
> cancelled only before the next plugin. (If we want to download partial 
> outputs we have to modify this logic.) Either the GFac framework handles the 
> cancel (that's what we have now), or the framework just tries to execute all 
> the plugins and each plugin implementation listens for a cancellation of 
> that particular execution and acts accordingly. 
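
The framework-level check described above can be sketched roughly as follows. This is a minimal illustration in Python (Airavata itself is Java); `run_handlers`, `LaunchCancelled` and the `threading.Event` cancel flag are hypothetical names chosen for the example, not GFac classes.

```python
import threading


class LaunchCancelled(Exception):
    """Raised when a launch stops because a cancel was requested (hypothetical)."""


def run_handlers(handlers, cancel_requested):
    """Run handlers in order, checking the cancel flag before each one.

    A handler that has already started always runs to completion, which
    mirrors the limitation in the message: a long file transfer is not
    interrupted, and the launch only stops before the next handler.
    """
    completed = []
    for handler in handlers:
        if cancel_requested.is_set():
            raise LaunchCancelled(
                f"launch stopped after {len(completed)} handler(s)"
            )
        handler()
        completed.append(handler)
    return completed
```

The alternative mentioned in the message (each plugin listening for cancellation itself) would amount to passing the same flag into every handler and letting a long-running handler check it between files.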
> 
> If the job is already submitted and GFac is monitoring the job, it will be 
> cancelled by invoking the provider's cancel operation. Experiment statuses, 
> Task statuses and Job statuses will be updated properly, and monitoring will 
> be stopped for jobs once the monitoring results report a terminating Job 
> status.
> 
> When there are multiple GFac instances, the original experiment launch 
> request can go to GFac Node1 (a separate JVM), and the cancel request doesn't 
> have to go to the same GFac node. The orchestrator will handle this scenario 
> and make the job cancel request succeed, and the experiment launch will be 
> stopped as explained above.
> 
> During a GFac node failure there could be job launches and job cancel 
> executions in progress in that instance. The orchestrator will route both 
> types of requests to an available GFac node and recover the executions.
> 
> I have a known issue to fix: when I run the cancel operation, GFac-level 
> authentication sometimes fails. I will try to find out what is happening. 
> The problem comes up from time to time, and I am not sure whether it is 
> related to the cancel feature or has something to do with trestles.
> 
> Regards
> Lahiru
> 
> 
> 
> 
> On Mon, Aug 18, 2014 at 7:13 PM, Lahiru Gunathilake <[email protected]> wrote:
> Hi Marlon,
> 
> I should be able to wrap-up later today or early tomorrow. 
> 
> Regards
> Lahiru
> 
> 
> On Mon, Aug 18, 2014 at 7:01 PM, Marlon Pierce <[email protected]> wrote:
> How goes the implementation?
> 
> Marlon
> 
> 
> On 8/13/14, 11:09 PM, Lahiru Gunathilake wrote:
> Thank you very much for all the inputs! I will take these into
> consideration.
> 
> Regards
> Lahiru
> 
> 
> On Wed, Aug 13, 2014 at 10:31 PM, Miller, Mark <[email protected]> wrote:
> 
>   If I understand this correctly, I want to offer some input from our
> experience with CIPRES.
> 
> Currently, if a CIPRES user wishes to cancel a job, they must delete the
> entire job, and therefore they lose all ability to view the input and other
> files used.
> 
> This is not an ideal solution.
> 
> 
> 
> There is value to the user in being able to see partially completed
> results, or even the input files they used.
> 
> 
> 
> So I would vote for making partial output of the job available as an
> option.
> 
> Any additional information you can provide about status would be useful,
> especially for folks who are debugging failures.
> 
> 
> 
> Just my 2c.
> 
> 
> 
> Mark
> 
> 
> 
> *From:* Eroma Abeysinghe [mailto:[email protected]]
> *Sent:* Wednesday, August 13, 2014 7:04 AM
> *To:* [email protected]
> *Subject:* Re: Experiment Cancellation
> 
> 
> 
> 
> My questions and thoughts on Experiment cancellation
> 1. What are we going to do with the output or partial output of the job at
> the time of cancelling?
>      Are we going to discard it or make it available for the experiment? Are
> we keeping all the job information and messages for CANCELLED jobs, or
> discarding them as well?
> 
> 2. Are we going to allow editing for CANCELLED or CANCELLING experiments?
> IMO we should not, because editing is only required if the experiment is
> going to be re-launched.
> 
> 3. With the existing experiment and job states we need to decide which can
> be CANCELLED.
> Of the Airavata Experiment states, cancellation should be allowed for:
> CREATED
> VALIDATED
> SCHEDULED
> LAUNCHED
> EXECUTING
> Cancellation should be communicated to resources if the job state is:
> SUBMITTED
> SETUP
> QUEUED
> ACTIVE
> HELD
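
The two lists above translate directly into a pair of lookups. The sketch below is Python for brevity (Airavata itself is Java); the state names are copied from the lists, while the constant and function names are hypothetical.

```python
# Experiment states for which a cancel request should be accepted
# (copied from the proposal above).
CANCELLABLE_EXPERIMENT_STATES = frozenset(
    {"CREATED", "VALIDATED", "SCHEDULED", "LAUNCHED", "EXECUTING"}
)

# Job states for which the cancel must also be sent to the computing
# resource (copied from the proposal above).
RESOURCE_CANCEL_JOB_STATES = frozenset(
    {"SUBMITTED", "SETUP", "QUEUED", "ACTIVE", "HELD"}
)


def can_cancel(experiment_state):
    """Return True if an experiment in this state may be cancelled."""
    return experiment_state in CANCELLABLE_EXPERIMENT_STATES


def must_notify_resource(job_state):
    """Return True if the cancel has to reach the computing resource.

    job_state may be None when no job has been submitted yet.
    """
    return job_state in RESOURCE_CANCEL_JOB_STATES
```

Under this proposal a COMPLETED, FAILED or CANCELLING experiment fails the first check, and an experiment with no submitted job passes the first check but not the second.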
> 
> 
> There is a SUSPENDED state in both experiment and job, but is this state
> currently in use?
> 
> 4. Cloning will be available for CANCELLED and CANCELLING experiments.
> 
> 5. In the Experiment Summary we should display any errors that took place
> in the cancelling process.
> 
> 
> 
> 
> 
> On Wed, Aug 13, 2014 at 9:01 AM, Marlon Pierce <[email protected]> wrote:
> 
> There is an advantage for task (or job) state to capture the information
> that really comes from the machine (completed, cancelled, failed, etc), and
> for experiment state to be set to canceled by Airavata.  That is, there
> should be parts of Airavata that capture machine-specific state information
> about the job for logging/auditing purposes.
> 
> * Airavata issues "cancel" command to job in "launched" or "executing"
> state.
> 
> * Airavata confirms that the job has left the queue or is no longer
> executing. This could be machine-specific, but the main question is "has
> the job left the queue?" or "is the job no longer in executing state?"  I
> don't think it is "if this is trestles, and since we issued a qdel command,
> is the job marked as completed; or if this is stampede, is the job now
> marked as failed?"
> 
> * If the job cancel works, Airavata marks this as canceled.
> 
> * If cancel fails for some reason, don't change the Experiment state but
> throw an error.
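
The four steps above can be sketched as a single function. This is an illustration in Python (Airavata itself is Java), and everything in it is an assumption made for the example: the dict-based experiment record, the `issue_cancel` and `poll_job_state` callables, the `CancelFailed` exception, and the contents of `ACTIVE_JOB_STATES`.

```python
class CancelFailed(Exception):
    """Raised when the cancel could not be confirmed (hypothetical)."""


# Hypothetical set of job states meaning "still queued or executing".
ACTIVE_JOB_STATES = frozenset({"SUBMITTED", "QUEUED", "ACTIVE", "HELD"})


def cancel_experiment(experiment, issue_cancel, poll_job_state):
    """Cancel an experiment and confirm the job has left the queue.

    experiment: dict with 'experiment_state' and 'job_state' keys.
    issue_cancel: callable that sends the cancel command to the resource.
    poll_job_state: callable returning the state reported by the resource.
    """
    if experiment["experiment_state"] not in ("LAUNCHED", "EXECUTING"):
        raise CancelFailed("experiment is not launched or executing")

    issue_cancel()

    # Record whatever the machine reports (completed, cancelled, failed...)
    # so the job state stays faithful to the resource for auditing.
    reported = poll_job_state()
    experiment["job_state"] = reported

    if reported in ACTIVE_JOB_STATES:
        # Cancel did not take effect: leave the experiment state untouched.
        raise CancelFailed(f"job is still {reported}")

    # Airavata, not the machine, decides the experiment is cancelled.
    experiment["experiment_state"] = "CANCELLED"
    return experiment
```

The check only asks whether the job is still queued or executing, so a resource that reports a cancelled job as completed and one that reports it as failed are handled the same way.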
> 
> 
> Marlon
> 
> 
> 
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
> 
> Hi All,
> 
> I have a few concerns about experiment cancellation. When we want to cancel
> an experiment we have to run a particular command on the computing
> resource. Different computing resources report the job status of cancelled
> jobs in different ways. Ex: trestles shows cancelled jobs as completed, some
> other machines show them as cancelled, and some might show them as failed.
> 
> I think we should replicate this information in the JobDetails object as
> the Job status and make sure the Experiment and Task statuses are set to
> cancelled. The other approach is that when we cancel, we explicitly set all
> the states in the experiment model (experiment, task and job states) to
> cancelled and manually handle the state we get from the computing
> resource.
> 
> My concern: should we really hide the information shown by the computing
> resource from the Job status we are storing in the registry? Or leave
> it as it is and use the other statuses to represent the cancelled
> experiments? If we set everything to cancelled there will be an
> inconsistency in the JobStatus.
> 
> WDYT ?
> 
> Lahiru
> 
> 
> 
> 
> 
> 
> --
> 
> Thank You,
> 
> Best Regards,
> 
> Eroma
> 
> 
> 
> 
> 
> 
> 
> -- 
> System Analyst Programmer
> PTI Lab
> Indiana University
