[ 
https://issues.apache.org/jira/browse/AIRAVATA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108976#comment-16108976
 ] 

Eroma edited comment on AIRAVATA-2388 at 8/1/17 2:29 PM:
---------------------------------------------------------

1. In this issue we did not receive the job ID in the return of the 
submission step. (There is no wait time for this: when the job is submitted 
we get a response that contains the job ID.) The next step is to run 
qstat/squeue with the job name and try to get the job ID. 
2. Three tries are made with 10-second intervals in each step. When the job 
ID is still not received, the experiment is tagged as FAILED.
3. However, it seems the job was submitted and ran, because the emails on job 
start and end were received from the system.
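
To make the verification step in items 1 and 2 concrete, here is a minimal 
sketch in Python. It is not the GFac implementation. The function names, the 
injectable `lookup` and `sleep` parameters, and the choice of squeue over 
qstat are assumptions made for illustration; only the three tries and the 
10-second interval come from the description above.

```python
import subprocess
import time
from typing import Callable, Optional


def squeue_job_id(job_name: str) -> Optional[str]:
    """Ask SLURM for the ID of a queued or running job with this name."""
    # -h drops the header line; -o "%i" prints only the job ID column.
    result = subprocess.run(
        ["squeue", "-h", f"--name={job_name}", "-o", "%i"],
        capture_output=True, text=True, check=False,
    )
    ids = result.stdout.split()
    return ids[0] if ids else None


def verify_job_id(
    job_name: str,
    lookup: Callable[[str], Optional[str]] = squeue_job_id,
    tries: int = 3,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try `tries` times, `interval_s` seconds apart, to find the job ID."""
    for attempt in range(1, tries + 1):
        job_id = lookup(job_name)
        if job_id:
            return job_id
        if attempt < tries:
            sleep(interval_s)  # wait before the next try
    return None  # the caller would tag the experiment as FAILED here
```

Because squeue only lists jobs that are still pending or running, a job that 
has already finished by the time of the last try would also come back as 
None in this sketch.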

This issue needs to be investigated further.
1. Try to locate a job whose job ID was returned in the verification step (to 
make sure that step works).
2. Try to calculate the time gap between the actual job submission and the 
last verification step that didn't return the job ID. (Since we have the job 
start time from the email, together with the queued time we should be able to 
get a rough estimate.)
3. Review the job submission return message: when the job ID is not returned, 
what does this message contain, and is the content the same every time?
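
The estimate in item 2 is simple date arithmetic, sketched below in Python. 
The function name and the timestamps in the usage example are made up for 
illustration and are not taken from any real job. The sketch assumes the 
queued time is known as a duration.

```python
from datetime import datetime, timedelta


def estimate_submission_gap(
    job_started: datetime,        # job start time, taken from the start email
    queued_for: timedelta,        # how long the job waited in the queue
    last_verification: datetime,  # time of the last try that returned no job ID
) -> timedelta:
    """Rough gap between actual submission and the last verification try."""
    estimated_submission = job_started - queued_for
    return last_verification - estimated_submission


# Hypothetical values only: job started 14:05:00 after 4 minutes in the queue,
# and the last verification try ran at 14:01:30.
gap = estimate_submission_gap(
    datetime(2017, 8, 1, 14, 5, 0),
    timedelta(minutes=4),
    datetime(2017, 8, 1, 14, 1, 30),
)
print(gap.total_seconds())  # 30.0
```

A small or negative gap would suggest the verification ran before the 
scheduler had registered the job, but that reading is only a guess until real 
timestamps are compared.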

Actions to take:
1. Increase the number of verification steps and see whether the job ID is 
returned.
2. Change the current squeue command to sacct on SLURM machines. The new 
command will locate the job even if it has completed and is no longer in the 
queue.
3. If none of the above steps returns a job ID, delete the job. This way the 
SUs won't be used and unread mails will not accumulate in the email system. 
This step is more of a clean-up step.
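
Actions 2 and 3 could look roughly like the Python sketch below. It is an 
illustration only, not the planned Airavata change. The helper names and the 
injectable `run` parameter are assumptions. The sacct and scancel options 
used are standard SLURM flags, but note that sacct by default reports only 
jobs since the start of the current day, so a real lookup may also need a 
start-time option.

```python
import subprocess
from typing import Callable, List, Optional


def run_command(argv: List[str]) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(
        argv, capture_output=True, text=True, check=False
    ).stdout


def sacct_job_id(
    job_name: str, run: Callable[[List[str]], str] = run_command
) -> Optional[str]:
    """Look up a job ID by name in SLURM accounting, even after completion."""
    # -X: one line per job (no job steps); -n: no header; -P: '|'-delimited.
    out = run([
        "sacct", "-X", "-n", "-P",
        f"--name={job_name}",
        "--format=JobID,JobName,State",
    ])
    for line in out.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[1] == job_name:
            return fields[0]
    return None


def find_or_cancel(
    job_name: str, run: Callable[[List[str]], str] = run_command
) -> Optional[str]:
    """Return the job ID if sacct finds it; otherwise cancel by job name."""
    job_id = sacct_job_id(job_name, run)
    if job_id is None:
        # Clean-up step: no job ID is known, so cancel by name instead.
        run(["scancel", f"--name={job_name}"])
    return job_id
```

Cancelling by name only makes sense if job names are unique per experiment, 
which this sketch assumes.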



> Job ID is not returned by the cluster  when airavata check for job ID 
> ----------------------------------------------------------------------
>
>                 Key: AIRAVATA-2388
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2388
>             Project: Airavata
>          Issue Type: Sub-task
>          Components: Airavata Job Monitor, GFac
>    Affects Versions: 0.17
>            Reporter: Eroma
>             Fix For: 0.18
>
>
> When Airavata waited for a job ID, none was returned, but the job was 
> actually submitted and executed on the cluster. The error in the logs looks 
> like [1]. Emails are sent, but since we have already tagged the experiment as 
> failed, Airavata is not monitoring for the emails.
> [1]
> org.apache.airavata.gfac.core.GFacException: Error: userFriendly msg :Error 
> while executing JOB_SUBMISSION task, actual msg :expId: 
> h2o_9d4058c6-219c-4a10-911c-f99f605eba3f, processId: 
> PROCESS_2e82822e-3f30-4981-bd4b-c9d2a7ac355a, taskId: 
> TASK_1c52403c-7966-4f8b-b6d8-290f7516c56e, type: JOB_SUBMISSION :- 
> JOB_SUBMISSION failed. Reason: Couldn't find job id in both submitted and 
> verified steps



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)