Eroma created AIRAVATA-2943:
-------------------------------

             Summary: Re-queueing and node failures in HPC clusters need to be 
handled in gateway middleware as resubmitting failures 
                 Key: AIRAVATA-2943
                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
             Project: Airavata
          Issue Type: Bug
          Components: helix implementation
    Affects Versions: 0.18
         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in 
Jetstream
            Reporter: Eroma
            Assignee: Dimuthu Upeksha
             Fix For: 0.18


Currently, jobs on PBS and SLURM clusters are sometimes re-queued by the 
scheduler after a node failure. In such scenarios the job is executed again 
after re-queueing, but the gateway marks it as FAILED at the initial NODE_FAIL. 

These types of failures need to be captured as failures that are being 
retried, instead of being taken as the final result of the job.
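
The requested behaviour could be sketched roughly as follows. This is only an 
illustration, not the Airavata implementation: GatewayState, to_gateway_state 
and final_state are hypothetical names, and the sketch assumes the cluster is 
configured to requeue a job after a node failure. The scheduler state names 
(PENDING, RUNNING, NODE_FAIL, REQUEUED, COMPLETED, FAILED, CANCELLED, TIMEOUT) 
are SLURM job states.

```python
from enum import Enum


class GatewayState(Enum):
    # Hypothetical gateway-side job states, for illustration only.
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Scheduler states after which the job may still run again.
# Assumption: the cluster requeues jobs on node failure.
RETRYABLE = {"NODE_FAIL", "REQUEUED"}

# Scheduler states that do not end the job.
NON_TERMINAL = {
    "PENDING": GatewayState.QUEUED,
    "RUNNING": GatewayState.ACTIVE,
}

# Scheduler states that are the real end result of the job.
TERMINAL = {
    "COMPLETED": GatewayState.COMPLETE,
    "FAILED": GatewayState.FAILED,
    "CANCELLED": GatewayState.FAILED,
    "TIMEOUT": GatewayState.FAILED,
}


def to_gateway_state(scheduler_state):
    """Map one scheduler state to a gateway state."""
    if scheduler_state in RETRYABLE:
        # Treat as "being retried": back in the queue, not FAILED.
        return GatewayState.QUEUED
    if scheduler_state in NON_TERMINAL:
        return NON_TERMINAL[scheduler_state]
    if scheduler_state in TERMINAL:
        return TERMINAL[scheduler_state]
    raise ValueError("unknown scheduler state: " + scheduler_state)


def final_state(observed_states):
    """Replay a sequence of scheduler states and return the last gateway state."""
    state = GatewayState.QUEUED
    for scheduler_state in observed_states:
        state = to_gateway_state(scheduler_state)
    return state
```

With this mapping, a job observed as PENDING, RUNNING, NODE_FAIL, PENDING, 
RUNNING, COMPLETED ends up COMPLETE on the gateway side instead of FAILED. A 
real implementation would probably also need a limit on how long to wait for 
the re-queued job before giving up.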



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
