Eroma created AIRAVATA-2943:
-------------------------------
Summary: Re-queueing and node failures in HPC clusters need to be
handled in gateway middleware as resubmitting failures
Key: AIRAVATA-2943
URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
Project: Airavata
Issue Type: Bug
Components: helix implementation
Affects Versions: 0.18
Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in
Jetstream
Reporter: Eroma
Assignee: Dimuthu Upeksha
Fix For: 0.18
Currently, on clusters (PBS and SLURM), jobs are sometimes re-queued due to
node failures. In such scenarios the jobs are executed after re-queueing,
but on the gateway side the job is marked as FAILED at the initial NODE_FAIL.
These types of failures need to be captured as retryable failures instead of
being treated as the final result.
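
The requested behavior can be sketched roughly as follows. This is only an
illustration and not Airavata's actual helix implementation: the
GatewayJobState enum, the RETRYING state, and the function names are all
hypothetical. It covers SLURM state names only, and it assumes NODE_FAIL and
REQUEUED are the states that may be followed by a re-queue.

```python
from enum import Enum


class GatewayJobState(Enum):
    # Hypothetical gateway-side states; not Airavata's real JobState enum.
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    RETRYING = "RETRYING"  # assumed non-terminal "will be re-run" state


# Assumption: SLURM states after which the scheduler may re-queue the job.
TRANSIENT_SLURM_STATES = {"NODE_FAIL", "REQUEUED"}

# Straightforward mapping for the remaining states used in this sketch.
SLURM_TO_GATEWAY = {
    "PENDING": GatewayJobState.QUEUED,
    "RUNNING": GatewayJobState.ACTIVE,
    "COMPLETED": GatewayJobState.COMPLETE,
    "FAILED": GatewayJobState.FAILED,
}

TERMINAL_STATES = {GatewayJobState.COMPLETE, GatewayJobState.FAILED}


def map_slurm_state(slurm_state: str) -> GatewayJobState:
    """Map a SLURM state to a gateway state, treating node failures as retryable."""
    if slurm_state in TRANSIENT_SLURM_STATES:
        return GatewayJobState.RETRYING
    if slurm_state not in SLURM_TO_GATEWAY:
        raise ValueError(f"state not covered by this sketch: {slurm_state}")
    return SLURM_TO_GATEWAY[slurm_state]


def replay(slurm_states):
    """Replay a sequence of scheduler updates; stop only at a terminal state."""
    state = None
    for slurm_state in slurm_states:
        state = map_slurm_state(slurm_state)
        if state in TERMINAL_STATES:
            break
    return state
```

With this mapping, a job that hits NODE_FAIL and is later re-run stays open on
the gateway side until the scheduler reports a real terminal state, for
example replay(["PENDING", "RUNNING", "NODE_FAIL", "PENDING", "RUNNING",
"COMPLETED"]) ends in COMPLETE rather than FAILED.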
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)