[
https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dimuthu Upeksha closed AIRAVATA-2943.
-------------------------------------
Resolution: Fixed
> Re-queueing and node failures in HPC clusters need to be handled in gateway
> middleware as resubmitting failures
> ----------------------------------------------------------------------------------------------------------------
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560
> in Jetstream
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
>
> Currently, on PBS and SLURM clusters, jobs are sometimes re-queued due to
> node failures. In such scenarios the jobs are executed after re-queueing,
> but on the gateway side the job is marked as FAILED at the initial
> NODE_FAIL.
> These types of failures need to be captured as retrying failures instead of
> being treated as the end result.
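
The behavior requested above can be sketched roughly as follows. This is only an illustration, not the actual Airavata fix: the `GatewayJobState` enum, the `next_gateway_state` function, and the choice of which scheduler states count as terminal are all assumptions made for this example. The scheduler state strings are standard SLURM job state names.

```python
from enum import Enum


class GatewayJobState(Enum):
    """Hypothetical gateway-side job states (not Airavata's real model)."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Assumption: scheduler states this sketch treats as final failures.
TERMINAL_FAILURES = {"FAILED", "CANCELLED", "TIMEOUT"}

# Assumption: scheduler states after which the scheduler may re-queue
# the job on its own, so the gateway should keep monitoring.
RETRYING_FAILURES = {"NODE_FAIL"}


def next_gateway_state(scheduler_state: str) -> GatewayJobState:
    """Map a scheduler-reported state to a gateway-side state.

    NODE_FAIL is reported as QUEUED (non-terminal) rather than FAILED,
    so a later RUNNING or COMPLETED update can still be applied.
    """
    if scheduler_state in RETRYING_FAILURES:
        return GatewayJobState.QUEUED
    if scheduler_state in TERMINAL_FAILURES:
        return GatewayJobState.FAILED
    if scheduler_state in {"PENDING", "REQUEUED"}:
        return GatewayJobState.QUEUED
    if scheduler_state == "RUNNING":
        return GatewayJobState.ACTIVE
    if scheduler_state == "COMPLETED":
        return GatewayJobState.COMPLETE
    raise ValueError(f"unrecognized scheduler state: {scheduler_state}")


def is_terminal(state: GatewayJobState) -> bool:
    """Only COMPLETE and FAILED end monitoring in this sketch."""
    return state in {GatewayJobState.COMPLETE, GatewayJobState.FAILED}
```

A real implementation would presumably also bound how long a job may remain in the retrying state before it is finally marked as FAILED; that policy is left out of this sketch.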
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)