[ https://issues.apache.org/jira/browse/AIRAVATA-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lahiru Jayathilake updated AIRAVATA-3893:
-----------------------------------------
    Summary: Support automated HPC Job Re-Submission across Clusters after HPC-Side failures  (was: Support for Automatic Resubmission of Failed Jobs After Successful Submission)

> Support automated HPC Job Re-Submission across Clusters after HPC-Side failures
> -------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-3893
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-3893
>             Project: Airavata
>          Issue Type: Improvement
>          Components: Airavata System
>            Reporter: Lahiru Jayathilake
>            Priority: Major
>
> Currently, the Airavata Metascheduler does not have the capability to
> automatically resubmit jobs to other clusters if the job has been
> successfully submitted but fails during execution (e.g., due to resource
> allocation issues).
>
> This feature request aims to enhance the Metascheduler by introducing the
> ability to handle such job failures more effectively. The Metascheduler
> should automatically attempt to resubmit failed jobs to other configured
> clusters, ensuring more reliable completion of experiments.
>
> This enhancement will improve the system’s robustness in handling transient
> failures or resource constraints across multiple clusters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)