Eroma created AIRAVATA-3872:
-------------------------------
Summary: Computing resource node failure and job re-queue handling
Key: AIRAVATA-3872
URL: https://issues.apache.org/jira/browse/AIRAVATA-3872
Project: Airavata
Issue Type: Improvement
Components: helix implementation
Environment: https://django.ultrascan.scigap.org/
Reporter: Eroma
Assignee: Dimuthu
This issue occurs from time to time; this time it was seen in the production
Ultrascan gateway,
[https://django.ultrascan.scigap.org/|https://django.ultrascan.scigap.org/].
This gateway is connected to the production stack and the Django portal for
admin operations.
When a node failure happens after a job has been submitted and queued, and the
failure is reported through an email notification, the job goes to the UNKNOWN
state in the gateway. On the remote cluster, the job gets re-queued and
completes, and email notifications are sent. Helix treats UNKNOWN as a final
job state and does not process the emails sent afterwards.
Currently, when this happens, an operational task takes care of updating the
job status and processing the email notifications sent.
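
One possible direction is sketched below in Python, purely as an illustration and not as the actual Airavata or Helix code. It assumes the fix is to stop treating UNKNOWN as a final state, so that status emails arriving after a re-queue are still applied. The names JobState, TERMINAL_STATES, apply_status_update and replay are hypothetical, and the list of states is an assumption made for the example.

```python
from enum import Enum


class JobState(Enum):
    # Illustrative set of job states (assumed for this sketch).
    SUBMITTED = "SUBMITTED"
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    UNKNOWN = "UNKNOWN"


# Assumption: UNKNOWN is deliberately left out of the terminal set, so a job
# that became UNKNOWN after a node failure can still receive later updates.
TERMINAL_STATES = frozenset(
    {JobState.COMPLETE, JobState.FAILED, JobState.CANCELED}
)


def apply_status_update(current, reported):
    """Return the job state after a status email reporting `reported` arrives."""
    if current in TERMINAL_STATES:
        # A truly final job ignores any later notifications.
        return current
    # Non-terminal states, including UNKNOWN, accept the newly reported state.
    return reported


def replay(updates, initial=JobState.SUBMITTED):
    """Apply a sequence of reported states in order and return the final state."""
    state = initial
    for reported in updates:
        state = apply_status_update(state, reported)
    return state
```

With this sketch, replaying the scenario described above (QUEUED, UNKNOWN after the node failure, then QUEUED, ACTIVE and COMPLETE after the re-queue) ends in COMPLETE instead of stopping at UNKNOWN, which would remove the need for the manual operational task in this case.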
--
This message was sent by Atlassian Jira
(v8.20.10#820010)