[jira] [Created] (AIRAVATA-3872) Computing resource node failure and job re-queue handling

Eroma (Jira) Tue, 27 Feb 2024 00:45:31 -0800

Eroma created AIRAVATA-3872:
-------------------------------

             Summary: Computing resource node failure and job re-queue handling
                 Key: AIRAVATA-3872
                 URL: https://issues.apache.org/jira/browse/AIRAVATA-3872
             Project: Airavata
          Issue Type: Improvement
          Components: helix implementation
         Environment: https://django.ultrascan.scigap.org/
            Reporter: Eroma
            Assignee: Dimuthu



This issue occurs from time to time, this time in the production Ultrascan 
gateway, 
[https://django.ultrascan.scigap.org/|https://django.ultrascan.scigap.org/]. 
This gateway is connected to the production stack and Django portal for admin 
operations.

When a node failure happens after a job has been submitted and queued, the 
failure is reported through an email notification and the job goes to the 
UNKNOWN state in the gateway. On the remote cluster, the job is re-queued and 
completes, and further email notifications are sent. However, Helix treats 
UNKNOWN as a final job state and does not process the emails sent afterwards.
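
The behaviour described above can be illustrated with a minimal sketch. It is 
not Airavata or Helix code: the state names other than UNKNOWN, the two sets of 
final states and the function apply_email_updates are hypothetical and exist 
only to show why later emails are dropped, and how the outcome would differ if 
UNKNOWN were not treated as final.

```python
from enum import Enum


class JobState(Enum):
    # Illustrative state names; only UNKNOWN is taken from this issue.
    SUBMITTED = "SUBMITTED"
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    UNKNOWN = "UNKNOWN"


# Behaviour reported in this issue: UNKNOWN is treated as a final state.
FINAL_STATES_REPORTED = {
    JobState.COMPLETE,
    JobState.FAILED,
    JobState.CANCELED,
    JobState.UNKNOWN,
}

# Hypothetical alternative: UNKNOWN stays open so later emails still apply.
FINAL_STATES_PROPOSED = FINAL_STATES_REPORTED - {JobState.UNKNOWN}


def apply_email_updates(updates, final_states):
    """Replay job states parsed from email notifications in arrival order.

    Once the current state is in final_states, later updates are ignored.
    """
    current = JobState.SUBMITTED
    for state in updates:
        if current in final_states:
            break  # job already considered finished; drop remaining emails
        current = state
    return current


# Node failure reported (UNKNOWN), then the cluster re-queues and finishes the job.
notifications = [
    JobState.QUEUED,
    JobState.UNKNOWN,
    JobState.QUEUED,
    JobState.ACTIVE,
    JobState.COMPLETE,
]

reported = apply_email_updates(notifications, FINAL_STATES_REPORTED)
proposed = apply_email_updates(notifications, FINAL_STATES_PROPOSED)
print(reported.value, proposed.value)
```

With UNKNOWN in the set of final states, the job stays in UNKNOWN even though 
the cluster completed it. With UNKNOWN left out, the same notifications end in 
COMPLETE. Whether UNKNOWN should be non-final in every case, or only for a 
limited time after a node failure, is left open here.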

Currently, when this happens, an operational task takes care of updating the 
job status and processing the email notifications sent.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
