Marcus Christie created AIRAVATA-2327:
-----------------------------------------

             Summary: Process status messages lost by orchestrator
                 Key: AIRAVATA-2327
                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2327
             Project: Airavata
          Issue Type: Bug
          Components: Airavata Orchestrator
    Affects Versions: 0.17
            Reporter: Marcus Christie
            Assignee: Shameera Rathnayaka
             Fix For: 0.18


Zhong with the dREG gateway reported an experiment whose status was "stuck" 
in EXECUTING even though the job had status COMPLETED.  It looks like the 
api-orch service on gw56 was shut down, probably at the same time that the 
orchestrator was handling the COMPLETED process status message.  The 
process status subscriber [automatically acks 
messages|https://github.com/apache/airavata/blob/3f29cfdbd71de18777557713dce58007a3cbc2f5/modules/messaging/core/src/main/java/org/apache/airavata/messaging/core/MessagingFactory.java#L120],
 so the message was removed from the queue and was no longer available when 
the orchestrator was restarted.
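
To illustrate the failure mode described above, here is a minimal, self-contained sketch of the difference between auto-ack and manual-ack delivery. It is not Airavata or RabbitMQ code: {{ToyQueue}}, {{deliver}} and {{pendingAfterCrash}} are hypothetical names invented for this example, and the "crash" is simulated with an exception thrown by the handler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Toy in-memory model of auto-ack vs. manual-ack delivery (illustration only). */
public class AckDemo {

    /** Hypothetical stand-in for a broker queue; not a real messaging API. */
    static class ToyQueue {
        private final Deque<String> ready = new ArrayDeque<>();

        void publish(String msg) {
            ready.addLast(msg);
        }

        int pending() {
            return ready.size();
        }

        /**
         * Delivers the head message to the handler.
         * autoAck == true:  the message is removed before the handler runs.
         * autoAck == false: the message is removed only after the handler
         *                   returns normally (the "manual ack").
         */
        void deliver(boolean autoAck, Consumer<String> handler) {
            String msg = ready.peekFirst();
            if (msg == null) {
                return;
            }
            if (autoAck) {
                ready.removeFirst();
                handler.accept(msg); // a crash here loses the message
            } else {
                handler.accept(msg); // a crash here leaves the message queued
                ready.removeFirst();
            }
        }
    }

    /** Publishes one status message, simulates a crash while handling it,
     *  and returns how many messages are still available for redelivery. */
    static int pendingAfterCrash(boolean autoAck) {
        ToyQueue queue = new ToyQueue();
        queue.publish("COMPLETED");
        try {
            queue.deliver(autoAck, msg -> {
                throw new IllegalStateException("service shut down mid-handling");
            });
        } catch (IllegalStateException simulatedCrash) {
            // The consumer died; a restarted consumer only sees what is still queued.
        }
        return queue.pending();
    }

    public static void main(String[] args) {
        System.out.println("auto-ack, pending after crash:   " + pendingAfterCrash(true));
        System.out.println("manual-ack, pending after crash: " + pendingAfterCrash(false));
    }
}
```

With auto-ack nothing is left to redeliver after the simulated crash, which matches the "stuck" EXECUTING status seen here; with manual ack the COMPLETED message is still queued for the restarted consumer.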

In gfac's log, the process completes at 2017-02-17 13:41:01:
{noformat}
2017-02-17 13:41:01 [pool-9-thread-11] INFO o.a.a.g.core.context.ProcessContext 
- expId: Clone_of_2M_data_82c732b8-5bd5-4e24-b1cc-ce3fd480d677, processId: 
PROCESS_3b22553a-b9ed-4250-a1dd-8b555ecede80 :- Process status changed 
OUTPUT_DATA_S
{noformat}

api-orch was shut down and restarted several times around the same time:
{noformat}
2017-02-17 13:37:03 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 13:40:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 13:43:02 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 13:48:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 14:10:58 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
{noformat}


A couple of solution ideas:
* make the process status queue subscriber acknowledge messages explicitly, 
only after the orchestrator has finished handling them, instead of 
auto-acking them
* have the orchestrator check the process status in the registry for every 
incomplete experiment when it starts up
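
The second idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the registry is modelled as two plain maps keyed by experiment id, each experiment is assumed to have a single process, and {{StartupReconciler}}, {{reconcile}} and the set of terminal states are hypothetical names, not Airavata's registry API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a startup check that repairs experiments whose status update was lost. */
public class StartupReconciler {

    // Assumed terminal states for this illustration only.
    private static final Set<String> TERMINAL =
            new HashSet<>(Arrays.asList("COMPLETED", "FAILED", "CANCELED"));

    /**
     * For every experiment that is not in a terminal state, look up its
     * process status; if the process has already reached a terminal state,
     * copy that state to the experiment.
     *
     * @param experimentStatus experiment id -> experiment status (updated in place)
     * @param processStatus    experiment id -> status of its (single) process
     * @return ids of the experiments whose status was repaired
     */
    static List<String> reconcile(Map<String, String> experimentStatus,
                                  Map<String, String> processStatus) {
        List<String> repaired = new ArrayList<>();
        for (Map.Entry<String, String> entry : experimentStatus.entrySet()) {
            if (TERMINAL.contains(entry.getValue())) {
                continue; // experiment already finished, nothing to do
            }
            String process = processStatus.get(entry.getKey());
            if (process != null && TERMINAL.contains(process)) {
                entry.setValue(process); // apply the status update that was missed
                repaired.add(entry.getKey());
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        Map<String, String> experiments = new LinkedHashMap<>();
        experiments.put("exp-1", "EXECUTING"); // process finished, update was lost
        experiments.put("exp-2", "EXECUTING"); // process really still running
        experiments.put("exp-3", "COMPLETED");

        Map<String, String> processes = new LinkedHashMap<>();
        processes.put("exp-1", "COMPLETED");
        processes.put("exp-2", "EXECUTING");
        processes.put("exp-3", "COMPLETED");

        System.out.println("repaired: " + reconcile(experiments, processes));
        System.out.println("experiments: " + experiments);
    }
}
```

A check like this would run once when the orchestrator starts, so it complements rather than replaces explicit acknowledgement: it repairs state after the fact instead of preventing the message from being dropped.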





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
