[ 
https://issues.apache.org/jira/browse/AIRAVATA-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Christie updated AIRAVATA-2327:
--------------------------------------
    Description: 
Zhong with the dREG gateway reported an experiment whose status was "stuck" 
in EXECUTING even though the job had status COMPLETED.  It appears that the 
api-orch service on gw56 was shut down at about the same time that the 
orchestrator was handling the COMPLETED process status message.  The process 
status subscriber [automatically acks 
messages|https://github.com/apache/airavata/blob/3f29cfdbd71de18777557713dce58007a3cbc2f5/modules/messaging/core/src/main/java/org/apache/airavata/messaging/core/MessagingFactory.java#L120],
 so the message was removed from the queue and was no longer available when 
the orchestrator was restarted.

In gfac's log, the process completes at 2017-02-17 13:41:01:
{noformat}
2017-02-17 13:41:01 [pool-9-thread-11] INFO o.a.a.g.core.context.ProcessContext 
- expId: Clone_of_2M_data_82c732b8-5bd5-4e24-b1cc-ce3fd480d677, processId: 
PROCESS_3b22553a-b9ed-4250-a1dd-8b555ecede80 :- Process status changed 
OUTPUT_DATA_S
{noformat}

api-orch was shut down and restarted several times around the same time:
{noformat}
2017-02-17 13:37:03 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 13:40:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 13:43:02 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 13:48:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
...
2017-02-17 14:10:58 [main] INFO o.a.a.api.server.AiravataAPIServer - API server 
started over TLS on Port: 9930 ...
{noformat}


A couple of solution ideas:
* make the status queue subscriber acknowledge messages manually, only after 
the status update has been processed, instead of auto-acking on delivery
* have the orchestrator check the process status in the registry for every 
incomplete experiment when it starts up
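
The first idea amounts to switching the subscriber from auto-ack to manual 
acknowledgment, acking only after the status change has been persisted. A 
minimal sketch of why that matters, using an in-memory stand-in for the 
broker queue (the class and method names here are illustrative, not Airavata 
or RabbitMQ APIs):
{noformat}
import java.util.ArrayDeque;
import java.util.Deque;

public class AckDemo {
    // Stand-in for the broker-side queue of process status messages.
    static class StatusQueue {
        private final Deque<String> messages = new ArrayDeque<>();
        void publish(String msg) { messages.add(msg); }

        // Auto-ack: the message is removed the moment it is delivered,
        // so a crash during handling loses it.
        String deliverAutoAck() { return messages.poll(); }

        // Manual ack: the message is only peeked at; it stays queued
        // until the consumer explicitly acks it.
        String deliverManualAck() { return messages.peek(); }
        void ack() { messages.poll(); }

        boolean isEmpty() { return messages.isEmpty(); }
    }

    public static void main(String[] args) {
        // Auto-ack: service shut down mid-handling; message is gone.
        StatusQueue q1 = new StatusQueue();
        q1.publish("PROCESS COMPLETED");
        q1.deliverAutoAck();
        // ... orchestrator shut down here, before persisting the status ...
        System.out.println("auto-ack: lost after crash = " + q1.isEmpty());

        // Manual ack: message survives and is redelivered on restart.
        StatusQueue q2 = new StatusQueue();
        q2.publish("PROCESS COMPLETED");
        q2.deliverManualAck();
        // ... orchestrator shut down here, before calling ack() ...
        String redelivered = q2.deliverManualAck(); // after restart
        q2.ack(); // ack only once the status has been persisted
        System.out.println("manual ack: redelivered = "
                + "PROCESS COMPLETED".equals(redelivered));
    }
}
{noformat}
With RabbitMQ specifically, this would correspond to consuming with 
autoAck=false and calling basicAck only after the handler succeeds, so an 
unacked message is redelivered when the service reconnects.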


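The second idea can be sketched as a startup reconciliation pass. All type 
and method names below are hypothetical stand-ins, not real Airavata classes; 
the point is that for any experiment the registry still considers incomplete, 
the authoritative process status wins over the possibly-stale registry entry:
{noformat}
import java.util.Map;
import java.util.stream.Collectors;

public class StartupReconciler {
    enum Status { CREATED, EXECUTING, COMPLETED }

    // On startup, re-check every experiment the registry still considers
    // incomplete, in case its terminal status message was lost while the
    // orchestrator was down.
    static Map<String, Status> reconcile(Map<String, Status> registryView,
                                         Map<String, Status> actualStatus) {
        return registryView.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e ->
                e.getValue() == Status.COMPLETED
                    ? e.getValue()
                    // Trust the authoritative process status over the
                    // stale registry entry.
                    : actualStatus.getOrDefault(e.getKey(), e.getValue())));
    }

    public static void main(String[] args) {
        Map<String, Status> registry = Map.of(
            "exp-stuck", Status.EXECUTING,   // terminal message was lost
            "exp-done", Status.COMPLETED);
        Map<String, Status> actual = Map.of(
            "exp-stuck", Status.COMPLETED,
            "exp-done", Status.COMPLETED);
        Map<String, Status> fixed = reconcile(registry, actual);
        System.out.println("exp-stuck now: " + fixed.get("exp-stuck"));
    }
}
{noformat}
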


> Process status messages lost by orchestrator
> --------------------------------------------
>
>                 Key: AIRAVATA-2327
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2327
>             Project: Airavata
>          Issue Type: Bug
>          Components: Airavata Orchestrator
>    Affects Versions: 0.17
>            Reporter: Marcus Christie
>            Assignee: Shameera Rathnayaka
>             Fix For: 0.18
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
