[
https://issues.apache.org/jira/browse/AIRAVATA-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcus Christie updated AIRAVATA-2327:
--------------------------------------
Description:
Zhong with the dREG gateway reported an experiment whose status was "stuck"
in EXECUTING even though the job had status COMPLETED. It looks like the
api-orch service on gw56 was shut down, probably at the same time that
the orchestrator was handling the COMPLETED process status message. The
process status subscriber [automatically acks
messages|https://github.com/apache/airavata/blob/3f29cfdbd71de18777557713dce58007a3cbc2f5/modules/messaging/core/src/main/java/org/apache/airavata/messaging/core/MessagingFactory.java#L120]
so the message was removed from the queue and was not available when the
orchestrator was restarted.
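To make the failure mode concrete, here is a minimal sketch of the difference between auto-ack and explicit ack. It uses a toy in-memory queue, not the RabbitMQ client or Airavata's messaging classes; `AckDemo`, `ToyStatusQueue`, `deliver` and `depthAfterCrash` are hypothetical names, and the consumer "crash" is simulated with an exception.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Toy model only: this is NOT the RabbitMQ client API or Airavata code.
final class AckDemo {

    /** Hypothetical in-memory stand-in for the process status queue. */
    static final class ToyStatusQueue {
        private final Deque<String> ready = new ArrayDeque<>();

        void publish(String message) {
            ready.addLast(message);
        }

        int depth() {
            return ready.size();
        }

        /**
         * Hands the head message to the handler.
         * With autoAck the message is removed before the handler runs, so a
         * crash during handling loses it. Without autoAck it is removed only
         * after the handler returns normally.
         */
        boolean deliver(boolean autoAck, Consumer<String> handler) {
            String message = ready.peekFirst();
            if (message == null) {
                return false;
            }
            if (autoAck) {
                ready.removeFirst();
            }
            try {
                handler.accept(message);
            } catch (RuntimeException crash) {
                return false; // consumer died while handling the message
            }
            if (!autoAck) {
                ready.removeFirst(); // explicit ack after successful handling
            }
            return true;
        }
    }

    /** Returns how many messages remain queued after the consumer crashes. */
    static int depthAfterCrash(boolean autoAck) {
        ToyStatusQueue queue = new ToyStatusQueue();
        queue.publish("process status: COMPLETED");
        queue.deliver(autoAck, m -> {
            throw new IllegalStateException("orchestrator shut down");
        });
        return queue.depth();
    }

    public static void main(String[] args) {
        System.out.println("auto-ack, still queued: " + depthAfterCrash(true));
        System.out.println("explicit ack, still queued: " + depthAfterCrash(false));
    }
}
```

In this model the auto-ack run leaves nothing in the queue after the crash, while the explicit-ack run keeps the message so a restarted consumer could pick it up again.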
In gfac's log, the process completes at 2017-02-17 13:41:01:
{noformat}
2017-02-17 13:41:01 [pool-9-thread-11] INFO o.a.a.g.core.context.ProcessContext
- expId: Clone_of_2M_data_82c732b8-5bd5-4e24-b1cc-ce3fd480d677, processId:
PROCESS_3b22553a-b9ed-4250-a1dd-8b555ecede80 :- Process status changed
OUTPUT_DATA_S
{noformat}
api-orch was shut down and restarted several times around the same time:
{noformat}
2017-02-17 13:37:03 [main] INFO o.a.a.api.server.AiravataAPIServer - API server
started over TLS on Port: 9930 ...
...
2017-02-17 13:40:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server
started over TLS on Port: 9930 ...
...
2017-02-17 13:43:02 [main] INFO o.a.a.api.server.AiravataAPIServer - API server
started over TLS on Port: 9930 ...
...
2017-02-17 13:48:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server
started over TLS on Port: 9930 ...
...
2017-02-17 14:10:58 [main] INFO o.a.a.api.server.AiravataAPIServer - API server
started over TLS on Port: 9930 ...
{noformat}
A couple of solution ideas:
* make the status queue subscriber acknowledge messages explicitly, after it has handled them, instead of auto-acking
* have the orchestrator check the process status in the registry for every
incomplete experiment when it starts up
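The second idea could look roughly like the sketch below. It is an assumption-laden illustration, not Airavata's registry API: the two maps stand in for experiment state and for process status read from the registry, and `StartupReconciler`, `reconcile` and the experiment ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the maps stand in for registry lookups.
final class StartupReconciler {

    /**
     * For every experiment still marked EXECUTING, checks the process status
     * recorded in the registry snapshot. If the process is already COMPLETED,
     * the experiment is marked COMPLETED. Returns the ids that were repaired.
     */
    static List<String> reconcile(Map<String, String> experimentStatus,
                                  Map<String, String> processStatusByExperiment) {
        List<String> repaired = new ArrayList<>();
        for (Map.Entry<String, String> entry : experimentStatus.entrySet()) {
            if (!"EXECUTING".equals(entry.getValue())) {
                continue; // only incomplete experiments need checking
            }
            String processStatus = processStatusByExperiment.get(entry.getKey());
            if ("COMPLETED".equals(processStatus)) {
                entry.setValue("COMPLETED"); // catch up on the missed message
                repaired.add(entry.getKey());
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        Map<String, String> experiments = new LinkedHashMap<>();
        experiments.put("exp-stuck", "EXECUTING");
        experiments.put("exp-running", "EXECUTING");

        Map<String, String> processes = new LinkedHashMap<>();
        processes.put("exp-stuck", "COMPLETED");
        processes.put("exp-running", "EXECUTING");

        System.out.println(reconcile(experiments, processes)); // prints [exp-stuck]
    }
}
```

A check like this would run once at orchestrator startup, so it covers messages lost while the service was down regardless of how the subscriber acks.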
> Process status messages lost by orchestrator
> --------------------------------------------
>
> Key: AIRAVATA-2327
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2327
> Project: Airavata
> Issue Type: Bug
> Components: Airavata Orchestrator
> Affects Versions: 0.17
> Reporter: Marcus Christie
> Assignee: Shameera Rathnayaka
> Fix For: 0.18