[jira] [Commented] (AIRAVATA-1635) [GSoC] Integrate Airavata Java Client SDK with GridChem Client

2015-03-18 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366971#comment-14366971
 ] 

Dimuthu Upeksha commented on AIRAVATA-1635:
---

In GridChem-Client [1], the build and bin folders have also been pushed to the 
repository. Is it necessary to have them in the repo?
This will lead to a huge list of changes once the project is built locally.

[1] https://github.com/SciGaP/sha2-GridChem-client

 [GSoC] Integrate Airavata Java Client SDK with GridChem Client 
 ---

 Key: AIRAVATA-1635
 URL: https://issues.apache.org/jira/browse/AIRAVATA-1635
 Project: Airavata
  Issue Type: Epic
Reporter: Suresh Marru
  Labels: gsoc, gsoc2015, mentor

 GridChem is a Science Gateway that enables users to run computational experiments 
 on multiple supercomputing resources. Currently GridChem, a Java Swing based 
 Web Start client [1], uses an Axis2 based Middleware Service [2] that brokers 
 user actions into computational jobs. 
 This project requires understanding the Client [1] and porting it to use the 
 Apache Airavata Java client SDK. The project has the following components:
 * Integrate GridChem client with Airavata User Store (implemented by WSO2 
 Identity Server)
 * Integrate with Airavata API for application executions.
 * Integrate with Atlassian JIRA + Confluence for user error reporting and 
 status notifications.
 [1] - https://github.com/SciGaP/sha2-GridChem-client
 [2] - https://github.com/SciGaP/sha2-gms



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRAVATA-1635) [GSoC] Integrate Airavata Java Client SDK with GridChem Client

2015-03-17 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365777#comment-14365777
 ] 

Dimuthu Upeksha commented on AIRAVATA-1635:
---

1. Is there any documentation that describes the API functions of the 
Middleware Service [2] and the Airavata SDK?
2. Do I have to set up a local SciGaP (Airavata) instance or is there an existing 
setup?


 [GSoC] Integrate Airavata Java Client SDK with GridChem Client 
 ---

 Key: AIRAVATA-1635
 URL: https://issues.apache.org/jira/browse/AIRAVATA-1635
 Project: Airavata
  Issue Type: Epic
Reporter: Suresh Marru
  Labels: gsoc, gsoc2015, mentor

 GridChem is a Science Gateway that enables users to run computational experiments 
 on multiple supercomputing resources. Currently GridChem, a Java Swing based 
 Web Start client [1], uses an Axis2 based Middleware Service [2] that brokers 
 user actions into computational jobs. 
 This project requires understanding the Client [1] and porting it to use the 
 Apache Airavata Java client SDK. The project has the following components:
 * Integrate GridChem client with Airavata User Store (implemented by WSO2 
 Identity Server)
 * Integrate with Airavata API for application executions.
 * Integrate with Atlassian JIRA + Confluence for user error reporting and 
 status notifications.
 [1] - https://github.com/SciGaP/sha2-GridChem-client
 [2] - https://github.com/SciGaP/sha2-gms



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRAVATA-1635) [GSoC] Integrate Airavata Java Client SDK with GridChem Client

2015-03-26 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382132#comment-14382132
 ] 

Dimuthu Upeksha commented on AIRAVATA-1635:
---

Suresh/Sudhakar,
Can I have access to a working GridChem system, as we discussed earlier? 
Then I'll be able to get familiar with its use cases.

 [GSoC] Integrate Airavata Java Client SDK with GridChem Client 
 ---

 Key: AIRAVATA-1635
 URL: https://issues.apache.org/jira/browse/AIRAVATA-1635
 Project: Airavata
  Issue Type: Epic
Reporter: Suresh Marru
  Labels: gsoc, gsoc2015, mentor

 GridChem is a Science Gateway that enables users to run computational experiments 
 on multiple supercomputing resources. Currently GridChem, a Java Swing based 
 Web Start client [1], uses an Axis2 based Middleware Service [2] that brokers 
 user actions into computational jobs. 
 This project requires understanding the Client [1] and porting it to use the 
 Apache Airavata Java client SDK. The project has the following components:
 * Integrate GridChem client with Airavata User Store (implemented by WSO2 
 Identity Server)
 * Integrate with Airavata API for application executions.
 * Integrate with Atlassian JIRA + Confluence for user error reporting and 
 status notifications.
 [1] - https://github.com/SciGaP/GridChem-Client
 [2] - https://github.com/SciGaP/GridChem-Middleware-Service



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRAVATA-1635) [GSoC] Integrate Airavata Java Client SDK with GridChem Client

2015-03-27 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383553#comment-14383553
 ] 

Dimuthu Upeksha commented on AIRAVATA-1635:
---

I tried the GridChem client by submitting some jobs to the servers and retrieving 
outputs. The input files that we push have a format like this:

%chk=water.chk
%nprocshared=1
%mem=500MB
#P RHF/6-31g* opt pop=reg gfinput gfprint iop(6/7=3) SCF=direct 
 
Gaussian Test Job 00
Water with archiving
 
0 1
O
H 1 0.96
H 1 0.96 2 109.471221

Is this format a standard way of specifying jobs, or is it a format specific to 
GridChem?
Does the client pass this file directly to the middleware, or does it parse it and 
pass only the necessary data?

 [GSoC] Integrate Airavata Java Client SDK with GridChem Client 
 ---

 Key: AIRAVATA-1635
 URL: https://issues.apache.org/jira/browse/AIRAVATA-1635
 Project: Airavata
  Issue Type: Epic
Reporter: Suresh Marru
  Labels: gsoc, gsoc2015, mentor

 GridChem is a Science Gateway that enables users to run computational experiments 
 on multiple supercomputing resources. Currently GridChem, a Java Swing based 
 Web Start client [1], uses an Axis2 based Middleware Service [2] that brokers 
 user actions into computational jobs. 
 This project requires understanding the Client [1] and porting it to use the 
 Apache Airavata Java client SDK. The project has the following components:
 * Integrate GridChem client with Airavata User Store (implemented by WSO2 
 Identity Server)
 * Integrate with Airavata API for application executions.
 * Integrate with Atlassian JIRA + Confluence for user error reporting and 
 status notifications.
 [1] - https://github.com/SciGaP/GridChem-Client
 [2] - https://github.com/SciGaP/GridChem-Middleware-Service



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRAVATA-1635) [GSoC] Integrate Airavata Java Client SDK with GridChem Client

2015-04-23 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510352#comment-14510352
 ] 

Dimuthu Upeksha commented on AIRAVATA-1635:
---

Hi Suresh,

I have almost finished porting the Create/Edit Experiment interface of the 
GridChem client. I need to test it with applications like Gaussian or GAMESS. Some 
resources can be found at [1] and [2], but they do not explain how to register 
those applications in a locally deployed Airavata server. What are the steps I 
need to follow in order to install them on my local machine? Or can I get an 
already hosted server with these applications registered for my testing 
purposes?

[1] 
https://cwiki.apache.org/confluence/display/AIRAVATA/Script+Example+-+GridChem+Gaussian
[2] https://cwiki.apache.org/confluence/display/AIRAVATA/Gaussian+Input+examples

 [GSoC] Integrate Airavata Java Client SDK with GridChem Client 
 ---

 Key: AIRAVATA-1635
 URL: https://issues.apache.org/jira/browse/AIRAVATA-1635
 Project: Airavata
  Issue Type: Epic
Reporter: Suresh Marru
  Labels: gsoc, gsoc2015, mentor
 Fix For: WISHLIST


 GridChem is a Science Gateway that enables users to run computational experiments 
 on multiple supercomputing resources. Currently GridChem, a Java Swing based 
 Web Start client [1], uses an Axis2 based Middleware Service [2] that brokers 
 user actions into computational jobs. 
 This project requires understanding the Client [1] and porting it to use the 
 Apache Airavata Java client SDK. The project has the following components:
 * Integrate GridChem client with Airavata User Store (implemented by WSO2 
 Identity Server)
 * Integrate with Airavata API for application executions.
 * Integrate with Atlassian JIRA + Confluence for user error reporting and 
 status notifications.
 [1] - https://github.com/SciGaP/GridChem-Client
 [2] - https://github.com/SciGaP/GridChem-Middleware-Service



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRAVATA-1635) [GSoC] Integrate Airavata Java Client SDK with GridChem Client

2015-06-24 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600622#comment-14600622
 ] 

Dimuthu Upeksha commented on AIRAVATA-1635:
---

A short demo of the current progress can be found at 
https://www.youtube.com/watch?v=YbqwemIkZng&feature=youtu.be

 [GSoC] Integrate Airavata Java Client SDK with GridChem Client 
 ---

 Key: AIRAVATA-1635
 URL: https://issues.apache.org/jira/browse/AIRAVATA-1635
 Project: Airavata
  Issue Type: Epic
Reporter: Suresh Marru
  Labels: gsoc, gsoc2015, mentor
 Fix For: WISHLIST


 GridChem is a Science Gateway that enables users to run computational experiments 
 on multiple supercomputing resources. Currently GridChem, a Java Swing based 
 Web Start client [1], uses an Axis2 based Middleware Service [2] that brokers 
 user actions into computational jobs. 
 This project requires understanding the Client [1] and porting it to use the 
 Apache Airavata Java client SDK. The project has the following components:
 * Integrate GridChem client with Airavata User Store (implemented by WSO2 
 Identity Server)
 * Integrate with Airavata API for application executions.
 * Integrate with Atlassian JIRA + Confluence for user error reporting and 
 status notifications.
 [1] - https://github.com/SciGaP/GridChem-Client
 [2] - https://github.com/SciGaP/GridChem-Middleware-Service



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AIRAVATA-2746) Job completed and experiment failed due to error in initializing SSH agent

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2746.
---
Resolution: Fixed

> Job completed and experiment failed due to error in initializing SSH agent
> --
>
> Key: AIRAVATA-2746
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2746
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submitted a batch of experiments and got this error as an intermittent 
> error.
>  # Job status is COMPLETED but experiment is FAILED due to the error [1]
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> c8fdb6ff-e1ca-470f-9858-b5d08e9334bd, Task 
> TASK_2dd64a31-44b9-4358-b962-122fdfb36415 failed due to Failed to obtain 
> adaptor for compute resource 
> carbonate.uits.iu.edu_42a3397f-e2c6-4fda-ac9c-d8fb25be82e7 in task 
> TASK_2dd64a31-44b9-4358-b962-122fdfb36415, Error while initializing ssh agent 
> for compute resource 
> carbonate.uits.iu.edu_42a3397f-e2c6-4fda-ac9c-d8fb25be82e7 to token 
> e415b180-7a40-4ad7-8a82-b77b909b70a1 at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:101)
>  at 
> org.apache.airavata.helix.impl.task.staging.ArchiveTask.onRun(ArchiveTask.java:142)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: Error while initializing ssh 
> agent for compute resource 
> carbonate.uits.iu.edu_42a3397f-e2c6-4fda-ac9c-d8fb25be82e7 to token 
> e415b180-7a40-4ad7-8a82-b77b909b70a1 at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:116)
>  at 
> org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
>  at 
> org.apache.airavata.helix.impl.task.staging.DataStagingTask.getComputeResourceAdaptor(DataStagingTask.java:90)
>  at 
> org.apache.airavata.helix.impl.task.staging.ArchiveTask.onRun(ArchiveTask.java:80)
>  ... 10 more Caused by: org.apache.airavata.agents.api.AgentException: Could 
> not create ssh session for host carbonate.uits.iu.edu at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:84)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
>  ... 13 more Caused by: com.jcraft.jsch.JSchException: Auth cancel at 
> com.jcraft.jsch.Session.connect(Session.java:511) at 
> com.jcraft.jsch.Session.connect(Session.java:183) at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
>  ... 14 more
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2743.
---
Resolution: Fixed

> Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling 
> at cluster side
> 
>
> Key: AIRAVATA-2743
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submit an experiment
>  # Cancel the experiment in PGA
>  # Experiment status changes to CANCELING
>  # Experiment status changes to CANCELLED while job is in either SUBMITTED or 
> QUEUED.
>  # Experiment status should change to CANCELLED only after the job status 
> changes to an end status (CANCELLED, COMPLETED or FAILED).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2733) Improvements to Helix log messages

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2733.
---
Resolution: Fixed

> Improvements to Helix log messages
> --
>
> Key: AIRAVATA-2733
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2733
> Project: Airavata
>  Issue Type: Improvement
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> New additions to the current Helix log messages:
>  # Add the job submission command to the log. Currently it is not there and 
> only the job status is logged.
>  # Print the complete job submission response from the cluster; this is 
> useful for investigation when an experiment and/or job fails.
>  # Print both the token and its description in the log for the credential store 
> token in use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2740) Non-existing file transfer has failed the experiment

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2740.
---
Resolution: Fixed

> Non-existing file transfer has failed the experiment
> 
>
> Key: AIRAVATA-2740
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2740
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: http://149.165.168.248:8008/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> # Due to a 0 byte input file upload, an expected output file is not 
> generated in the application execution. 
>  # Airavata tries to transfer the file and an error is thrown because the file 
> does not exist.
>  # But a 0 byte output file is created in the gateway data storage and the user 
> can view it.
>  # The exception thrown is [1].
>  # If a specified output file is not in the working directory, it should be 
> ignored rather than creating an empty file in the storage.
>  # If a 0 byte file exists in the working directory, it should be transferred 
> to the storage.
> [1]
> org.apache.airavata.agents.api.AgentException: java.io.FileNotFoundException: 
> /tmp/PROCESS_ac1064b1-226e-491c-91e6-303448f05f16/temp_inputs/Gaussian.log 
> (No such file or directory) at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.copyFileTo(SshAgentAdaptor.java:307)
>  at 
> org.apache.airavata.helix.agent.storage.StorageResourceAdaptorImpl.uploadFile(StorageResourceAdaptorImpl.java:98)
>  at 
> org.apache.airavata.helix.impl.task.staging.DataStagingTask.transferFileToStorage(DataStagingTask.java:158)
>  at 
> org.apache.airavata.helix.impl.task.staging.OutputDataStagingTask.onRun(OutputDataStagingTask.java:163)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:265) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:70) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> java.io.FileNotFoundException: 
> /tmp/PROCESS_ac1064b1-226e-491c-91e6-303448f05f16/temp_inputs/Gaussian.log 
> (No such file or directory) at java.io.FileInputStream.open0(Native Method) 
> at java.io.FileInputStream.open(FileInputStream.java:195) at 
> java.io.FileInputStream.(FileInputStream.java:138) at 
> java.io.FileInputStream.(FileInputStream.java:93) at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.copyFileTo(SshAgentAdaptor.java:276)
>  ... 13 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2747) OOM issue in Helix Participant

2018-05-08 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467931#comment-16467931
 ] 

Dimuthu Upeksha commented on AIRAVATA-2747:
---

Moved to an SSHJ-based SSH adaptor.

https://github.com/apache/airavata/commit/a2acaac097dfd24c85c1acbb4f041a4ee65a7d95

> OOM issue in Helix Participant
> --
>
> Key: AIRAVATA-2747
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2747
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
> Attachments: airavata.log, threaddump-oom.log
>
>
> There seems to be a memory leak in the Helix participant when creating SSH 
> sessions.
> 2018-04-11 16:06:35,916 [TaskStateModelFactory-task_thread] INFO 
> o.a.a.h.i.t.s.DataStagingTask 
> process=PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970, 
> task=TASK_049812b4-5462-45dd-95a1-9c1db3a5cf73, 
> experiment=SLM001-NEK5000-BR2_08789b1b-feff-46f9-9f4b-67ee9ded280d, 
> gateway=default - Downloading output file 
> /N/dc2/scratch/cgateway/gta-work-dirs/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/NEK5000.stdout
>  to the local path 
> /tmp/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/temp_inputs/NEK5000.stdout
> 2018-04-11 16:06:35,929 [TaskStateModelFactory-task_thread] ERROR 
> o.apache.helix.task.TaskRunner - Problem running the task, report task as 
> FAILED.
> java.lang.OutOfMemoryError: unable to create new native thread
>  at java.lang.Thread.start0(Native Method)
>  at java.lang.Thread.start(Thread.java:717)
>  at com.jcraft.jsch.Session.connect(Session.java:528)
>  at com.jcraft.jsch.Session.connect(Session.java:183)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
>  at 
> org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:58)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268)
>  at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82)
>  at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2713) In helix test bed the outputs are not displayed in the experiment summary

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2713.
---
Resolution: Fixed

> In helix test bed the outputs are not displayed in the experiment summary
> -
>
> Key: AIRAVATA-2713
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2713
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Submitted a job in the Helix implementation and both the job and the experiment 
> completed. The outputs exist and can be viewed and downloaded from storage, but 
> they are not shown in the experiment summary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2736) Job submitted and running in HPC while the experiment is tagged as FAILED

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2736.
---
Resolution: Fixed

> Job submitted and running in HPC while the experiment is tagged as FAILED
> -
>
> Key: AIRAVATA-2736
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2736
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: http://149.165.168.248:8008/ - Helix test env
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submitted an experiment which then submitted the job.
>  # The job ID is returned and the status is ACTIVE.
>  # Due to a zookeeper connection issue the experiment is FAILED.
>  # The job is still running on the HPC resource.
>  # Airavata is not waiting for job monitoring as the task status is not 
> updated in zookeeper.
>  # Error in log [1].
>  # SLM001-AmberSander-BR2_5ed5a19f-ab44-4eba-afb7-1feafaf0bbdd - experiment ID
> [1]
> |org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for /monitoring/2159926/lock at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at 
> org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:778) at 
> org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:696)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:679)
>  at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at 
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:676)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
>  at 
> org.apache.airavata.helix.impl.task.submission.JobSubmissionTask.createMonitoringNode(JobSubmissionTask.java:83)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:144)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:264) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:74) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:70) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2735) When transferring input files, check for the file size and 0 byte files transfers should be restricted

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2735.
---
Resolution: Fixed

> When transferring input files, check for the file size and 0 byte files 
> transfers should be restricted
> --
>
> Key: AIRAVATA-2735
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2735
> Project: Airavata
>  Issue Type: Improvement
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # When transferring input files, if the file is 0 bytes in size, the file transfer 
> task should fail and the experiment should fail.
>  # The user should be notified that the file is empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2734) Experiment status in LAUNCHED while job is in ACTIVE. Experiment status should be EXECUTING.

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2734.
---
Resolution: Fixed

> Experiment status in LAUNCHED while job is in ACTIVE. Experiment status 
> should be EXECUTING.
> 
>
> Key: AIRAVATA-2734
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2734
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Experiment status should change to EXECUTING when it is picked up by Helix. 
> Once the status changes to EXECUTING, the job status will change to 
> SUBMITTED, QUEUED and ACTIVE.
> Once the job is COMPLETED, the experiment status will change to COMPLETED after 
> the output file transfers are completed.
>  
> Currently the experiment status is LAUNCHED even though the job is submitted and running. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2737) Too many Zookeeper connections created

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2737.
---
Resolution: Fixed

> Too many Zookeeper connections created
> --
>
> Key: AIRAVATA-2737
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2737
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> For each task in a workflow a zookeeper connection is opened. This creates 
> too many zookeeper connections and, as a result, some experiments are not 
> moving past LAUNCHED.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2747) OOM issue in Helix Participant

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2747.
---
Resolution: Fixed

> OOM issue in Helix Participant
> --
>
> Key: AIRAVATA-2747
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2747
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
> Attachments: airavata.log, threaddump-oom.log
>
>
> There seems to be a memory leak in the Helix participant when creating SSH 
> sessions.
> 2018-04-11 16:06:35,916 [TaskStateModelFactory-task_thread] INFO 
> o.a.a.h.i.t.s.DataStagingTask 
> process=PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970, 
> task=TASK_049812b4-5462-45dd-95a1-9c1db3a5cf73, 
> experiment=SLM001-NEK5000-BR2_08789b1b-feff-46f9-9f4b-67ee9ded280d, 
> gateway=default - Downloading output file 
> /N/dc2/scratch/cgateway/gta-work-dirs/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/NEK5000.stdout
>  to the local path 
> /tmp/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/temp_inputs/NEK5000.stdout
> 2018-04-11 16:06:35,929 [TaskStateModelFactory-task_thread] ERROR 
> o.apache.helix.task.TaskRunner - Problem running the task, report task as 
> FAILED.
> java.lang.OutOfMemoryError: unable to create new native thread
>  at java.lang.Thread.start0(Native Method)
>  at java.lang.Thread.start(Thread.java:717)
>  at com.jcraft.jsch.Session.connect(Session.java:528)
>  at com.jcraft.jsch.Session.connect(Session.java:183)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
>  at 
> org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:58)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268)
>  at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82)
>  at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2746) Job completed and experiment failed due to error in initializing SSH agent

2018-05-08 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467935#comment-16467935
 ] 

Dimuthu Upeksha commented on AIRAVATA-2746:
---

Fixed in the new SSHJ-based SSH adaptor.

https://github.com/apache/airavata/commit/a2acaac097dfd24c85c1acbb4f041a4ee65a7d95

> Job completed and experiment failed due to error in initializing SSH agent
> --
>
> Key: AIRAVATA-2746
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2746
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submitted a batch of experiments and got this error as an intermittent 
> error.
>  # Job status is COMPLETED but experiment is FAILED due to the error [1]
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> c8fdb6ff-e1ca-470f-9858-b5d08e9334bd, Task 
> TASK_2dd64a31-44b9-4358-b962-122fdfb36415 failed due to Failed to obtain 
> adaptor for compute resource 
> carbonate.uits.iu.edu_42a3397f-e2c6-4fda-ac9c-d8fb25be82e7 in task 
> TASK_2dd64a31-44b9-4358-b962-122fdfb36415, Error while initializing ssh agent 
> for compute resource 
> carbonate.uits.iu.edu_42a3397f-e2c6-4fda-ac9c-d8fb25be82e7 to token 
> e415b180-7a40-4ad7-8a82-b77b909b70a1 at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:101)
>  at 
> org.apache.airavata.helix.impl.task.staging.ArchiveTask.onRun(ArchiveTask.java:142)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: Error while initializing ssh 
> agent for compute resource 
> carbonate.uits.iu.edu_42a3397f-e2c6-4fda-ac9c-d8fb25be82e7 to token 
> e415b180-7a40-4ad7-8a82-b77b909b70a1 at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:116)
>  at 
> org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
>  at 
> org.apache.airavata.helix.impl.task.staging.DataStagingTask.getComputeResourceAdaptor(DataStagingTask.java:90)
>  at 
> org.apache.airavata.helix.impl.task.staging.ArchiveTask.onRun(ArchiveTask.java:80)
>  ... 10 more Caused by: org.apache.airavata.agents.api.AgentException: Could 
> not create ssh session for host carbonate.uits.iu.edu at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:84)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
>  ... 13 more Caused by: com.jcraft.jsch.JSchException: Auth cancel at 
> com.jcraft.jsch.Session.connect(Session.java:511) at 
> com.jcraft.jsch.Session.connect(Session.java:183) at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
>  ... 14 more
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2745) Job cancellations in the cluster should cancel the job and experiment in the gateway portal.

2018-05-08 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2745.
---
Resolution: Fixed

> Job cancellations in the cluster should cancel the job and experiment in the 
> gateway portal.
> 
>
> Key: AIRAVATA-2745
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2745
> Project: Airavata
>  Issue Type: New Feature
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> When a user cancels the job directly in the cluster/HPC, an email will be sent 
> to the monitoring system. This email is sent for all Slurm jobs, but PBS could be 
> different and may not send it.
>  
> If Airavata receives a cancel email, the job should be cancelled and so should 
> the experiment, irrespective of where the command was executed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRAVATA-2792) Staging seagrid fails to submit a job

2018-05-18 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha updated AIRAVATA-2792:
--
Component/s: helix implementation

> Staging seagrid fails to submit a job
> -
>
> Key: AIRAVATA-2792
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2792
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Sudhakar Pamidighantam
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> f32162d3-9409-4ba9-92c3-aee14c8e5fb4, Task 
> TASK_5bf0a74e-6d0a-48bf-87d1-1af985bd90fc failed due to Failed to setup 
> environment of task TASK_5bf0a74e-6d0a-48bf-87d1-1af985bd90fc, null at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:53)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> java.lang.NullPointerException at 
> org.apache.airavata.helix.impl.task.TaskContext.getComputeResourceCredentialToken(TaskContext.java:422)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:45)
>  ... 10 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2620) Force post processing functionality

2017-12-24 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303021#comment-16303021
 ] 

Dimuthu Upeksha commented on AIRAVATA-2620:
---

Fixed in 
https://github.com/apache/airavata/commit/10734eeb96faf77f5bb4692833194c5abb8c3e17

> Force post processing functionality 
> 
>
> Key: AIRAVATA-2620
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2620
> Project: Airavata
>  Issue Type: Improvement
>Affects Versions: 0.16
>Reporter: Suresh Marru
>Assignee: Dimuthu Upeksha
> Fix For: 0.17
>
>
> Due to the current limitation of relying only on email for job monitoring, the 
> post-processing sometimes has inherent delays. The Ultrascan science gateway 
> would like to have a capability in Airavata to request forced post 
> processing. This will be used when clients have out-of-band knowledge about 
> job completion (for example through code-instrumented UDP messages) and would 
> like Airavata to force staging of output files.
> This improvement has to be added carefully so that the existing life cycle of an 
> experiment is not hampered. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRAVATA-2621) SSH port provided in compute resource registration is not considered for cluster SSH communication

2018-01-17 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329205#comment-16329205
 ] 

Dimuthu Upeksha commented on AIRAVATA-2621:
---

The fix will look for SSHJobSubmission instances for a compute resource if the 
job submission protocol is SSH. If it can find an instance, it will override 
the host name and port of the ServerInfo bean with the alternateHostName and 
the port of the SSHJobSubmission instance.
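
To illustrate, here is a minimal, self-contained sketch of that override logic. The ServerInfo and SSHJobSubmission classes below are simplified stand-ins for the actual Airavata beans, so the field and method names are assumptions for illustration only:

// Simplified stand-ins for the Airavata beans; only the fields needed for the
// host/port override are modelled here.
final class ServerInfo {
    String host;
    int port = 22;                       // default SSH port
    ServerInfo(String host) { this.host = host; }
}

final class SSHJobSubmission {
    final String alternativeSSHHostName; // may be null if not configured
    final int sshPort;                   // e.g. 15022
    SSHJobSubmission(String altHost, int sshPort) {
        this.alternativeSSHHostName = altHost;
        this.sshPort = sshPort;
    }
}

final class SshEndpointResolver {
    // If an SSH job submission interface is registered for the compute resource,
    // its host name and port win over the defaults used so far.
    static ServerInfo resolve(ServerInfo defaults, SSHJobSubmission sshJobSubmission) {
        if (sshJobSubmission == null) {
            return defaults;                               // no SSH interface: keep defaults
        }
        if (sshJobSubmission.alternativeSSHHostName != null
                && !sshJobSubmission.alternativeSSHHostName.isEmpty()) {
            defaults.host = sshJobSubmission.alternativeSSHHostName;
        }
        if (sshJobSubmission.sshPort > 0) {
            defaults.port = sshJobSubmission.sshPort;      // e.g. 15022 instead of 22
        }
        return defaults;
    }
}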

> SSH port provided in compute resource registration is not considered for 
> cluster SSH communication
> --
>
> Key: AIRAVATA-2621
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2621
> Project: Airavata
>  Issue Type: Bug
>  Components: GFac
>Affects Versions: 0.18
> Environment: https://hpcgateway.gsu.edu/
> https://scigap.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> 1. Added a specific port for job submissions (15022).
> 2. But when submitting jobs, for environment creation, GFac is using the 
> default port 22, not the one specified in scigap.org for hpclogin.gsu.edu.
> 3. Log messages in the airavata log:
> 2017-12-19 11:13:18,996 [pool-7-thread-2] INFO  
> o.a.airavata.gfac.impl.Factory 
> process_id=PROCESS_3b471b3b-5b4e-4b6d-a66e-554652a390d2, 
> token_id=35da840b-63d5-4cbf-b9ce-3005cd94d961, 
> experiment_id=NWChem2_a38ac303-666f-4dea-9b4c-7bffe0f97dd7, 
> gateway_id=georgiastate - Initialize a new SSH session for 
> :airavata_hpclogin.gsu.edu_22_35da840b-63d5-4cbf-b9ce-3005cd94d961
> 2017-12-19 11:15:26,272 [pool-7-thread-2] ERROR o.a.a.gfac.core.GFacException 
> process_id=PROCESS_3b471b3b-5b4e-4b6d-a66e-554652a390d2, 
> token_id=35da840b-63d5-4cbf-b9ce-3005cd94d961, 
> experiment_id=NWChem2_a38ac303-666f-4dea-9b4c-7bffe0f97dd7, 
> gateway_id=georgiastate - JSch initialization error
> com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed 
> out (Connection timed out)
> at com.jcraft.jsch.Util.createSocket(Util.java:349)
> at com.jcraft.jsch.Session.connect(Session.java:215)
> at com.jcraft.jsch.Session.connect(Session.java:183)
> at 
> org.apache.airavata.gfac.impl.Factory.getSSHSession(Factory.java:542)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.getSshSession(HPCRemoteCluster.java:138)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.getSession(HPCRemoteCluster.java:315)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.makeDirectory(HPCRemoteCluster.java:242)
> at 
> org.apache.airavata.gfac.impl.task.EnvironmentSetupTask.execute(EnvironmentSetupTask.java:51)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTask(GFacEngineImpl.java:814)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.configureWorkspace(GFacEngineImpl.java:553)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTaskListFrom(GFacEngineImpl.java:324)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeProcess(GFacEngineImpl.java:286)
> at 
> org.apache.airavata.gfac.impl.GFacWorker.executeProcess(GFacWorker.java:227)
> at org.apache.airavata.gfac.impl.GFacWorker.run(GFacWorker.java:86)
> at 
> org.apache.airavata.common.logging.MDCUtil.lambda$wrapWithMDC$0(MDCUtil.java:40)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection timed out (Connection timed 
> out)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at java.net.Socket.(Socket.java:434)
> at java.net.Socket.(Socket.java:211)
> at com.jcraft.jsch.Util.createSocket(Util.java:343)
> ... 17 common frames omitted
> ?NWChem2_a38ac303-666f-4dea-9b4c-7bffe0f97dd7



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2624) Stampede2 cluster SSH connectivity issue

2018-01-17 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329304#comment-16329304
 ] 

Dimuthu Upeksha commented on AIRAVATA-2624:
---

Fixed in 
[https://github.com/apache/airavata/commit/dc6ea56eb5435ec3f03d6a8226a44497e7290616]

Added the UIKeyboardInteractive feature to the DefaultUserInfo class to support 
two-factor SSH authentication.
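
As a rough sketch of that change: a JSch UserInfo that also implements UIKeyboardInteractive can answer keyboard-interactive prompts, which is what clusters using two-factor/interactive SSH authentication send. How the responses are obtained here (a single configured password/token) is an assumption; the actual DefaultUserInfo may source them differently.

import com.jcraft.jsch.UIKeyboardInteractive;
import com.jcraft.jsch.UserInfo;

public class KeyboardInteractiveUserInfo implements UserInfo, UIKeyboardInteractive {

    private final String password;

    public KeyboardInteractiveUserInfo(String password) {
        this.password = password;
    }

    @Override
    public String[] promptKeyboardInteractive(String destination, String name,
                                              String instruction, String[] prompt,
                                              boolean[] echo) {
        // Answer every prompt with the configured password/token. A real
        // implementation would inspect the prompts to decide what to send back.
        String[] responses = new String[prompt.length];
        for (int i = 0; i < prompt.length; i++) {
            responses[i] = password;
        }
        return responses;
    }

    @Override public String getPassword()                 { return password; }
    @Override public boolean promptPassword(String msg)   { return true; }
    @Override public String getPassphrase()               { return null; }
    @Override public boolean promptPassphrase(String msg) { return false; }
    @Override public boolean promptYesNo(String message)  { return true; }  // accept host keys
    @Override public void showMessage(String message)     { /* log if needed */ }
}

The instance would be registered on the JSch session with session.setUserInfo(...) before session.connect().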

> Stampede2 cluster SSH connectivity issue
> ---
>
> Key: AIRAVATA-2624
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2624
> Project: Airavata
>  Issue Type: Bug
>  Components: Airavata System, GFac
>Affects Versions: 0.18
> Environment: https://seagrid.org 
>Reporter: Eroma
>Assignee: Suresh Marru
>Priority: Major
> Fix For: 0.18
>
>
> Job submission fails at environment creation due to a JSch initialization error.
> Error messages:
> 2018-01-09 09:46:10,786 [pool-7-thread-15] ERROR 
> o.a.a.gfac.core.GFacException 
> process_id=PROCESS_650014f6-fcb6-4680-90ea-898bee373f37, 
> token_id=3d65bf6d-2c9f-4166-a51b-e76e0022bd3b, 
> experiment_id=Clone_of_st2molcastest_e2942a34-c9c7-4f04-8ccb-af6fe27e0990, 
> gateway_id=seagrid - JSch initialization error
> com.jcraft.jsch.JSchException: Auth fail
> at com.jcraft.jsch.Session.connect(Session.java:512)
> at com.jcraft.jsch.Session.connect(Session.java:183)
> at 
> org.apache.airavata.gfac.impl.Factory.getSSHSession(Factory.java:542)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.getSshSession(HPCRemoteCluster.java:138)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.getSession(HPCRemoteCluster.java:315)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.makeDirectory(HPCRemoteCluster.java:242)
> at 
> org.apache.airavata.gfac.impl.task.EnvironmentSetupTask.execute(EnvironmentSetupTask.java:51)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTask(GFacEngineImpl.java:814)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.configureWorkspace(GFacEngineImpl.java:553)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTaskListFrom(GFacEngineImpl.java:324)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeProcess(GFacEngineImpl.java:286)
> at 
> org.apache.airavata.gfac.impl.GFacWorker.executeProcess(GFacWorker.java:227)
> at org.apache.airavata.gfac.impl.GFacWorker.run(GFacWorker.java:86)
> at 
> org.apache.airavata.common.logging.MDCUtil.lambda$wrapWithMDC$0(MDCUtil.java:40)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> 2018-01-09 09:46:10,786 [pool-7-thread-15] ERROR 
> o.a.a.g.i.t.EnvironmentSetupTask 
> process_id=PROCESS_650014f6-fcb6-4680-90ea-898bee373f37, 
> token_id=3d65bf6d-2c9f-4166-a51b-e76e0022bd3b, 
> experiment_id=Clone_of_st2molcastest_e2942a34-c9c7-4f04-8ccb-af6fe27e0990, 
> gateway_id=seagrid - Error while environment setup
> org.apache.airavata.gfac.core.GFacException: JSch initialization error
> at 
> org.apache.airavata.gfac.impl.Factory.getSSHSession(Factory.java:545)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.getSshSession(HPCRemoteCluster.java:138)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.getSession(HPCRemoteCluster.java:315)
> at 
> org.apache.airavata.gfac.impl.HPCRemoteCluster.makeDirectory(HPCRemoteCluster.java:242)
> at 
> org.apache.airavata.gfac.impl.task.EnvironmentSetupTask.execute(EnvironmentSetupTask.java:51)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTask(GFacEngineImpl.java:814)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.configureWorkspace(GFacEngineImpl.java:553)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTaskListFrom(GFacEngineImpl.java:324)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeProcess(GFacEngineImpl.java:286)
> at 
> org.apache.airavata.gfac.impl.GFacWorker.executeProcess(GFacWorker.java:227)
> at org.apache.airavata.gfac.impl.GFacWorker.run(GFacWorker.java:86)
> at 
> org.apache.airavata.common.logging.MDCUtil.lambda$wrapWithMDC$0(MDCUtil.java:40)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: com.jcraft.jsch.JSchException: Auth fail
> at com.jcraft.jsch.Session.connect(Session.java:512)
> at com.jcraft.jsch.Session.connect(Session.java:183)
> at 
> 

[jira] [Commented] (AIRAVATA-2625) Derive and present Text outputs in Experiment summary

2018-01-24 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338084#comment-16338084
 ] 

Dimuthu Upeksha commented on AIRAVATA-2625:
---

This might require a few changes to the output data staging task, and there are a 
few restrictions. Because we need to read the file to extract the values, we need 
to read the file content into the JVM heap. The risk is that if we read a large 
file (several GB) into the heap, it will cause an OOM error and eventually the 
whole JVM might crash. So we have to come up with a reasonable size cap for the 
files that are scanned (2 - 5 MB?), and if the file is larger than that, we 
simply ignore the file and notify the user that the output file is too big to scan.
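
A minimal sketch of that size check, assuming a 5 MB cap and hypothetical method names (not the actual Airavata code):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class OutputScanner {

    private static final long MAX_SCANNABLE_BYTES = 5L * 1024 * 1024; // assumed ~5 MB cap

    // Returns the file content only when it is small enough to scan; otherwise
    // returns empty so the caller can skip value extraction and notify the user.
    public static Optional<String> readIfScannable(Path outputFile) throws IOException {
        long size = Files.size(outputFile);
        if (size > MAX_SCANNABLE_BYTES) {
            // Too large to load into the JVM heap safely; ignore and notify.
            return Optional.empty();
        }
        return Optional.of(new String(Files.readAllBytes(outputFile), StandardCharsets.UTF_8));
    }
}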

> Derive and present Text outputs in Experiment summary
> -
>
> Key: AIRAVATA-2625
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2625
> Project: Airavata
>  Issue Type: Bug
>  Components: Airavata System
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Suresh Marru
>Priority: Major
>
> Currently the available application outputs are in the form of stdout, stderr 
> and URI (files). Going forward, we want text (String, Integer, Float) outputs 
> directly in the experiment summary.
> These text outputs are to be displayed as key+value pairs, and when defining them 
> a file to derive them from should be specified, e.g. stdout.
> At a given time, one or many of these key+value pairs could be derived from a 
> single file. Also, a gateway admin can define multiple files to derive multiple 
> of these values.
> These would be displayed to the user in the experiment summary and in the detailed 
> experiment summary in the admin dashboard experiment statistics.
> These should also be displayed to any other gateway user the experiment is shared with. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2143) Experiments with overridden resource allocation details tries to use qos and reservation from community user

2018-01-29 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343809#comment-16343809
 ] 

Dimuthu Upeksha commented on AIRAVATA-2143:
---

Fixed in https://github.com/apache/airavata/pull/166

> Experiments with overridden resource allocation details tries to use qos and 
> reservation from  community user
> -
>
> Key: AIRAVATA-2143
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2143
> Project: Airavata
>  Issue Type: Bug
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Experiments with overridden resource allocation details try to use the QOS and 
> reservation from the community user.
> The remote resource login details sent at experiment creation are used to submit 
> jobs, but the submission refers to the QOS and reservation of the community 
> allocation of the remote resource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2687) Distributed agents on compute resources to communicate with Airavata server

2018-02-22 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2687:
-

 Summary: Distributed agents on compute resources to communicate 
with Airavata server 
 Key: AIRAVATA-2687
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2687
 Project: Airavata
  Issue Type: New Feature
Reporter: Dimuthu Upeksha
Assignee: Dimuthu Upeksha


Currently Airavata talks to compute resources through SSH. However, 
there are scenarios where this might not work:
 # Airavata cannot set up an SSH connection when the compute resource is set 
behind a firewall or the system administrators do not allow SSH connections to 
the compute resource. 
 # A burst of SSH calls to a particular compute resource can lead to the 
false detection of the Airavata server performing a DoS attack on the compute resource.

As an alternative, we are considering another approach to create the communication 
between the Airavata server and the compute resource. The suggestion is to install an 
agent on the compute resource, and the agent creates a connection to the Airavata 
server when required. This might eliminate the firewall issue, as the compute 
resource is the one which initiates the connection. The server can then send 
commands. The communication protocol might be application specific, but it should 
support both request-response and fire-and-forget models.

However, there are challenges:
 # How does the agent find the Airavata server?
 # What happens when an agent fails? How does the Airavata server know about it?
 # When the Airavata server needs to execute a command on a compute resource where 
an agent is installed, how does it find the correct agent connection?

At the first stage, we need to send Linux commands to be executed on the 
compute resource and transfer files through the agent. A minimal sketch of such 
an agent is shown below.
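
To make the idea concrete, a very small illustrative sketch of the agent side follows. The server endpoint and the line-based wire protocol are assumptions made purely for illustration (request-response only; file transfer and agent discovery are not shown):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ComputeResourceAgent {

    public static void main(String[] args) throws Exception {
        // Hypothetical server endpoint; the agent dials OUT, so no inbound
        // firewall rule is needed on the compute resource.
        String serverHost = args.length > 0 ? args[0] : "airavata.example.org";
        int serverPort = args.length > 1 ? Integer.parseInt(args[1]) : 19000;

        try (Socket socket = new Socket(serverHost, serverPort);
             BufferedReader commands = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter replies = new PrintWriter(socket.getOutputStream(), true)) {

            String command;
            while ((command = commands.readLine()) != null) {   // one command per line
                replies.println(runLocally(command));           // send the output back
            }
        }
    }

    private static String runLocally(String command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("bash", "-c", command)
                .redirectErrorStream(true)
                .start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }
}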

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2689) Distributed email clients to improve email monitoring

2018-02-22 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2689:
-

 Summary: Distributed email clients to improve email monitoring 
 Key: AIRAVATA-2689
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2689
 Project: Airavata
  Issue Type: New Feature
Reporter: Dimuthu Upeksha
Assignee: Dimuthu Upeksha


Once Airavata submits a job to a compute resource, the scheduler on that 
resource sends emails about the status of the job. The content of these emails 
differs for each application type, so we have written a set of parsers [2] 
that extract the relevant information from the email messages. Airavata has an 
email monitoring system which reads those emails, parses them, and performs 
the necessary actions depending on their content. However, this email 
monitoring system is tightly coupled to the task execution engine, so we can't 
easily replicate it for high availability.

The idea is to come up with a standalone email monitoring client that reads 
emails from a given email account, parses them, and converts them into a 
standard message format. Once a message is parsed into the known format, it is 
put into a queue (RabbitMQ, Kafka) to be consumed by the task execution 
engine. A minimal sketch of such a client is included below the references. 
There are a few non-functional requirements:
 # To improve availability, we need more than one monitoring client running at 
a given time. However, we need to make sure that exactly one client consumes a 
given email, so we need coordination among the email clients.
 # In future, this will be deployed as a microservice, so the final packaging 
should be compatible with Docker.

The current email monitor implementation is [1], and the set of parsers 
available for each application is in [2].

[1] 
[https://github.com/apache/airavata/blob/master/modules/gfac/gfac-impl/src/main/java/org/apache/airavata/gfac/monitor/email/EmailBasedMonitor.java]

[2] 
https://github.com/apache/airavata/tree/master/modules/gfac/gfac-impl/src/main/java/org/apache/airavata/gfac/monitor/email/parser
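As an illustration of the intended shape only (not the final design), a 
standalone client could poll an IMAP inbox and publish parsed messages to 
RabbitMQ, as sketched below. The host names, queue name, and JSON payload are 
assumptions, and the placeholder parsing stands in for the real parsers in 
[2]. Note that marking a message as seen is not enough for the exactly-once 
requirement across multiple clients; that needs external coordination (for 
example a per-message ZooKeeper lock), which is left out here.

{code:java}
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import javax.mail.Flags;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;
import javax.mail.search.FlagTerm;
import java.util.Properties;

/** Hypothetical standalone monitor: IMAP in, RabbitMQ out. */
public class EmailMonitorClient {

    public static void main(String[] args) throws Exception {
        // IMAP connection; host and credentials are placeholders.
        Session session = Session.getInstance(new Properties());
        Store store = session.getStore("imaps");
        store.connect("imap.example.org", "gateway-monitor", "password");
        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_WRITE);

        // Queue the task execution engine would consume from (name is assumed).
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.queueDeclare("job-status-events", true, false, false, null);

            // Read unseen messages, convert them, publish, then mark as seen.
            Message[] unseen = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));
            for (Message message : unseen) {
                String payload = toStandardFormat(message);
                channel.basicPublish("", "job-status-events", null, payload.getBytes("UTF-8"));
                message.setFlag(Flags.Flag.SEEN, true);
            }
        }
        inbox.close(false);
        store.close();
    }

    // Placeholder for the scheduler-specific parsers in [2].
    private static String toStandardFormat(Message message) throws Exception {
        return "{\"subject\":\"" + message.getSubject() + "\"}";
    }
}
{code}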



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2874) Data staging tasks should retry if a file transfer is failed

2018-08-24 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2874.
---
Resolution: Fixed

> Data staging tasks should retry if a file transfer is failed
> 
>
> Key: AIRAVATA-2874
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2874
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> If a file transfer fails from the storage resource to the compute resource 
> or the other way around, Airavata should retry that transfer up to 3 times. 
> If all 3 attempts fail, the experiment should be marked as failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2874) Data staging tasks should retry if a file transfer is failed

2018-08-24 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591652#comment-16591652
 ] 

Dimuthu Upeksha commented on AIRAVATA-2874:
---

Fixed and deployed in staging environment

[1] 
https://github.com/apache/airavata/commit/c905ef59739bb9ad765457ca9baa04e6d09f882e

> Data staging tasks should retry if a file transfer is failed
> 
>
> Key: AIRAVATA-2874
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2874
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> If a file transfer fails from the storage resource to the compute resource 
> or the other way around, Airavata should retry that transfer up to 3 times. 
> If all 3 attempts fail, the experiment should be marked as failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2874) Data staging tasks should retry if a file transfer is failed

2018-08-24 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2874:
-

 Summary: Data staging tasks should retry if a file transfer is 
failed
 Key: AIRAVATA-2874
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2874
 Project: Airavata
  Issue Type: Bug
  Components: helix implementation
Reporter: Dimuthu Upeksha
Assignee: Dimuthu Upeksha


If a file transfer fails from the storage resource to the compute resource or 
the other way around, Airavata should retry that transfer up to 3 times. If 
all 3 attempts fail, the experiment should be marked as failed.
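A minimal sketch of the intended retry behaviour is below; the count of 3 
comes from this description, and everything else (names, exception handling) 
is illustrative.

{code:java}
/**
 * Illustrative retry wrapper around a single file transfer attempt.
 * TransferOp is a stand-in for whatever performs the actual copy.
 */
public class RetryingTransfer {

    @FunctionalInterface
    public interface TransferOp {
        void transfer() throws Exception;
    }

    public static void transferWithRetry(TransferOp op) throws Exception {
        final int maxAttempts = 3;                 // from the issue description
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.transfer();
                return;                            // success, no further retries
            } catch (Exception e) {
                lastFailure = e;                   // remember the failure and try again
            }
        }
        // All attempts failed: surface the error so the experiment is marked as failed.
        throw new Exception("File transfer failed after " + maxAttempts + " attempts", lastFailure);
    }
}
{code}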



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed

2018-04-09 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430643#comment-16430643
 ] 

Dimuthu Upeksha commented on AIRAVATA-2742:
---

Tested this locally with both SIGKILL and SIGTERM but couldn't reproduce it. 
As a safety step, I'm updating the Helix core version from 0.6.7 to 0.8.0. I 
would still suggest inspecting participant restarts and the consistency of 
workflow executions extensively in future testing iterations. In particular, 
watch the Helix Controller log.

https://github.com/apache/airavata/commit/01e0e70605ea9937304458651335166e52c51d60

> Helix Controller throws an Exception when the participant is killed
> ---
>
> Key: AIRAVATA-2742
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> This was a sporadic issue and occurred only once in the test setup. There 
> were 5 - 10 tasks running in the Participant, and the Participant was 
> externally killed by a SIGTERM command (kill . Once the Participant was 
> started again, it did not pick up the tasks that it was running at the time 
> it was killed. Surprisingly, the respective workflows were still in 
> IN_PROGRESS status. The Helix Controller log showed the following error for 
> each workflow. This seems like a bug in Helix, and I posted the issue to the 
> Helix mailing list (Subject : Sporadic issue when restarting a Participant). 
>  
> 2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage 
>  - Error computing assignment for resource 
> Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117.
>  Skipping.
> java.lang.NullPointerException: Name is null
>         at java.lang.Enum.valueOf(Enum.java:236)
>         at 
> org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
>         at 
> org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
>         at 
> org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
>         at 
> org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
>         at 
> org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
>         at 
> org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
>         at 
> org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
>         at 
> org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
> 2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage 
>  - Error computing assignment for resource 
> Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025.
>  Skipping. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2713) In helix test bed the outputs are not displayed in the experiment summary

2018-04-10 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432972#comment-16432972
 ] 

Dimuthu Upeksha commented on AIRAVATA-2713:
---

Fixed in 
https://github.com/apache/airavata/commit/bc0016f65dfb0146c92bbd76cc25cb93650748ea

> In helix test bed the outputs are not displayed in the experiment summary
> -
>
> Key: AIRAVATA-2713
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2713
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Submitted a job in the helix implementation, and both the job and the 
> experiment got completed. The outputs exist and can be viewed and downloaded 
> from storage, but they are not shown in the experiment summary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side

2018-04-10 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432964#comment-16432964
 ] 

Dimuthu Upeksha commented on AIRAVATA-2743:
---

Rolled back to the initial mode, as some schedulers do not send emails once 
the job is cancelled.

https://github.com/apache/airavata/commit/98b7d16065f946f32ccfb886ff8190c6a545c434

> Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling 
> at cluster side
> 
>
> Key: AIRAVATA-2743
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submit an experiment
>  # Cancel the experiment in PGA
>  # Experiment status changes to CANCELING
>  # Experiment status changes to CANCELLED while job is in either SUBMITTED or 
> QUEUED.
>  # Experiment status should change to CANCELLED only after the job status 
> changes to an end status (CANCELLED, COMPLETED or FAILED).
>  #



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2745) Job cancellations in the cluster should cancel the job and experiment in the gateway portal.

2018-04-10 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432966#comment-16432966
 ] 

Dimuthu Upeksha commented on AIRAVATA-2745:
---

Fixed in 
https://github.com/apache/airavata/commit/e26b66c4b5fe0912c9992ef1baefa2f364469377

> Job cancellations in the cluster should cancel the job and experiment in the 
> gateway portal.
> 
>
> Key: AIRAVATA-2745
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2745
> Project: Airavata
>  Issue Type: New Feature
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> When a user cancels the job directly in the cluster/JPC, an email will be 
> sent to the monitoring account. This email is sent for all Slurm jobs, but 
> PBS could be different and may not send it.
>  
> If Airavata receives a cancel email, the job should be cancelled, and so 
> should the experiment, irrespective of where the command was executed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2736) Job submitted and running in HPC while the experiment is tagged as FAILED

2018-04-10 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432957#comment-16432957
 ] 

Dimuthu Upeksha commented on AIRAVATA-2736:
---

Fixed in 
https://github.com/apache/airavata/commit/1b950bdb5b96f046e4fbaac6e7024b158dd86e7a

> Job submitted and running in HPC while the experiment is tagged as FAILED
> -
>
> Key: AIRAVATA-2736
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2736
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: http://149.165.168.248:8008/ - Helix test env
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submitted an experiment which then submitted the job.
>  # The job ID is returned and the status is ACTIVE.
>  # Due to a zookeeper connection issue the experiment is FAILED.
>  # The job is still running in HPC.
>  # Airavata is not waiting for job monitoring as the task status is not 
> updated in zookeeper.
>  # Error in the log: [1]
>  # SLM001-AmberSander-BR2_5ed5a19f-ab44-4eba-afb7-1feafaf0bbdd - exp ID
> [1]
> |org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for /monitoring/2159926/lock at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at 
> org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:778) at 
> org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:696)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:679)
>  at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at 
> org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:676)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443)
>  at 
> org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
>  at 
> org.apache.airavata.helix.impl.task.submission.JobSubmissionTask.createMonitoringNode(JobSubmissionTask.java:83)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:144)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:264) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:74) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:70) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2735) When transferring input files, check for the file size and 0 byte files transfers should be restricted

2018-04-10 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432973#comment-16432973
 ] 

Dimuthu Upeksha commented on AIRAVATA-2735:
---

Fixed in 
https://github.com/apache/airavata/commit/0f0712382a9ce124a90939c247d2f352510e350b

> When transferring input files, check for the file size and 0 byte files 
> transfers should be restricted
> --
>
> Key: AIRAVATA-2735
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2735
> Project: Airavata
>  Issue Type: Improvement
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # When transferring input files, if a file is 0 bytes in size, the file 
> transfer task should fail and the experiment should fail.
>  # The user should be notified that the file is empty.
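A purely illustrative check of that rule (the actual change is in the commit 
above); the class and method names are made up for the example:

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative pre-transfer validation of an input file. */
public class InputFileValidator {

    public static void validateBeforeTransfer(String localPath) throws Exception {
        Path path = Paths.get(localPath);
        if (!Files.exists(path)) {
            throw new Exception("Input file " + localPath + " does not exist");
        }
        if (Files.size(path) == 0) {
            // Fail the staging task so the experiment fails and the user is notified.
            throw new Exception("Input file " + localPath + " is empty (0 bytes)");
        }
    }
}
{code}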



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed

2018-04-11 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434115#comment-16434115
 ] 

Dimuthu Upeksha commented on AIRAVATA-2742:
---

The Helix team identified this as a bug, and they will fix it in a future 
release.

https://issues.apache.org/jira/browse/HELIX-693

Helix Dev discussion - Subject: Sporadic issue when restarting a Participant

> Helix Controller throws an Exception when the participant is killed
> ---
>
> Key: AIRAVATA-2742
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> This was a sporadic issue and occurred only once in the test setup. There 
> were 5 - 10 tasks running in the Participant, and the Participant was 
> externally killed by a SIGTERM command (kill . Once the Participant was 
> started again, it did not pick up the tasks that it was running at the time 
> it was killed. Surprisingly, the respective workflows were still in 
> IN_PROGRESS status. The Helix Controller log showed the following error for 
> each workflow. This seems like a bug in Helix, and I posted the issue to the 
> Helix mailing list (Subject : Sporadic issue when restarting a Participant). 
>  
> 2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage 
>  - Error computing assignment for resource 
> Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117.
>  Skipping.
> java.lang.NullPointerException: Name is null
>         at java.lang.Enum.valueOf(Enum.java:236)
>         at 
> org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
>         at 
> org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
>         at 
> org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
>         at 
> org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
>         at 
> org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
>         at 
> org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
>         at 
> org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
>         at 
> org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
> 2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage 
>  - Error computing assignment for resource 
> Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025.
>  Skipping. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (AIRAVATA-2747) OOM issue in Helix Participant

2018-04-11 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha reassigned AIRAVATA-2747:
-

Assignee: Dimuthu Upeksha

> OOM issue in Helix Participant
> --
>
> Key: AIRAVATA-2747
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2747
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
> Attachments: airavata.log, threaddump-oom.log
>
>
> There seems to be a memory leak in the Helix Participant when creating SSH 
> sessions.
> 2018-04-11 16:06:35,916 [TaskStateModelFactory-task_thread] INFO 
> o.a.a.h.i.t.s.DataStagingTask 
> process=PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970, 
> task=TASK_049812b4-5462-45dd-95a1-9c1db3a5cf73, 
> experiment=SLM001-NEK5000-BR2_08789b1b-feff-46f9-9f4b-67ee9ded280d, 
> gateway=default - Downloading output file 
> /N/dc2/scratch/cgateway/gta-work-dirs/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/NEK5000.stdout
>  to the local path 
> /tmp/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/temp_inputs/NEK5000.stdout
> 2018-04-11 16:06:35,929 [TaskStateModelFactory-task_thread] ERROR 
> o.apache.helix.task.TaskRunner - Problem running the task, report task as 
> FAILED.
> java.lang.OutOfMemoryError: unable to create new native thread
>  at java.lang.Thread.start0(Native Method)
>  at java.lang.Thread.start(Thread.java:717)
>  at com.jcraft.jsch.Session.connect(Session.java:528)
>  at com.jcraft.jsch.Session.connect(Session.java:183)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
>  at 
> org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:58)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268)
>  at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82)
>  at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRAVATA-2747) OOM issue in Helix Participant

2018-04-11 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha updated AIRAVATA-2747:
--
Attachment: airavata.log
threaddump-oom.log

> OOM issue in Helix Participant
> --
>
> Key: AIRAVATA-2747
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2747
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Priority: Major
> Attachments: airavata.log, threaddump-oom.log
>
>
> There seems to be a memory leak in the Helix Participant when creating SSH 
> sessions.
> 2018-04-11 16:06:35,916 [TaskStateModelFactory-task_thread] INFO 
> o.a.a.h.i.t.s.DataStagingTask 
> process=PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970, 
> task=TASK_049812b4-5462-45dd-95a1-9c1db3a5cf73, 
> experiment=SLM001-NEK5000-BR2_08789b1b-feff-46f9-9f4b-67ee9ded280d, 
> gateway=default - Downloading output file 
> /N/dc2/scratch/cgateway/gta-work-dirs/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/NEK5000.stdout
>  to the local path 
> /tmp/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/temp_inputs/NEK5000.stdout
> 2018-04-11 16:06:35,929 [TaskStateModelFactory-task_thread] ERROR 
> o.apache.helix.task.TaskRunner - Problem running the task, report task as 
> FAILED.
> java.lang.OutOfMemoryError: unable to create new native thread
>  at java.lang.Thread.start0(Native Method)
>  at java.lang.Thread.start(Thread.java:717)
>  at com.jcraft.jsch.Session.connect(Session.java:528)
>  at com.jcraft.jsch.Session.connect(Session.java:183)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
>  at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
>  at 
> org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:58)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268)
>  at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82)
>  at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2747) OOM issue in Helix Participant

2018-04-11 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2747:
-

 Summary: OOM issue in Helix Participant
 Key: AIRAVATA-2747
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2747
 Project: Airavata
  Issue Type: Bug
  Components: helix implementation
Reporter: Dimuthu Upeksha


There seems to be a memory leak in the Helix Participant when creating SSH 
sessions.

2018-04-11 16:06:35,916 [TaskStateModelFactory-task_thread] INFO 
o.a.a.h.i.t.s.DataStagingTask 
process=PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970, 
task=TASK_049812b4-5462-45dd-95a1-9c1db3a5cf73, 
experiment=SLM001-NEK5000-BR2_08789b1b-feff-46f9-9f4b-67ee9ded280d, 
gateway=default - Downloading output file 
/N/dc2/scratch/cgateway/gta-work-dirs/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/NEK5000.stdout
 to the local path 
/tmp/PROCESS_ad3fd791-a165-4e1d-bf25-cf4fa86c1970/temp_inputs/NEK5000.stdout
2018-04-11 16:06:35,929 [TaskStateModelFactory-task_thread] ERROR 
o.apache.helix.task.TaskRunner - Problem running the task, report task as 
FAILED.
java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:717)
 at com.jcraft.jsch.Session.connect(Session.java:528)
 at com.jcraft.jsch.Session.connect(Session.java:183)
 at 
org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:81)
 at 
org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.init(SshAgentAdaptor.java:112)
 at 
org.apache.airavata.helix.core.support.AdaptorSupportImpl.fetchAdaptor(AdaptorSupportImpl.java:59)
 at 
org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:58)
 at 
org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:268)
 at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82)
 at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:748)
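The stack trace suggests that SSH sessions (and the native threads they spawn) 
are being created faster than they are released. As a purely illustrative 
sketch (not the actual SshAgentAdaptor code), the pattern below makes sure 
every JSch session opened for a command is disconnected even on failure; the 
host, user, and key path are placeholders.

{code:java}
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

/** Illustrative only: always disconnect JSch channels and sessions when done. */
public class SshSessionUsage {

    public static int runCommand(String host, String user, String privateKeyPath,
                                 String command) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity(privateKeyPath);              // key path is a placeholder
        Session session = jsch.getSession(user, host, 22);
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();                             // each connect spawns native threads
        ChannelExec channel = null;
        try {
            channel = (ChannelExec) session.openChannel("exec");
            channel.setCommand(command);
            channel.connect();
            while (!channel.isClosed()) {
                Thread.sleep(100);                     // wait for the remote command to finish
            }
            return channel.getExitStatus();
        } finally {
            if (channel != null) {
                channel.disconnect();
            }
            session.disconnect();                      // without this, threads leak until
                                                       // "unable to create new native thread"
        }
    }
}
{code}

Whether the leak is actually in the adaptor or elsewhere still needs to be 
confirmed against the attached thread dump.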



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRAVATA-2750) Helix Participant is not picking up tasks after a restart

2018-04-11 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha updated AIRAVATA-2750:
--
Component/s: helix implementation

> Helix Participant is not picking up tasks after a restart
> -
>
> Key: AIRAVATA-2750
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2750
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Priority: Major
>
> The Helix Participant was restarted due to an OOM issue, and afterwards it 
> did not pick up any tasks. Changing the participant name fixed that. 
> Controller log:
>  
> 2018-04-11 19:17:41,850 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,850 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_e14813b1-a93b-47c8-9faa-634b3cdf47b7-POST-f9e7f2c1-e3af-4f46-8740-b71289e23270_TASK_70f5baae-6e11-4448-9962-e7a964cdff37
>  new assignment []
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> resource:Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - resource 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
>  use idealStateRebalancer org.apache.helix.task.JobRebalancer
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Computer Best Partition for job: 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
> 2018-04-11 19:17:41,860 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,861 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,871 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
>  new assignment []
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> resource:Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - resource 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
>  use idealStateRebalancer org.apache.helix.task.JobRebalancer
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Computer Best Partition for job: 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
> 2018-04-11 19:17:41,873 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,873 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,884 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
>  new assignment []
> 2018-04-11 19:17:41,884 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> 

[jira] [Assigned] (AIRAVATA-2750) Helix Participant is not picking up tasks after a restart

2018-04-11 Thread Dimuthu Upeksha (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha reassigned AIRAVATA-2750:
-

Assignee: Dimuthu Upeksha

> Helix Participant is not picking up tasks after a restart
> -
>
> Key: AIRAVATA-2750
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2750
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> The Helix Participant was restarted due to an OOM issue, and afterwards it 
> did not pick up any tasks. Changing the participant name fixed that. 
> Controller log:
>  
> 2018-04-11 19:17:41,850 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,850 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_e14813b1-a93b-47c8-9faa-634b3cdf47b7-POST-f9e7f2c1-e3af-4f46-8740-b71289e23270_TASK_70f5baae-6e11-4448-9962-e7a964cdff37
>  new assignment []
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> resource:Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - resource 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
>  use idealStateRebalancer org.apache.helix.task.JobRebalancer
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Computer Best Partition for job: 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
> 2018-04-11 19:17:41,860 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,861 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,871 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
>  new assignment []
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> resource:Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - resource 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
>  use idealStateRebalancer org.apache.helix.task.JobRebalancer
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Computer Best Partition for job: 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
> 2018-04-11 19:17:41,873 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,873 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,884 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
>  new assignment []
> 2018-04-11 19:17:41,884 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> 

[jira] [Commented] (AIRAVATA-2743) Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling at cluster side

2018-04-09 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431625#comment-16431625
 ] 

Dimuthu Upeksha commented on AIRAVATA-2743:
---

Fixed in 
https://github.com/apache/airavata/commit/f912d39d37e85d0ac9b3a5c4a027714d17e208f2

> Experiment in CANCELLED while job is still QUEUED or SUBMITTED and canceling 
> at cluster side
> 
>
> Key: AIRAVATA-2743
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2743
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Submit an experiment
>  # Cancel the experiment in PGA
>  # Experiment status changes to CANCELING
>  # Experiment status changes to CANCELLED while job is in either SUBMITTED or 
> QUEUED.
>  # Experiment status should change to CANCELLED only after the job status 
> changes to an end status (CANCELLED, COMPLETED or FAILED).
>  #



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2742) Helix Controller throws an Exception when the participant is killed

2018-04-09 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2742:
-

 Summary: Helix Controller throws an Exception when the participant 
is killed
 Key: AIRAVATA-2742
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2742
 Project: Airavata
  Issue Type: Bug
  Components: helix implementation
Affects Versions: 0.18
Reporter: Dimuthu Upeksha


This was a sporadic issue and occurred only once in the test setup. There were 
5 - 10 tasks running in the Participant, and the Participant was externally 
killed by a SIGTERM command (kill . Once the Participant was started again, it 
did not pick up the tasks that it was running at the time it was killed. 
Surprisingly, the respective workflows were still in IN_PROGRESS status. The 
Helix Controller log showed the following error for each workflow. This seems 
like a bug in Helix, and I posted the issue to the Helix mailing list 
(Subject : Sporadic issue when restarting a Participant). 

 
2018-04-06 15:10:57,766 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage  
- Error computing assignment for resource 
Workflow_of_process_PROCESS_7f6c8a54-b50f-4bdb-aafd-59ce87276527-POST-b5e39e07-2d8e-4309-be5a-f5b6067f9a24_TASK_cc8039e5-f054-4dea-8c7f-07c98077b117.
 Skipping.
java.lang.NullPointerException: Name is null
        at java.lang.Enum.valueOf(Enum.java:236)
        at 
org.apache.helix.task.TaskPartitionState.valueOf(TaskPartitionState.java:25)
        at 
org.apache.helix.task.JobRebalancer.computeResourceMapping(JobRebalancer.java:272)
        at 
org.apache.helix.task.JobRebalancer.computeBestPossiblePartitionState(JobRebalancer.java:140)
        at 
org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:171)
        at 
org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:66)
        at 
org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:48)
        at 
org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:295)
        at 
org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:595)
2018-04-06 15:11:00,385 [Thread-3] ERROR o.a.h.c.s.BestPossibleStateCalcStage  
- Error computing assignment for resource 
Workflow_of_process_PROCESS_2b69b499-c527-4c9d-8b2b-db17366f5f81-POST-c67607ae-9177-4a02-af8a-8b3751eea4ff_TASK_1ea6876d-f2ec-4139-a15d-0e64a80a3025.
 Skipping. 
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2467) When given *.chk as the value for the output file in an application. File(s) with that extension is not listed in the summary.

2018-04-07 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429547#comment-16429547
 ] 

Dimuthu Upeksha commented on AIRAVATA-2467:
---

Fixed in [https://github.com/apache/airavata/pull/190]

> When given *.chk as the value for the output file in an application. File(s) 
> with that extension is not listed in the summary.
> --
>
> Key: AIRAVATA-2467
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2467
> Project: Airavata
>  Issue Type: Bug
>  Components: PGA PHP Web Gateway
>Affects Versions: 0.17
> Environment: https://dev.seagrid.org
>Reporter: Eroma
>Assignee: Supun Chathuranga Nakandala
>Priority: Major
> Fix For: 0.18
>
>
> When * is given, the system takes * as the literal name of the file instead 
> of bringing back all the files with the .chk extension. Using * should mean 
> bringing back all the files with the extension provided.
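For illustration only (the actual fix is in the PR above), an output pattern 
such as *.chk can be expanded with a glob matcher over the working directory; 
the class name here is made up for the example:

{code:java}
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative expansion of a wildcard output-file value such as "*.chk". */
public class OutputGlob {

    public static List<Path> expand(String workingDir, String pattern) throws Exception {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get(workingDir))) {
            for (Path entry : stream) {
                if (matcher.matches(entry.getFileName())) {  // match on the file name only
                    matches.add(entry);
                }
            }
        }
        return matches;                                      // empty if nothing matched
    }
}
{code}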



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2740) Non-existing file transfer has failed the experiment

2018-04-06 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428425#comment-16428425
 ] 

Dimuthu Upeksha commented on AIRAVATA-2740:
---

Fixed in 

[https://github.com/apache/airavata/commit/721a55abf5b7dff1260f4fd23395003b5460f5e0]

[https://github.com/apache/airavata/commit/e45108a00b3542907fc5494a2a4ef288a1fa9e3b]

https://github.com/apache/airavata/commit/7bb426a243135e97cda181850fa6b48f1d5e059d

> Non-existing file transfer has failed the experiment
> 
>
> Key: AIRAVATA-2740
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2740
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: http://149.165.168.248:8008/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> # Due to a 0 byte input file upload, the expected output file is not 
> generated during the application execution. 
>  # Airavata tries to transfer the file and an error is thrown as the file 
> does not exist.
>  # But a 0 byte output file is created in the gateway data storage and the 
> user can view it.
>  # The exception thrown is [1].
>  # If a specified output file is not in the working directory, it should be 
> ignored, not turned into an empty file in the storage.
>  # If a 0 byte file exists in the working directory, it should be 
> transferred to the storage.
> [1]
> org.apache.airavata.agents.api.AgentException: java.io.FileNotFoundException: 
> /tmp/PROCESS_ac1064b1-226e-491c-91e6-303448f05f16/temp_inputs/Gaussian.log 
> (No such file or directory) at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.copyFileTo(SshAgentAdaptor.java:307)
>  at 
> org.apache.airavata.helix.agent.storage.StorageResourceAdaptorImpl.uploadFile(StorageResourceAdaptorImpl.java:98)
>  at 
> org.apache.airavata.helix.impl.task.staging.DataStagingTask.transferFileToStorage(DataStagingTask.java:158)
>  at 
> org.apache.airavata.helix.impl.task.staging.OutputDataStagingTask.onRun(OutputDataStagingTask.java:163)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:265) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:70) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> java.io.FileNotFoundException: 
> /tmp/PROCESS_ac1064b1-226e-491c-91e6-303448f05f16/temp_inputs/Gaussian.log 
> (No such file or directory) at java.io.FileInputStream.open0(Native Method) 
> at java.io.FileInputStream.open(FileInputStream.java:195) at 
> java.io.FileInputStream.(FileInputStream.java:138) at 
> java.io.FileInputStream.(FileInputStream.java:93) at 
> org.apache.airavata.helix.agent.ssh.SshAgentAdaptor.copyFileTo(SshAgentAdaptor.java:276)
>  ... 13 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2734) Job submitted and running in HPC while the experiment is in LAUNCHED. Experiment status should be EXECUTING.

2018-04-03 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424532#comment-16424532
 ] 

Dimuthu Upeksha commented on AIRAVATA-2734:
---

Fixed in 
https://github.com/apache/airavata/commit/bf3943a37fc182e7ad884c9683e8563f4bc29d5b

> Experiment status is LAUNCHED while the job is ACTIVE. Experiment status 
> should be EXECUTING.
> 
>
> Key: AIRAVATA-2734
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2734
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Experiment status should change to EXECUTING when it is picked up by helix. 
> Once the status changes to EXECUTING the job status will get changed to 
> SUBMITTED, QUEUED and ACTIVE.
> Once the job is COMPLETED, experiment status will change to COMPLETED after 
> the output files transfers are completed.
>  
> Currently experiment status is LAUNCHED but the job is submitted and running. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2737) Too many Zookeeper connections created

2018-04-03 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424534#comment-16424534
 ] 

Dimuthu Upeksha commented on AIRAVATA-2737:
---

Fixed in 
https://github.com/apache/airavata/commit/8f7dc3dc8889bd21cb00911d323a66721a960c81

> Too many Zookeeper connections created
> --
>
> Key: AIRAVATA-2737
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2737
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> For each task in a workflow a ZooKeeper connection is opened. This creates 
> too many ZooKeeper connections, and as a result some experiments are not 
> moving past LAUNCHED.
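For illustration only (the actual fix is in the commit above), the usual 
remedy is to share one lazily created ZooKeeper client across tasks instead of 
opening a connection per task. The sketch below assumes Apache Curator, which 
already appears in the Airavata stack traces elsewhere in this thread; the 
class name is made up for the example.

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

/** Illustrative: one ZooKeeper connection shared by all tasks in the JVM. */
public final class SharedCuratorClient {

    private static volatile CuratorFramework client;

    public static CuratorFramework get(String zkConnectionString) {
        if (client == null) {
            synchronized (SharedCuratorClient.class) {
                if (client == null) {
                    // Single connection with retries, instead of one per task.
                    client = CuratorFrameworkFactory.newClient(
                            zkConnectionString, new ExponentialBackoffRetry(1000, 3));
                    client.start();
                }
            }
        }
        return client;
    }

    private SharedCuratorClient() {
    }
}
{code}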



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2733) Improvements to Helix log messages

2018-04-03 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424528#comment-16424528
 ] 

Dimuthu Upeksha commented on AIRAVATA-2733:
---

Fixed in

[https://github.com/apache/airavata/commit/8f7dc3dc8889bd21cb00911d323a66721a960c81]

https://github.com/apache/airavata/commit/55747caf5f11ebfb2507d96e83f61a9938ceb857

> Improvements to Helix log messages
> --
>
> Key: AIRAVATA-2733
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2733
> Project: Airavata
>  Issue Type: Improvement
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> New additions to the current Helix log messages
>  # Add the job submission command to the log. Currently it is not there and 
> only the job status is there.
>  # Print the complete job submission response from the cluster, this is 
> useful when an experiment and/or job fails to investigate.
>  # Print both token and description on the log for the credential store token 
> in use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2717) [GSoC] Resurrect User-Defined Airavata Workflows

2018-03-22 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409595#comment-16409595
 ] 

Dimuthu Upeksha commented on AIRAVATA-2717:
---

Hi Yasas,

It is a bit outdated, but you can get an idea by looking at the document. 
Recently we developed a task execution framework on top of Apache Helix which 
satisfies most of the requirements mentioned in the document. You can refer to 
the discussion thread [1] under the subject "Evaluating Helix as the task 
execution framework" to get a better understanding of the design, and you can 
see the currently implemented tasks in [2]. However, at the moment we 
statically bind the tasks into a workflow [3]. A workflow is just a sequence 
of tasks whose order is predefined and embedded in the current orchestrator 
code. For example, we run Environment Setup tasks -> Input Data Staging tasks 
-> Job Submission task and so on. In future we should have more flexibility: 
rather than statically defining the order of tasks inside the orchestrator 
code, we should be able to supply that order from the outside. That's why we 
need a workflow language and a way to interpret it. 

Ideally the flow would be

Create a workflow description -> submit to orchestrator -> orchestrator parses 
the workflow -> Submit to task execution engine
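To make the last step concrete, here is a rough sketch (not the actual 
Airavata classes) of handing such a DAG to the Helix task framework; the 
cluster name, ZooKeeper address, and job/task names are placeholders.

{code:java}
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

import java.util.Collections;

/** Illustrative: a parsed workflow description turned into a Helix DAG. */
public class WorkflowSubmitExample {

    public static void main(String[] args) throws Exception {
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
                "AiravataCluster", "workflow-submitter", InstanceType.SPECTATOR, "localhost:2181");
        manager.connect();

        Workflow.Builder workflow = new Workflow.Builder("Workflow_of_process_example");
        workflow.addJob("EnvSetup", job("EnvSetupTask"));
        workflow.addJob("InputStaging", job("InputDataStagingTask"));
        workflow.addJob("JobSubmission", job("DefaultJobSubmissionTask"));

        // The ordering that is currently hard coded in the orchestrator would
        // instead be derived from the parsed workflow description.
        workflow.addParentChildDependency("EnvSetup", "InputStaging");
        workflow.addParentChildDependency("InputStaging", "JobSubmission");

        new TaskDriver(manager).start(workflow.build());
        manager.disconnect();
    }

    // One single-task job per step; the command must match a task type
    // registered on the participants.
    private static JobConfig.Builder job(String taskType) {
        TaskConfig task = new TaskConfig.Builder()
                .setTaskId(taskType.toLowerCase() + "-0")
                .setCommand(taskType)
                .build();
        return new JobConfig.Builder().addTaskConfigs(Collections.singletonList(task));
    }
}
{code}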

To create the workflow you can simply write it in a text file, or use a GUI to 
generate it. If you follow the mail thread that I mentioned, there is an image 
that illustrates such a GUI tool.

If you need more information, let's move to dev mailing list

[1] [http://mail-archives.apache.org/mod_mbox/airavata-dev/201711.mbox/thread]

[2] 
[https://github.com/apache/airavata/tree/develop/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/task]

[3] 
[https://github.com/apache/airavata/blob/develop/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/workflow/PreWorkflowManager.java#L83]

Thanks

Dimuthu

> [GSoC] Resurrect User-Defined Airavata Workflows 
> -
>
> Key: AIRAVATA-2717
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2717
> Project: Airavata
>  Issue Type: Epic
>Affects Versions: 0.17
>Reporter: Suresh Marru
>Priority: Major
>  Labels: gsoc2018
>
> Airavata used to support user-defined workflows through an interface, XBaya, 
> used to drag and drop application components onto a workspace and define 
> data flow and control flow dependencies among the application nodes. 
> Airavata's workflow system was used for composing, executing, and monitoring 
> workflow graphs of primarily web service components. The workflow 
> description was a high-level abstraction and used to be converted to 
> lower-level execution runtimes like BPEL, SCUFL and Python scripts.
>  
> Airavata has evolved significantly and the current development version is 
> being built over Apache Helix for DAG orchestration. This provides an 
> opportunity to resurrect workflow capabilities in Airavata. 
> This GSoC project involves finalizing an Airavata Workflow Language; 
> modifying the orchestrator to parse user-described workflows and translate 
> them to equivalent Helix DAGs; executing and monitoring the workflows; and 
> developing a simple UI to demonstrate the capabilities. 
> To describe the workflows, you can build on this - 
> [https://docs.google.com/document/d/1eh7BV8CHupxyM2jeqcM2tUG5MnXFt7hNDX4PQDfxCcM/edit]
>  or follow other discussions like - 
> https://issues.apache.org/jira/browse/AIRAVATA-2555 and 
> User community & Impact of the software: Airavata is primarily targeted to 
> build science gateways using computational resources from various 
> disciplines. The initial targeted set of gateways include projects supporting 
> research and education in chemistry, biophysics, and geosciences . The goal 
> of airavata is to enhance productivity of these gateways to utilize 
> cyberinfrastructure of resources (e.g., local lab resources, the Extreme 
> Science and Engineering Discovery Environment (XSEDE), University Clusters, 
> Academic and Commercial Computational Clouds. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2718) [GSoC] Re-architect Output Data Parsing into Airavata core

2018-03-22 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409608#comment-16409608
 ] 

Dimuthu Upeksha commented on AIRAVATA-2718:
---

Hi Lahiru,

Thanks for your interest. One possible architecture for generalizing data 
parsers can be found in [1]. However, you are free to come up with your own 
design; just try to utilize the current task execution framework. You can get 
a good insight into the task framework by referring to my comment in [2]. If 
you need further clarifications, let's discuss on the dev list.

[1] 
[https://docs.google.com/presentation/d/1CiPLE6Ht9ynNC9R9Bk0U7yHlsqw2g8ONTDxxDs6R_MY/edit?usp=sharing]

[2] https://issues.apache.org/jira/browse/AIRAVATA-2717

Thanks

Dimuthu

> [GSoC] Re-architect Output Data Parsing into Airavata core
> --
>
> Key: AIRAVATA-2718
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2718
> Project: Airavata
>  Issue Type: Epic
>Reporter: Suresh Marru
>Priority: Major
>
> As discussed in this paper [1]  Airavata based SEAGrid gateway has prototyped 
> a data catalog system [2]. [3] and [4] are also related references. The new 
> airavata execution architecture in develop branch is based on Apache Helix. 
> This provides an opportunity to re-architect the data catalog and build it on 
> new Helix DAG based execution within Airavata. 
> This project involves 
> * Implementing the data parsers as Airavata tasks and executing them as 
> Helix DAGs. 
>  * Incorporate the MongoDB based search and catalog registry and explore 
> Thrift API's.
>  * Modify the current simple UI into the new Django portal.
>  * Generalize the data catalog. 
>  * Publish a paper [optional]
> [1] - 
> [https://pdfs.semanticscholar.org/2938/686c5c7eecb1b82ce8064b30555298bd649e.pdf]
> [2] - https://github.com/SciGaP/seagrid-data
> [3] - 
> https://www.researchgate.net/profile/Suresh_Marru/publication/275948320_Scientific_Data_Cataloging_System/links/554a05680cf2e859ce18afb4.pdf
> [4] - 
> https://www.researchgate.net/profile/Dilum_Bandara/publication/282989239_Schema-independent_scientific_data_cataloging_framework/links/5653a40508aeafc2aabb59e8/Schema-independent-scientific-data-cataloging-framework.pdf
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2717) [GSoC] Resurrect User-Defined Airavata Workflows

2018-03-23 Thread Dimuthu Upeksha (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRAVATA-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411604#comment-16411604
 ] 

Dimuthu Upeksha commented on AIRAVATA-2717:
---

[~marcuschristie] Yeah I guess that's what we want. Sorry for my 
misrepresentation. Let's use a higher level workflow to connect applications 
rather than composing everything in a single workflow. 

> [GSoC] Resurrect User-Defined Airavata Workflows 
> -
>
> Key: AIRAVATA-2717
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2717
> Project: Airavata
>  Issue Type: Epic
>Affects Versions: 0.17
>Reporter: Suresh Marru
>Priority: Major
>  Labels: gsoc2018
>
> Airavata used to support user-defined workflows through an interface, XBaya, 
> used to drag and drop application components onto a workspace and define 
> data flow and control flow dependencies among the application nodes. 
> Airavata's workflow system was used for composing, executing, and monitoring 
> workflow graphs of primarily web service components. The workflow 
> description was a high-level abstraction and used to be converted to 
> lower-level execution runtimes like BPEL, SCUFL and Python scripts.
>  
> Airavata has evolved significantly and the current development version is 
> being built over Apache Helix for DAG orchestration. This provides an 
> opportunity to resurrect workflow capabilities in Airavata. 
> This GSoC project involves finalizing an Airavata Workflow Language; 
> modifying the orchestrator to parse user-described workflows and translate 
> them to equivalent Helix DAGs; executing and monitoring the workflows; and 
> developing a simple UI to demonstrate the capabilities. 
> To describe the workflows, you can build on this - 
> [https://docs.google.com/document/d/1eh7BV8CHupxyM2jeqcM2tUG5MnXFt7hNDX4PQDfxCcM/edit]
>  or follow other discussions like - 
> https://issues.apache.org/jira/browse/AIRAVATA-2555 and 
> User community & Impact of the software: Airavata is primarily targeted to 
> build science gateways using computational resources from various 
> disciplines. The initial targeted set of gateways include projects supporting 
> research and education in chemistry, biophysics, and geosciences . The goal 
> of airavata is to enhance productivity of these gateways to utilize 
> cyberinfrastructure of resources (e.g., local lab resources, the Extreme 
> Science and Engineering Discovery Environment (XSEDE), University Clusters, 
> Academic and Commercial Computational Clouds. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2940) Sporadic JPA errors when invoking Registry Server APIs

2018-11-12 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2940:
-

 Summary: Sporadic JPA errors when invoking Registry Server APIs
 Key: AIRAVATA-2940
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2940
 Project: Airavata
  Issue Type: Bug
  Components: Registry API
Affects Versions: 0.17
 Environment: staging
Reporter: Dimuthu Upeksha
Assignee: Dimuthu Upeksha


This issue occurs randomly in different registry components. It seems like a 
general JPA bug or a misuse of the JPA APIs in the registry code. 
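The trace below shows commit() being invoked on a persistence context that has 
already been closed. Purely as an illustration (not code from the registry), the 
sketch below contrasts that general misuse pattern with a safer per-operation 
scope in plain JPA; the persistence unit name is a placeholder.

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class EntityManagerScopeSketch {

    // Anti-pattern: a shared EntityManager that some other code path may close,
    // so a later commit() fails with "The context has been closed".
    static EntityManager sharedEntityManager;

    // Safer pattern: create, use, and close the EntityManager in a single scope.
    static void saveSafely(Object entity) {
        // In production the factory would be created once and reused.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("placeholder-unit");
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            em.persist(entity);
            em.getTransaction().commit();       // the context is guaranteed to be open here
        } finally {
            if (em.getTransaction().isActive()) {
                em.getTransaction().rollback(); // never leave a dangling transaction
            }
            em.close();
            emf.close();
        }
    }
}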

2018-11-10 18:29:28,003 [pool-10-thread-208241] ERROR 
o.a.a.r.c.a.c.i.ApplicationDeploymentImpl - Error while retrieving application 
deployment...
org.apache.airavata.registry.cpi.AppCatalogException: 
 
org.apache.openjpa.persistence.InvalidStateException: The context has been 
closed. The stack trace at which the context was closed is available if 
Runtime=TRACE logging is enabled.
 at 
org.apache.airavata.registry.core.app.catalog.resources.LibraryApendPathResource.get(LibraryApendPathResource.java:214)
 at 
org.apache.airavata.registry.core.app.catalog.util.AppCatalogThriftConversion.getApplicationDeploymentDescription(AppCatalogThriftConversion.java:758)
 at 
org.apache.airavata.registry.core.app.catalog.impl.ApplicationDeploymentImpl.getApplicationDeployement(ApplicationDeploymentImpl.java:326)
 at 
org.apache.airavata.registry.api.service.handler.RegistryServerHandler.getApplicationDeployment(RegistryServerHandler.java:1211)
 at 
org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14835)
 at 
org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14819)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.openjpa.persistence.InvalidStateException: The context 
has been closed. The stack trace at which the context was closed is available 
if Runtime=TRACE logging is enabled.
 at org.apache.openjpa.kernel.BrokerImpl.assertOpen(BrokerImpl.java:4676)
 at org.apache.openjpa.kernel.BrokerImpl.beginOperation(BrokerImpl.java:1930)
 at org.apache.openjpa.kernel.BrokerImpl.commit(BrokerImpl.java:1503)
 at org.apache.openjpa.kernel.DelegatingBroker.commit(DelegatingBroker.java:933)
 at 
org.apache.openjpa.persistence.EntityManagerImpl.commit(EntityManagerImpl.java:570)
 at 
org.apache.airavata.registry.core.app.catalog.resources.LibraryApendPathResource.get(LibraryApendPathResource.java:205)
 ... 11 common frames omitted
2018-11-10 18:29:28,003 [pool-10-thread-208241] ERROR 
o.a.a.r.a.s.h.RegistryServerHandler - 
comet.sdsc.edu_Ultrascan_0091a13a-1fe5-41cf-8708-79a987e3021a
org.apache.airavata.registry.cpi.AppCatalogException: 
org.apache.airavata.registry.cpi.AppCatalogException: 
 
org.apache.openjpa.persistence.InvalidStateException: The context has been 
closed. The stack trace at which the context was closed is available if 
Runtime=TRACE logging is enabled.
 at 
org.apache.airavata.registry.core.app.catalog.impl.ApplicationDeploymentImpl.getApplicationDeployement(ApplicationDeploymentImpl.java:329)
 at 
org.apache.airavata.registry.api.service.handler.RegistryServerHandler.getApplicationDeployment(RegistryServerHandler.java:1211)
 at 
org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14835)
 at 
org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14819)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.airavata.registry.cpi.AppCatalogException: 
 
org.apache.openjpa.persistence.InvalidStateException: The context has been 
closed. The stack trace at which the context was closed is available if 
Runtime=TRACE logging is enabled.
 at 
org.apache.airavata.registry.core.app.catalog.resources.LibraryApendPathResource.get(LibraryApendPathResource.java:214)
 at 
org.apache.airavata.registry.core.app.catalog.util.AppCatalogThriftConversion.getApplicationDeploymentDescription(AppCatalogThriftConversion.java:758)
 at 

[jira] [Commented] (AIRAVATA-2940) Sporadic JPA errors when invoking Registry Server APIs

2018-11-12 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684215#comment-16684215
 ] 

Dimuthu Upeksha commented on AIRAVATA-2940:
---

Still couldn't identify the cause of the issue, but retrying the API call returns 
the result. So this was fixed on the Helix side by retrying when an API call fails.

https://github.com/apache/airavata/commit/274c73ffcc226daabfbe213a27b8f10ad53dac0b
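The commit above adds the retry handling on the Helix side. As a rough 
illustration of the idea only (not the code in that commit), a generic retry 
wrapper around a flaky registry call could look like this; the attempt count and 
backoff are made-up values.

import java.util.concurrent.Callable;

public class RetryingCall {

    // Retries a call a fixed number of times with a simple linear backoff.
    public static <T> T withRetry(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;   // all attempts failed; surface the last error
    }

    public static void main(String[] args) throws Exception {
        // Usage example with a stubbed lookup standing in for a registry API call.
        String deployment = withRetry(() -> "application-deployment-record", 3, 500);
        System.out.println(deployment);
    }
}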

> Sporadic JPA errors when invoking Registry Server APIs
> --
>
> Key: AIRAVATA-2940
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2940
> Project: Airavata
>  Issue Type: Bug
>  Components: Registry API
>Affects Versions: 0.17
> Environment: staging
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> This issue occurs randomly in different registry components. It seems like a 
> general JPA bug or a misuse of the JPA APIs in the registry code. 
> 2018-11-10 18:29:28,003 [pool-10-thread-208241] ERROR 
> o.a.a.r.c.a.c.i.ApplicationDeploymentImpl - Error while retrieving 
> application deployment...
> org.apache.airavata.registry.cpi.AppCatalogException: 
>  
> org.apache.openjpa.persistence.InvalidStateException: The context has been 
> closed. The stack trace at which the context was closed is available if 
> Runtime=TRACE logging is enabled.
>  at 
> org.apache.airavata.registry.core.app.catalog.resources.LibraryApendPathResource.get(LibraryApendPathResource.java:214)
>  at 
> org.apache.airavata.registry.core.app.catalog.util.AppCatalogThriftConversion.getApplicationDeploymentDescription(AppCatalogThriftConversion.java:758)
>  at 
> org.apache.airavata.registry.core.app.catalog.impl.ApplicationDeploymentImpl.getApplicationDeployement(ApplicationDeploymentImpl.java:326)
>  at 
> org.apache.airavata.registry.api.service.handler.RegistryServerHandler.getApplicationDeployment(RegistryServerHandler.java:1211)
>  at 
> org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14835)
>  at 
> org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14819)
>  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.openjpa.persistence.InvalidStateException: The context 
> has been closed. The stack trace at which the context was closed is available 
> if Runtime=TRACE logging is enabled.
>  at org.apache.openjpa.kernel.BrokerImpl.assertOpen(BrokerImpl.java:4676)
>  at org.apache.openjpa.kernel.BrokerImpl.beginOperation(BrokerImpl.java:1930)
>  at org.apache.openjpa.kernel.BrokerImpl.commit(BrokerImpl.java:1503)
>  at 
> org.apache.openjpa.kernel.DelegatingBroker.commit(DelegatingBroker.java:933)
>  at 
> org.apache.openjpa.persistence.EntityManagerImpl.commit(EntityManagerImpl.java:570)
>  at 
> org.apache.airavata.registry.core.app.catalog.resources.LibraryApendPathResource.get(LibraryApendPathResource.java:205)
>  ... 11 common frames omitted
> 2018-11-10 18:29:28,003 [pool-10-thread-208241] ERROR 
> o.a.a.r.a.s.h.RegistryServerHandler - 
> comet.sdsc.edu_Ultrascan_0091a13a-1fe5-41cf-8708-79a987e3021a
> org.apache.airavata.registry.cpi.AppCatalogException: 
> org.apache.airavata.registry.cpi.AppCatalogException: 
>  
> org.apache.openjpa.persistence.InvalidStateException: The context has been 
> closed. The stack trace at which the context was closed is available if 
> Runtime=TRACE logging is enabled.
>  at 
> org.apache.airavata.registry.core.app.catalog.impl.ApplicationDeploymentImpl.getApplicationDeployement(ApplicationDeploymentImpl.java:329)
>  at 
> org.apache.airavata.registry.api.service.handler.RegistryServerHandler.getApplicationDeployment(RegistryServerHandler.java:1211)
>  at 
> org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14835)
>  at 
> org.apache.airavata.registry.api.RegistryService$Processor$getApplicationDeployment.getResult(RegistryService.java:14819)
>  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> 

[jira] [Resolved] (AIRAVATA-2833) Several experiments failed at various stages of job submission due to connection lost

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2833.
---
Resolution: Fixed

Added job submission retrying logic

> Several experiments failed at various stages of job submission due to 
> connection lost
> -
>
> Key: AIRAVATA-2833
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2833
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> While submitting a batch of jobs, several failed in a single cluster due to 
> lost connections.
> Experiments failed at uploading the input file, transferring output, and creating 
> archive.tar. Error in log [1]. Is there anything we could do here? Try again? 
> Resubmit the task?
>  
>  
> Exp IDs: 
> SLM001-QEspresso-JS:2_d01e50dd-74fe-434a-87b3-e4668b827da5
> SLM001-QEspresso-JS:1_b29c6476-8944-4f6d-8946-b2e9f20b2acf
> SLM001-QEspresso-JS:0_cd3d980d-017e-4ebe-91f7-85d1157feb94
>  
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> cc1c8295-e5ec-44bf-b705-eceddfca3b1a, Task 
> TASK_b6ea333e-7468-4221-8b87-09050d7d053c failed due to Failed uploading the 
> input file to 
> /N/SEAGrid_scratch/PROCESS_1694a674-3dd7-4693-868e-b7fd2b8d/ from local 
> path 
> /tmp/PROCESS_1694a674-3dd7-4693-868e-b7fd2b8d/temp_inputs/Al.sample1.in, 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.staging.InputDataStagingTask.onRun(InputDataStagingTask.java:137)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:90) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.copyFileTo(SSHJAgentAdaptor.java:155)
>  at 
> org.apache.airavata.helix.impl.task.staging.InputDataStagingTask.onRun(InputDataStagingTask.java:119)
>  ... 10 more Caused by: net.schmizz.sshj.connection.ConnectionException: 
> [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds at 
> net.schmizz.keepalive.KeepAliveRunner.checkMaxReached(KeepAliveRunner.java:64)
>  at 
> net.schmizz.keepalive.KeepAliveRunner.doKeepAlive(KeepAliveRunner.java:56) at 
> net.schmizz.keepalive.KeepAlive.run(KeepAlive.java:63)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2831) Experiment FAILED with an error on output file staging! But the file referring in the error is actually downloaded and available in storage.

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2831.
---
Resolution: Fixed

This should be fixed by the data staging retrying implementation.
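The report below notes that the file named in the error had actually reached the 
storage resource, so besides retrying, a staging task can re-check the destination 
before declaring an attempt failed. The sketch below only illustrates that check; 
the local Files.copy call is a stand-in for the real SCP/SFTP transfer, and the 
names are hypothetical.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StagingRetrySketch {

    // Attempts a transfer and, on error, checks whether the file already arrived
    // at the destination before treating the attempt as a failure.
    static boolean stageWithVerification(Path source, Path destination, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Files.copy(source, destination);   // stand-in for the remote transfer
                return true;
            } catch (Exception e) {
                if (Files.exists(destination)) {
                    // An earlier or concurrent attempt already produced the file.
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Path src = Paths.get("Quantum_Espresso.stdout");
        Path dst = Paths.get("staged", "Quantum_Espresso.stdout");
        System.out.println("staged: " + stageWithVerification(src, dst, 3));
    }
}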

> Experiment FAILED with an error on output file staging! But the file 
> referring in the error is actually downloaded and available in storage.
> 
>
> Key: AIRAVATA-2831
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2831
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # When experiments were launched and jobs were submitted, both real-time 
> monitoring and email monitoring were stopped.
>  # Started real-time monitoring, and then the job statuses got updated 
> correctly.
>  # Then stopped the real-time monitoring and started email monitoring.
>  # Job statuses got updated correctly, but the experiment status of some is 
> FAILED with error [1]
>  # But the file is already transferred.
>  # exp ID: SLM005-QEspresso-JS:2_1fec2375-945b-4b21-8157-5e91b1391312 and job 
> iD: 237.torque-server
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> 01ee4646-2139-40b8-840e-348e37b1823f, Task 
> TASK_f5726ea4-638f-4c41-9904-0b3c766fcaee failed due to Error while checking 
> the file 
> /N/SEAGrid_scratch//PROCESS_f0192239-787a-4f8f-b63e-7cb45a837f4a/Quantum_Espresso.stdout
>  existence, net.schmizz.sshj.connection.ConnectionException: 
> [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.staging.OutputDataStagingTask.onRun(OutputDataStagingTask.java:187)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:90) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.doesFileExist(SSHJAgentAdaptor.java:183)
>  at 
> org.apache.airavata.helix.impl.task.staging.DataStagingTask.transferFileToStorage(DataStagingTask.java:141)
>  at 
> org.apache.airavata.helix.impl.task.staging.OutputDataStagingTask.onRun(OutputDataStagingTask.java:172)
>  ... 10 more Caused by: net.schmizz.sshj.connection.ConnectionException: 
> [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds at 
> net.schmizz.keepalive.KeepAliveRunner.checkMaxReached(KeepAliveRunner.java:64)
>  at 
> net.schmizz.keepalive.KeepAliveRunner.doKeepAlive(KeepAliveRunner.java:56) at 
> net.schmizz.keepalive.KeepAlive.run(KeepAlive.java:63)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2826) Helix participant server was stopped and started while experiments are launched and job submissions to Jetstream cluster failed

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2826.
---
Resolution: Fixed

> Helix participant server was stopped and started while experiments are 
> launched and job submissions to Jetstream cluster failed
> ---
>
> Key: AIRAVATA-2826
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2826
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Experiments started launching while the Helix participant was stopped and started.
>  # When the Helix participant was started, jobs to Jetstream in particular 
> failed.
>  # Job submission failed because environment setup failed in Jetstream with 
> error [1] 
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> 658d46e9-b08b-46c0-9701-4bf5eeb23134, Task 
> TASK_f4e3eccf-3e03-4d34-9cf0-7028efd09a40 failed due to Failed to setup 
> environment of task TASK_f4e3eccf-3e03-4d34-9cf0-7028efd09a40, 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:55)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:90) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.createDirectory(SSHJAgentAdaptor.java:146)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:51)
>  ... 10 more Caused by: net.schmizz.sshj.connection.ConnectionException: 
> [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds at 
> net.schmizz.keepalive.KeepAliveRunner.checkMaxReached(KeepAliveRunner.java:64)
>  at 
> net.schmizz.keepalive.KeepAliveRunner.doKeepAlive(KeepAliveRunner.java:56) at 
> net.schmizz.keepalive.KeepAlive.run(KeepAlive.java:63)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2826) Helix participant server was stopped and started while experiments are launched and job submissions to Jetstream cluster failed

2018-09-21 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623942#comment-16623942
 ] 

Dimuthu Upeksha commented on AIRAVATA-2826:
---

Added job submission retrying logic

> Helix participant server was stopped and started while experiments are 
> launched and job submissions to Jetstream cluster failed
> ---
>
> Key: AIRAVATA-2826
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2826
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Experiments started launching while the Helix participant was stopped and started.
>  # When the Helix participant was started, jobs to Jetstream in particular 
> failed.
>  # Job submission failed because environment setup failed in Jetstream with 
> error [1] 
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> 658d46e9-b08b-46c0-9701-4bf5eeb23134, Task 
> TASK_f4e3eccf-3e03-4d34-9cf0-7028efd09a40 failed due to Failed to setup 
> environment of task TASK_f4e3eccf-3e03-4d34-9cf0-7028efd09a40, 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:55)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:90) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: 
> net.schmizz.sshj.connection.ConnectionException: [CONNECTION_LOST] Did not 
> receive any keep-alive response for 25 seconds at 
> org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.createDirectory(SSHJAgentAdaptor.java:146)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:51)
>  ... 10 more Caused by: net.schmizz.sshj.connection.ConnectionException: 
> [CONNECTION_LOST] Did not receive any keep-alive response for 25 seconds at 
> net.schmizz.keepalive.KeepAliveRunner.checkMaxReached(KeepAliveRunner.java:64)
>  at 
> net.schmizz.keepalive.KeepAliveRunner.doKeepAlive(KeepAliveRunner.java:56) at 
> net.schmizz.keepalive.KeepAlive.run(KeepAlive.java:63)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2792) Staging seagrid fails to submit a job

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2792.
---
Resolution: Fixed

> Staging seagrid fails to submit a job
> -
>
> Key: AIRAVATA-2792
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2792
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Sudhakar Pamidighantam
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> f32162d3-9409-4ba9-92c3-aee14c8e5fb4, Task 
> TASK_5bf0a74e-6d0a-48bf-87d1-1af985bd90fc failed due to Failed to setup 
> environment of task TASK_5bf0a74e-6d0a-48bf-87d1-1af985bd90fc, null at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:53)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> java.lang.NullPointerException at 
> org.apache.airavata.helix.impl.task.TaskContext.getComputeResourceCredentialToken(TaskContext.java:422)
>  at 
> org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:45)
>  ... 10 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2790) File uploading error due to session channel opening error occurred!

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2790.
---
Resolution: Fixed

> File uploading error due to session channel opening error occurred!
> ---
>
> Key: AIRAVATA-2790
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2790
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.17
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.17
>
>
> Intermittent error [1] when launching an experiment at file upload. Exp 
> ID: Test1-US-LoneStar5-38_d5273cc4-e4e2-447c-8445-474c00b599ba
>  
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> 63e7ddd0-07d2-4eb2-814e-31f3f8a36c6c, Task 
> TASK_b44a6620-f1f8-4bc7-bd15-286daa41bcf1 failed due to Failed uploading the 
> input file to 
> /scratch/01623/us3/airavata-workingdirs/PROCESS_faad2856-46b2-4dcd-8cfd-8b59fa55343e/
>  from local path 
> /tmp/PROCESS_faad2856-46b2-4dcd-8cfd-8b59fa55343e/temp_inputs/hpcinput-localhost-uslims3_cauma3d-00950.tar,
>  Opening `session` channel failed: open failed at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.staging.InputDataStagingTask.onRun(InputDataStagingTask.java:137)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: Opening `session` channel 
> failed: open failed at 
> org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.copyFileTo(SSHJAgentAdaptor.java:155)
>  at 
> org.apache.airavata.helix.impl.task.staging.InputDataStagingTask.onRun(InputDataStagingTask.java:119)
>  ... 10 more Caused by: Opening `session` channel failed: open failed at 
> net.schmizz.sshj.connection.channel.direct.AbstractDirectChannel.gotOpenFailure(AbstractDirectChannel.java:74)
>  at 
> net.schmizz.sshj.connection.channel.direct.AbstractDirectChannel.gotUnknown(AbstractDirectChannel.java:99)
>  at 
> net.schmizz.sshj.connection.channel.AbstractChannel.handle(AbstractChannel.java:203)
>  at 
> net.schmizz.sshj.connection.ConnectionImpl.handle(ConnectionImpl.java:130) at 
> net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:500) at 
> net.schmizz.sshj.transport.Decoder.decode(Decoder.java:102) at 
> net.schmizz.sshj.transport.Decoder.received(Decoder.java:170) at 
> net.schmizz.sshj.transport.Reader.run(Reader.java:59)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2789) Experiment failed with unexpected error in opening a session channel

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2789.
---
Resolution: Fixed

> Experiment failed with unexpected error in opening a session channel
> 
>
> Key: AIRAVATA-2789
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2789
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.17
> Environment: https://staging.ultrascan.scigap.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.17
>
>
> Experiment failed with [1]. Exp ID: 
> Test1-US-LoneStar5-37_dbcb9fd4-4390-4163-a6c8-2bb92de95ed0
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> 712f954d-8477-4a91-b71e-ad6bd6df6537, Task 
> TASK_4fff49ba-22e5-4975-b920-3c6756ddb8b8 failed due to Task failed due to 
> unexpected issue, Opening `session` channel failed: open failed at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:221)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:311) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.agents.api.AgentException: Opening `session` channel 
> failed: open failed at 
> org.apache.airavata.helix.adaptor.SSHJAgentAdaptor.copyFileTo(SSHJAgentAdaptor.java:155)
>  at 
> org.apache.airavata.helix.impl.task.submission.JobSubmissionTask.submitBatchJob(JobSubmissionTask.java:80)
>  at 
> org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:81)
>  ... 10 more Caused by: Opening `session` channel failed: open failed at 
> net.schmizz.sshj.connection.channel.direct.AbstractDirectChannel.gotOpenFailure(AbstractDirectChannel.java:74)
>  at 
> net.schmizz.sshj.connection.channel.direct.AbstractDirectChannel.gotUnknown(AbstractDirectChannel.java:99)
>  at 
> net.schmizz.sshj.connection.channel.AbstractChannel.handle(AbstractChannel.java:203)
>  at 
> net.schmizz.sshj.connection.ConnectionImpl.handle(ConnectionImpl.java:130) at 
> net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:500) at 
> net.schmizz.sshj.transport.Decoder.decode(Decoder.java:102) at 
> net.schmizz.sshj.transport.Decoder.received(Decoder.java:170) at 
> net.schmizz.sshj.transport.Reader.run(Reader.java:59)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2784) Airavata unable to connect with the compute resource, comet.sdsc.edu

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2784.
---
Resolution: Fixed

> Airavata unable to connect with the compute resource, comet.sdsc.edu
> 
>
> Key: AIRAVATA-2784
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2784
> Project: Airavata
>  Issue Type: Bug
>  Components: GFac
>Affects Versions: 0.17
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.17
>
>
> # When the data staging-out task was initiated, the host was unreachable.
>  # Hence the data staging was not carried out.
>  # Error messages in log [1]
>  
> [1]
> 2018-05-07 13:41:31,486 [pool-11-thread-2311] INFO  
> o.a.a.g.c.context.TaskContext  - expId: 
> AC2_post_25_rtot_TAC_0013_ratio_056_sp_62b8889f-1a9e-4abc-bd71-030839773109, 
> processId: PROCESS_c91901c1-1d91-4249-88da-a5c0a6245965, taskId: 
> TASK_26fcbd23-0a58-43fd-a05d-0462f6f23273, type: DATA_STAGING : Task status 
> changed CREATED -> EXECUTING
> 2018-05-07 13:41:31,500 [pool-11-thread-2311] INFO  
> o.a.airavata.gfac.impl.Factory  - Session validation failed, key 
> :svuser_comet.sdsc.edu_22_f5c9e1fd-acee-43b6-b326-608b18e02aca
> 2018-05-07 13:41:31,500 [pool-11-thread-2311] INFO  
> o.a.airavata.gfac.impl.Factory  - Initialize a new SSH session for 
> :svuser_comet.sdsc.edu_22_f5c9e1fd-acee-43b6-b326-608b18e02aca
> 2018-05-07 13:41:34,549 [pool-11-thread-2311] ERROR 
> o.a.a.gfac.core.GFacException  - JSch initialization error
> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to 
> host (Host unreachable)
> at com.jcraft.jsch.Util.createSocket(Util.java:349)
> at com.jcraft.jsch.Session.connect(Session.java:215)
> at com.jcraft.jsch.Session.connect(Session.java:183)
> at 
> org.apache.airavata.gfac.impl.Factory.getSSHSession(Factory.java:537)
> at 
> org.apache.airavata.gfac.impl.task.ArchiveTask.execute(ArchiveTask.java:107)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTask(GFacEngineImpl.java:814)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.outputDataStaging(GFacEngineImpl.java:766)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTaskListFrom(GFacEngineImpl.java:362)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.continueProcess(GFacEngineImpl.java:721)
> at 
> org.apache.airavata.gfac.impl.GFacWorker.continueTaskExecution(GFacWorker.java:196)
> at org.apache.airavata.gfac.impl.GFacWorker.run(GFacWorker.java:96)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.NoRouteToHostException: No route to host (Host 
> unreachable)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> /AC2_post_25_rtot_TAC_0013_ratio_056_sp_62b8889f-1a9e-4abc-bd71-030839773109
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTask(GFacEngineImpl.java:814)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.outputDataStaging(GFacEngineImpl.java:766)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.executeTaskListFrom(GFacEngineImpl.java:362)
> at 
> org.apache.airavata.gfac.impl.GFacEngineImpl.continueProcess(GFacEngineImpl.java:721)
> at 
> org.apache.airavata.gfac.impl.GFacWorker.continueTaskExecution(GFacWorker.java:196)
> at org.apache.airavata.gfac.impl.GFacWorker.run(GFacWorker.java:96)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No 
> route to host (Host unreachable)
> at com.jcraft.jsch.Util.createSocket(Util.java:349)
> at com.jcraft.jsch.Session.connect(Session.java:215)
> at com.jcraft.jsch.Session.connect(Session.java:183)
> at 
> org.apache.airavata.gfac.impl.Factory.getSSHSession(Factory.java:537)
> ... 10 common frames omitted
> Caused by: java.net.NoRouteToHostException: No route to host (Host 
> unreachable)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at 
> 

[jira] [Resolved] (AIRAVATA-2786) Job COMPLETED but experiment failed with error message "unknown error occurred when initializing ..... "

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2786.
---
Resolution: Fixed

> Job COMPLETED but experiment failed with error message "unknown error 
> occurred when initializing . "
> 
>
> Key: AIRAVATA-2786
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2786
> Project: Airavata
>  Issue Type: Bug
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> Job completed and stderr is staged out.
> But experiment failed with error [1]
>  
> [1]
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 
> 897327b3-d45c-4f9a-a0c8-4f9a8a323ca0, Task 
> TASK_7c95f237-5002-4fac-9c5d-c9f5a8ac2c6e failed due to Unknown error while 
> running task TASK_7c95f237-5002-4fac-9c5d-c9f5a8ac2c6e, Error occurred while 
> initializing the task TASK_7c95f237-5002-4fac-9c5d-c9f5a8ac2c6e of experiment 
> Test1-US-Jetstream-iteration:26_8b7fe60e-cd90-498c-83b5-29776c3f0855 at 
> org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:102)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:313) 
> at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:82) at 
> org.apache.helix.task.TaskRunner.run(TaskRunner.java:71) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.airavata.helix.impl.task.TaskOnFailException: Error occurred while 
> initializing the task TASK_7c95f237-5002-4fac-9c5d-c9f5a8ac2c6e of experiment 
> Test1-US-Jetstream-iteration:26_8b7fe60e-cd90-498c-83b5-29776c3f0855 at 
> org.apache.airavata.helix.impl.task.AiravataTask.loadContext(AiravataTask.java:379)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:307) 
> ... 9 more Caused by: RegistryServiceException(message:Error while retrieving 
> application interface. More info : 
> org.apache.airavata.registry.cpi.AppCatalogException: 
> org.apache.openjpa.persistence.InvalidStateException: Can only perform 
> operation while a transaction is active.) at 
> org.apache.airavata.registry.api.RegistryService$getApplicationInterface_result$getApplicationInterface_resultStandardScheme.read(RegistryService.java)
>  at 
> org.apache.airavata.registry.api.RegistryService$getApplicationInterface_result$getApplicationInterface_resultStandardScheme.read(RegistryService.java)
>  at 
> org.apache.airavata.registry.api.RegistryService$getApplicationInterface_result.read(RegistryService.java)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:89) at 
> org.apache.airavata.registry.api.RegistryService$Client.recv_getApplicationInterface(RegistryService.java:4686)
>  at 
> org.apache.airavata.registry.api.RegistryService$Client.getApplicationInterface(RegistryService.java:4673)
>  at 
> org.apache.airavata.helix.impl.task.TaskContext$TaskContextBuilder.build(TaskContext.java:763)
>  at 
> org.apache.airavata.helix.impl.task.AiravataTask.loadContext(AiravataTask.java:374)
>  ... 10 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2750) Helix Participant is not picking up tasks after a restart

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2750.
---
Resolution: Fixed

> Helix Participant is not picking up tasks after a restart
> -
>
> Key: AIRAVATA-2750
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2750
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> Helix Participant was restarted due to an OOM issue, and then it did not pick up 
> any tasks. Changing the participant name fixed that. Controller log:
>  
> 2018-04-11 19:17:41,850 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,850 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_e14813b1-a93b-47c8-9faa-634b3cdf47b7-POST-f9e7f2c1-e3af-4f46-8740-b71289e23270_TASK_70f5baae-6e11-4448-9962-e7a964cdff37
>  new assignment []
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> resource:Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - resource 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
>  use idealStateRebalancer org.apache.helix.task.JobRebalancer
> 2018-04-11 19:17:41,859 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Computer Best Partition for job: 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
> 2018-04-11 19:17:41,860 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,861 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,871 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_c3fa99be-557a-4c25-bbb7-d4bada5d0ede-PRE-06933b15-fb89-48b9-8501-3bd4a20a1a5f_TASK_ab90e04f-a4d6-4ead-b81c-f021748f4179
>  new assignment []
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> resource:Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - resource 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
>  use idealStateRebalancer org.apache.helix.task.JobRebalancer
> 2018-04-11 19:17:41,872 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Computer Best Partition for job: 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
> 2018-04-11 19:17:41,873 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - All partitions: [0] taskAssignment: 
> \{helixparticipant=[]} excludedInstances: []
> 2018-04-11 19:17:41,873 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Throttle tasks to be assigned to instance 
> helixparticipant using limitation: Job Concurrent Task(1), Participant Max 
> Task(40). Remaining capacity -8.
> 2018-04-11 19:17:41,884 [GenericHelixController-event_process] DEBUG 
> o.a.helix.task.JobRebalancer - Job 
> Workflow_of_process_PROCESS_5b71bc64-49f9-4bf5-801d-359dc35f58ef-POST-54334da3-d6b8-4d9f-b956-9fd943290d66_TASK_0f141d85-8633-470e-81bb-5158bf8e2ad9
>  new assignment []
> 2018-04-11 19:17:41,884 [GenericHelixController-event_process] DEBUG 
> o.a.h.c.s.BestPossibleStateCalcStage - Processing 
> 

[jira] [Closed] (AIRAVATA-2783) Gateway output file (.tar.gz) not existing when staging out but in real it exists in the working directory

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2783.
-
Resolution: Fixed

Closed, as this is no longer an issue since we are deprecating GFac.

> Gateway output file (.tar.gz) not existing when staging out but in real it 
> exists in the working directory
> --
>
> Key: AIRAVATA-2783
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2783
> Project: Airavata
>  Issue Type: Bug
>  Components: GFac
>Affects Versions: 0.17
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.17
>
>
> # Once the job completed and the email was sent, the job was updated with its status.
>  # Output file transfer initiates.
>  # When trying to transfer the .tar.gz, we get error [1]
>  # In reality the tar.gz file exists in the working directory, and it is brought 
> back in ARCHIVE.
> [1]
> 2018-05-07 02:50:25,808 [pool-11-thread-2276] INFO  
> o.a.a.g.i.t.SCPDataStageTask  - Fetching output files for wildcard *.tar.gz 
> in path 
> /oasis/scratch/comet/svuser/temp_project/simvascular_workdirs/PROCESS_548730f9-2a6b-4d4a-a6c7-7b1556895c97
> 2018-05-07 02:50:26,196 [pool-11-thread-2276] WARN  
> o.a.a.g.impl.HPCRemoteCluster  - No matching file found 
> for extension: *.tar.gz in the 
> /oasis/scratch/comet/svuser/temp_project/simvascular_workdirs/PROCESS_548730f9-2a6b-4d4a-a6c7-7b1556895c97
>  directory
> 2018-05-07 02:50:26,196 [pool-11-thread-2276] INFO  
> o.a.a.g.i.t.SCPDataStageTask  - File names that matched with wildcard 
> *.tar.gz : []



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2689) Distributed email clients to improve email monitoring

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2689.
---
Resolution: Fixed

Fixed as part of the new Helix implementation. Job monitors were taken out of 
the core execution logic. 
https://github.com/apache/airavata/tree/develop/modules/job-monitor/email-monitor
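The issue description below outlines the design: a standalone client reads the 
scheduler emails, parses them into a standard status message, and publishes that 
message to a queue for the execution engine to consume. The sketch below only 
illustrates that flow with hypothetical types and an in-memory queue standing in 
for RabbitMQ/Kafka; it is not the code in the module linked above.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EmailMonitorSketch {

    // Standardized status message the task execution engine would consume.
    static class JobStatusMessage {
        final String jobId;
        final String status;

        JobStatusMessage(String jobId, String status) {
            this.jobId = jobId;
            this.status = status;
        }
    }

    // Stand-in for an application-specific parser (SLURM, PBS, ... emails differ).
    interface EmailParser {
        JobStatusMessage parse(String subject, String body);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<JobStatusMessage> queue = new LinkedBlockingQueue<>();

        // Toy parser for a SLURM-style completion email.
        EmailParser slurmParser = (subject, body) -> {
            String jobId = subject.replaceAll("\\D+", "");   // keep only the numeric job id
            String status = subject.contains("COMPLETED") ? "COMPLETED" : "UNKNOWN";
            return new JobStatusMessage(jobId, status);
        };

        // Monitor side (one iteration shown): read email -> parse -> publish.
        JobStatusMessage message = slurmParser.parse("SLURM Job_id=9839 Name=exp COMPLETED", "");
        queue.put(message);

        // Consumer side (the execution engine) takes messages off the queue.
        JobStatusMessage consumed = queue.take();
        System.out.println("job " + consumed.jobId + " -> " + consumed.status);
    }
}

Running more than one such client then mainly requires ensuring that only one of 
them marks a given email as read, which is the coordination requirement called 
out in the issue below.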

> Distributed email clients to improve email monitoring 
> --
>
> Key: AIRAVATA-2689
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2689
> Project: Airavata
>  Issue Type: New Feature
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>  Labels: HackIllinois2018
>
> Once Airavata submits a job to a compute resource, the scheduler on the compute 
> resource sends emails about the status of the job. The content of the email 
> differs for each application type, so we have written a set of parsers [2] 
> which can extract the correct information from email messages. Airavata has an 
> email monitoring system which reads those emails, parses them, and performs the 
> necessary actions depending on the content of the emails. However, this email 
> monitoring system is tightly coupled to the task execution engine, so we 
> can't easily replicate it to achieve high availability.
> The idea is to come up with a standalone email monitoring client that reads 
> emails from a given email account, parses them, and converts them into a standard 
> message format. Once a message is parsed into the known message format, it is put 
> into a queue (RabbitMQ, Kafka) in order to be consumed by the task execution 
> engine. There are a few non-functional requirements:
>  # To improve availability, we need more than one monitoring 
> client running at a given time. However, we need to make sure that 
> exactly one client consumes a given email, so we need coordination among the 
> email clients.
>  #  In the future, this will be deployed as a microservice, so the final packaging 
> should be compatible with Docker.
> The current email monitor implementation is [1]. The set of parsers, which vary 
> depending on the application, is [2].
> [1] 
> [https://github.com/apache/airavata/blob/master/modules/gfac/gfac-impl/src/main/java/org/apache/airavata/gfac/monitor/email/EmailBasedMonitor.java]
> [2] 
> https://github.com/apache/airavata/tree/master/modules/gfac/gfac-impl/src/main/java/org/apache/airavata/gfac/monitor/email/parser



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2386) Fix issues with email monitoring

2018-09-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2386.
---
Resolution: Fixed

The new job monitors run based on a state model, so the ordering of the 
emails is not relevant.

https://github.com/apache/airavata/tree/develop/modules/job-monitor/email-monitor
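Because scheduler emails can arrive out of order, a state model that only lets a 
job move forward makes a late "ACTIVE" email after a "COMPLETED" one harmless. The 
sketch below is only an illustration of that idea with made-up state ranks, not 
the monitor's actual state model.

import java.util.HashMap;
import java.util.Map;

public class JobStateModelSketch {

    // Higher rank = further along in the job lifecycle; terminal states share a rank.
    enum JobState {
        SUBMITTED(0), QUEUED(1), ACTIVE(2), COMPLETED(3), FAILED(3);

        final int rank;
        JobState(int rank) { this.rank = rank; }
    }

    private final Map<String, JobState> current = new HashMap<>();

    // Applies an update only if it moves the job forward in the lifecycle,
    // so out-of-order or duplicate emails cannot roll a job back.
    boolean apply(String jobId, JobState incoming) {
        JobState existing = current.get(jobId);
        if (existing == null || incoming.rank > existing.rank) {
            current.put(jobId, incoming);
            return true;    // update accepted
        }
        return false;       // stale or duplicate email, ignored
    }

    public static void main(String[] args) {
        JobStateModelSketch model = new JobStateModelSketch();
        System.out.println(model.apply("9839", JobState.COMPLETED)); // true
        System.out.println(model.apply("9839", JobState.ACTIVE));    // false: late email ignored
    }
}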

> Fix issues with email monitoring
> 
>
> Key: AIRAVATA-2386
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2386
> Project: Airavata
>  Issue Type: Task
>  Components: Airavata System, GFac
>Affects Versions: 0.17
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
> Attachments: jobstatus.ps
>
>
> There are a few issues with email monitoring, and the task is to collect 
> and fix them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2942) Experiment cancelation request was not processed in Helix

2018-11-16 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689597#comment-16689597
 ] 

Dimuthu Upeksha commented on AIRAVATA-2942:
---

Fixed in

[https://github.com/apache/airavata/commit/3e879e4325fba582186acd8133552ea6f8542cb6]

[https://github.com/apache/airavata/commit/e642e7f1e42072bbe63dbd3e357acc8af2a8fa74]

[https://github.com/apache/airavata/commit/bb7d3ca0770a1a0301d975e30e7eb37f9834b7c4]

[https://github.com/apache/airavata/commit/23631de8140ef0d74c95cfaf5eeb4228e800a037]

https://github.com/apache/airavata/commit/3f68a88febe5c7997d2c2627fd720fbdda278494

> Experiment cancelation request was not processed in Helix 
> --
>
> Key: AIRAVATA-2942
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2942
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org - exp ID: 
> US3-ADEV_4e8bdbb7-20f9-4dcf-82be-e2c581b651b6
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> The experiment was launched and the job was submitted to a Slurm cluster. After 
> nearly 4 hours the user sent a cancel request, as the job had been queued for a 
> long time. The cancel request came in on the same day, but it was not processed. 
> Hence the experiment was left as CANCELING without being processed and 
> cancelled.
> This issue was seen occurring in multiple HPC clusters with the gateway.
> NOTE: the email shows that this was queued for nearly a day, but there is no 
> job completion email in the mailbox.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2942) Experiment cancelation request was not processed in Helix

2018-11-16 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2942.
---
Resolution: Fixed

> Experiment cancelation request was not processed in Helix 
> --
>
> Key: AIRAVATA-2942
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2942
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org - exp ID: 
> US3-ADEV_4e8bdbb7-20f9-4dcf-82be-e2c581b651b6
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> The experiment was launched and the job was submitted to a Slurm cluster. After 
> nearly 4 hours the user sent a cancel request, as the job had been queued for a 
> long time. The cancel request came in on the same day, but it was not processed. 
> Hence the experiment was left as CANCELING without being processed and 
> cancelled.
> This issue was seen occurring in multiple HPC clusters with the gateway.
> NOTE: the email shows that this was queued for nearly a day, but there is no 
> job completion email in the mailbox.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2956) Possible race condition in job monitoring

2018-11-24 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2956:
-

 Summary: Possible race condition in job monitoring
 Key: AIRAVATA-2956
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2956
 Project: Airavata
  Issue Type: Bug
  Components: helix implementation
Reporter: Dimuthu Upeksha


When the job submission task submits a job to a compute resource, it returns a 
job id which is then saved in a zookeeper path for post workflow execution. But 
in some cases the job completes before that metadata is saved in zookeeper, and 
the post workflow then fails.

2018-11-21 18:15:55,783 [main] INFO  o.a.a.h.i.w.PostWorkflowManager  - 
Processing job result of job id 9839 sent by EmailBasedProducer
2018-11-21 18:15:55,785 [main] WARN  o.a.a.h.i.w.PostWorkflowManager  - Could 
not find a monitoring register for job id 9839
2018-11-21 18:15:55,785 [main] INFO  o.a.a.h.i.w.PostWorkflowManager  - Status 
of processing 9839 : false



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2956) Possible race condition in job monitoring

2018-11-25 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2956.
---
Resolution: Fixed

Added validation logic into AbstractParser before putting a job status into the 
job status queue.

If the validation fails, the Email Monitor keeps the emails unread for a given 
period of time so that the status can be retried later.
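
A minimal sketch of that validation-before-enqueue idea, not the actual 
Airavata change; ValidatingStatusPublisher, MonitoringRegistry and StatusQueue 
are hypothetical placeholders. The publisher only enqueues a status once the 
job id is known to the monitoring register; otherwise the caller leaves the 
email unread so a later poll can retry, giving up only after a retry window.

{code:java}
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of validating a job status before it enters the job
// status queue; not the actual AbstractParser change.
public class ValidatingStatusPublisher {

    private final MonitoringRegistry registry;  // assumed lookup of job ids registered for monitoring
    private final StatusQueue statusQueue;      // assumed queue consumed by the post workflow manager
    private final Duration retryWindow;

    public ValidatingStatusPublisher(MonitoringRegistry registry, StatusQueue statusQueue,
                                     Duration retryWindow) {
        this.registry = registry;
        this.statusQueue = statusQueue;
        this.retryWindow = retryWindow;
    }

    /**
     * Returns true if the email carrying this status can be marked as read,
     * false if the monitor should leave it unread and retry on the next poll.
     */
    public boolean publish(String jobId, String status, Instant emailReceivedAt) {
        if (registry.isRegistered(jobId)) {
            statusQueue.enqueue(jobId, status);   // safe: the post workflow will find the job
            return true;
        }
        // Job id not yet persisted by the submission task. Keep the email unread
        // (return false) until the retry window elapses, then give up on it.
        return Duration.between(emailReceivedAt, Instant.now()).compareTo(retryWindow) > 0;
    }

    // Minimal collaborator interfaces so the sketch is self-contained.
    public interface MonitoringRegistry { boolean isRegistered(String jobId); }
    public interface StatusQueue { void enqueue(String jobId, String status); }
}
{code}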

> Possible race condition in job monitoring
> -
>
> Key: AIRAVATA-2956
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2956
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Reporter: Dimuthu Upeksha
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> When the job submission task submits a job to a compute resource, it returns 
> a job id which is then saved in a zookeeper path for post workflow execution. 
> But in some cases the job completes before that metadata is saved in 
> zookeeper, and the post workflow then fails.
> 2018-11-21 18:15:55,783 [main] INFO  o.a.a.h.i.w.PostWorkflowManager  - 
> Processing job result of job id 9839 sent by EmailBasedProducer
> 2018-11-21 18:15:55,785 [main] WARN  o.a.a.h.i.w.PostWorkflowManager  - Could 
> not find a monitoring register for job id 9839
> 2018-11-21 18:15:55,785 [main] INFO  o.a.a.h.i.w.PostWorkflowManager  - 
> Status of processing 9839 : false



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2962) Issue with experiment cancelation request prior to job submission

2018-12-19 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725171#comment-16725171
 ] 

Dimuthu Upeksha commented on AIRAVATA-2962:
---

Fixed in 

[https://github.com/apache/airavata/commit/a6ef239695f87c8d9ebf97ff6977b335e6e9820a]

https://github.com/apache/airavata/commit/711c2d77ccdac6ad4a5a861bb178ec3c0a274fe0

> Issue with experiment cancelation request prior to job submission
> -
>
> Key: AIRAVATA-2962
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2962
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> The cancelation was not moving forward and the experiment was stuck in 
> CANCELING. Helix timed out while waiting to stop the pre workflow for 
> cancelation.
> Error in participant log:
> 2018-12-18 18:52:44,788 [TaskStateModelFactory-task_thread] ERROR 
> o.a.a.h.i.t.c.WorkflowCancellationTask - Failed to stop workflow 
> Workflow_of_process_PROCESS_59bc08eb-23c8-487e-9ea3-c7de8b38fdd1-PRE-21e29feb-4a09-46ac-a504-1b9f5ee3a483
> org.apache.helix.HelixException: Workflow 
> "Workflow_of_process_PROCESS_59bc08eb-23c8-487e-9ea3-c7de8b38fdd1-PRE-21e29feb-4a09-46ac-a504-1b9f5ee3a483"
>  context is empty or not in states: 
> "[Lorg.apache.helix.task.TaskState;@522e1107"
>  at org.apache.helix.task.TaskDriver.pollForWorkflowState(TaskDriver.java:700)
>  at 
> org.apache.airavata.helix.impl.task.cancel.WorkflowCancellationTask.onRun(WorkflowCancellationTask.java:71)
>  at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:92)
>  at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
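
The stack trace above comes from TaskDriver.pollForWorkflowState being called 
for a pre workflow that never recorded a context. A hedged sketch of a more 
defensive cancellation, using only the Helix TaskDriver APIs that appear in the 
trace (getWorkflowContext, stop, pollForWorkflowState); the surrounding class 
is illustrative and is not taken from the referenced commits.

{code:java}
import org.apache.helix.HelixException;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.TaskState;
import org.apache.helix.task.WorkflowContext;

// Illustrative guard around Helix workflow cancellation (not the actual
// WorkflowCancellationTask implementation).
public class SafeWorkflowCanceller {

    private final TaskDriver taskDriver;

    public SafeWorkflowCanceller(TaskDriver taskDriver) {
        this.taskDriver = taskDriver;
    }

    public void cancel(String workflowName) {
        WorkflowContext context = taskDriver.getWorkflowContext(workflowName);
        if (context == null) {
            // The pre workflow was registered but never ran, so there is no
            // state to wait for; request a stop and return immediately.
            taskDriver.stop(workflowName);
            return;
        }
        try {
            taskDriver.stop(workflowName);
            // Wait until Helix reports a terminal state for the workflow.
            taskDriver.pollForWorkflowState(workflowName, 30_000,
                    TaskState.STOPPED, TaskState.COMPLETED, TaskState.FAILED, TaskState.ABORTED);
        } catch (HelixException e) {
            // Context disappeared or never reached a target state in time;
            // treat it as already cancelled rather than failing the whole task.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}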



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRAVATA-2962) Issue with experiment cancelation request prior to job submission

2018-12-19 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha resolved AIRAVATA-2962.
---
Resolution: Fixed

> Issue with experiment cancelation request prior to job submission
> -
>
> Key: AIRAVATA-2962
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2962
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> The cancelation was not moving forward and the experiment was stuck in 
> CANCELING. Helix timed out while waiting to stop the pre workflow for 
> cancelation.
> Error in participant log:
> 2018-12-18 18:52:44,788 [TaskStateModelFactory-task_thread] ERROR 
> o.a.a.h.i.t.c.WorkflowCancellationTask - Failed to stop workflow 
> Workflow_of_process_PROCESS_59bc08eb-23c8-487e-9ea3-c7de8b38fdd1-PRE-21e29feb-4a09-46ac-a504-1b9f5ee3a483
> org.apache.helix.HelixException: Workflow 
> "Workflow_of_process_PROCESS_59bc08eb-23c8-487e-9ea3-c7de8b38fdd1-PRE-21e29feb-4a09-46ac-a504-1b9f5ee3a483"
>  context is empty or not in states: 
> "[Lorg.apache.helix.task.TaskState;@522e1107"
>  at org.apache.helix.task.TaskDriver.pollForWorkflowState(TaskDriver.java:700)
>  at 
> org.apache.airavata.helix.impl.task.cancel.WorkflowCancellationTask.onRun(WorkflowCancellationTask.java:71)
>  at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:92)
>  at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2943.
-
Resolution: Fixed

> Re-queueing and node failures in HPC clusters need to be handled in gateway 
> middleware as resubmitting failures 
> 
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 
> in Jetstream
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Currently in clusters (PBS and SLURM) jobs are getting re-queued due to node 
> failures. In such scenarios the jobs are executed after re-queueing, but on 
> the gateway side the job is taken as FAILED at the initial NODE_FAIL. 
> These types of failures need to be captured as retryable failures instead of 
> being taken as an end result.
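
A hypothetical sketch of the kind of state mapping this asks for; none of the 
names come from the Airavata code base. Re-queue indicators such as SLURM's 
NODE_FAIL and REQUEUED map to a non-terminal state so the monitor keeps 
watching the job instead of failing the experiment.

{code:java}
// Hypothetical mapping from scheduler-reported states to gateway job states.
public final class SchedulerStateMapper {

    public enum GatewayJobState { QUEUED, ACTIVE, RESUBMITTING, COMPLETED, FAILED, CANCELED }

    public static GatewayJobState fromSlurm(String slurmState) {
        switch (slurmState) {
            case "PENDING":
                return GatewayJobState.QUEUED;
            case "RUNNING":
                return GatewayJobState.ACTIVE;
            case "NODE_FAIL":      // node crashed; the scheduler usually re-runs the job
            case "REQUEUED":       // job was put back in the queue
                return GatewayJobState.RESUBMITTING;   // non-terminal: keep monitoring
            case "COMPLETED":
                return GatewayJobState.COMPLETED;
            case "CANCELLED":
                return GatewayJobState.CANCELED;
            default:
                return GatewayJobState.FAILED;          // terminal only for genuine failures
        }
    }

    private SchedulerStateMapper() { }
}
{code}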



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

2019-03-01 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782158#comment-16782158
 ] 

Dimuthu Upeksha commented on AIRAVATA-2943:
---

Fixed in 
https://github.com/apache/airavata/commit/8b10120be4ce1d0720f214dc5e849d1dc862c595

> Re-queueing and node failures in HPC clusters need to be handled in gateway 
> middleware as resubmitting failures 
> 
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 
> in Jetstream
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Currently in clusters (PBS and SLURM) jobs are getting re-queued due to node 
> failures. In such scenarios the jobs are executed after re-queueing, but on 
> the gateway side the job is taken as FAILED at the initial NODE_FAIL. 
> These types of failures need to be captured as retryable failures instead of 
> being taken as an end result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2974) Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED

2019-03-01 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782149#comment-16782149
 ] 

Dimuthu Upeksha commented on AIRAVATA-2974:
---

Fixed in 
https://github.com/apache/airavata/commit/039f9a2cdb7f4c7bfad0aa846fe160d478e59644

> Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED 
> --
>
> Key: AIRAVATA-2974
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2974
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Cancelled an experiment where the job was already executed and COMPLETE. When 
> the experiment status changed to CANCELED, so did the status of the job.
> Since the job was already COMPLETE and the SUs were used, the job status 
> should not have changed to CANCELED. It should have remained as COMPLETE.
> exp ID: SLM002-AmberSander-Comet23_88570cbf-cdf3-4b73-aba7-0d2bf6a9a2d5
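
The desired behaviour amounts to a guard on the job status transition; a 
minimal hedged sketch with hypothetical names, not the referenced commit.

{code:java}
// Hypothetical guard: when an experiment is cancelled, only jobs that are not
// already in a terminal state should be re-tagged as CANCELED.
public final class JobCancellationGuard {

    public enum JobState { SUBMITTED, QUEUED, ACTIVE, COMPLETE, FAILED, CANCELED }

    public static JobState onExperimentCancel(JobState current) {
        switch (current) {
            case COMPLETE:
            case FAILED:
            case CANCELED:
                return current;           // terminal states are preserved
            default:
                return JobState.CANCELED; // only in-flight jobs are marked cancelled
        }
    }

    private JobCancellationGuard() { }
}
{code}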



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2974) Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2974.
-
Resolution: Fixed

> Even COMPLETE jobs are tagged as CANCELED when the experiment is CANCELED 
> --
>
> Key: AIRAVATA-2974
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2974
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Cancelled an experiment where the job was already executed and COMPLETE. When 
> the experiment status changed to CANCELED, so did the status of the job.
> Since the job was already COMPLETE and the SUs were used, the job status 
> should not have changed to CANCELED. It should have remained as COMPLETE.
> exp ID: SLM002-AmberSander-Comet23_88570cbf-cdf3-4b73-aba7-0d2bf6a9a2d5



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2973) Helix submitting two jobs; both at the same time for a single experiment

2019-03-01 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782154#comment-16782154
 ] 

Dimuthu Upeksha commented on AIRAVATA-2973:
---

Fixed in 
https://github.com/apache/airavata/commit/0f0a52afadcb9bc33439cfb6be4ceb062a01ebfa

> Helix submitting two jobs; both at the same time for a single experiment
> 
>
> Key: AIRAVATA-2973
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2973
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org 
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Launched an experiment and the experiment has two jobs. Both jobs were 
> created at the same time; they both have the same CREATION time. When the 
> experiment was cancelled, both got tagged as CANCELLED.
> exp ID: SLM002-AmberSander-Comet9_02a8cf12-75ad-4820-991f-d593ce832945
> The double job submission is random.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2963) Cannot login to testing gateway portal and also getting an error in create experiment.

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2963.
-
Resolution: Fixed

> Cannot login to testing gateway portal and also getting an error in create 
> experiment.
> --
>
> Key: AIRAVATA-2963
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2963
> Project: Airavata
>  Issue Type: Bug
>  Components: PGA PHP Web Gateway
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # When the username and password are entered, the exception [1] is thrown.
>  # When the exception page is refreshed the user is on the home page, and 
> when 'Create' is clicked under Experiment the second exception [2] is thrown.
> [1]UserProfileServiceException
> Error while creating user profile. More info : Failed to update user profile 
> in IAM service
>  
> [2]ErrorException
> Invalid argument supplied for foreach() (View: 
> /var/www/portals/seagrid/app/views/experiment/create.blade.php)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2973) Helix submitting two jobs; both at the same time for a single experiment

2019-03-01 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2973.
-
Resolution: Fixed

> Helix submitting two jobs; both at the same time for a single experiment
> 
>
> Key: AIRAVATA-2973
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2973
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://testing.seagrid.org 
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> Launched an experiment and the experiment has two jobs. Both jobs were 
> created at the same time; they both have the same CREATION time. When the 
> experiment was cancelled, both got tagged as CANCELLED.
> exp ID: SLM002-AmberSander-Comet9_02a8cf12-75ad-4820-991f-d593ce832945
> The double job submission is random.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2993) Hold the execution of Helix components when that API Server is not responding

2019-03-07 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2993:
-

 Summary: Hold the execution of Helix components when that API 
Server is not responding
 Key: AIRAVATA-2993
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2993
 Project: Airavata
  Issue Type: Improvement
  Components: helix implementation
Affects Versions: 0.17
Reporter: Dimuthu Upeksha
Assignee: Dimuthu Upeksha


When the API server is down or unreachable from the Helix components, there is 
no point in the Helix components running tasks, as those tasks will eventually 
fail. So we need to pause the Helix controller, Helix participant, post 
workflow manager, pre workflow manager and monitors once they detect that the 
API server is not responding.
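
A minimal sketch of one way such a pause could work, with all names 
hypothetical and the health check left abstract (for example a Thrift ping to 
the API server): components call awaitApiServer() before picking up new work 
and block while the server is unreachable.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical pause gate: Helix-side components call awaitApiServer() before
// processing a message, so work is held while the API server is unreachable.
public class ApiServerGate {

    private final BooleanSupplier healthCheck;   // e.g. an API-server ping supplied by the caller
    private volatile boolean available = true;
    private final Object lock = new Object();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public ApiServerGate(BooleanSupplier healthCheck, long pollSeconds) {
        this.healthCheck = healthCheck;
        scheduler.scheduleWithFixedDelay(this::refresh, 0, pollSeconds, TimeUnit.SECONDS);
    }

    private void refresh() {
        boolean ok = healthCheck.getAsBoolean();
        synchronized (lock) {
            available = ok;
            if (ok) {
                lock.notifyAll();   // wake up components waiting for the API server
            }
        }
    }

    /** Blocks the calling component until the API server responds again. */
    public void awaitApiServer() throws InterruptedException {
        synchronized (lock) {
            while (!available) {
                lock.wait();
            }
        }
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
{code}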



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-2999) [GSoC] Administration ashboard for Airavata Services

2019-03-21 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-2999:
-

 Summary: [GSoC] Administration ashboard for Airavata Services
 Key: AIRAVATA-2999
 URL: https://issues.apache.org/jira/browse/AIRAVATA-2999
 Project: Airavata
  Issue Type: New Feature
Reporter: Dimuthu Upeksha


A typical Apache Airavata deployment consists of multiple microservices (API 
Server, Participant, Controller, Pre Workflow Manager, Post Workflow Manager, 
Job Monitors, etc.) and several other services (Database, Kafka, RabbitMQ, 
Keycloak, Zookeeper, Apache Helix). As it is a deployment with multiple 
components, when an issue comes up it is time consuming to find which 
component is having the problem. So we need an Administration Dashboard which 
can visualize the system health and give administrators a handle to control 
those services, such as stopping or restarting each component through the 
dashboard.

Additionally, this dashboard should be able to authenticate users through 
Keycloak, which is the identity provider for Airavata, and only system 
administrators should be given access to those operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRAVATA-2999) [GSoC] Administration Dashboard for Airavata Services

2019-03-21 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha updated AIRAVATA-2999:
--
Description: 
Typical Apache Airavata deployment consists of multiple microservices (API 
Server, Participant, Controller, Pre Workflow Manager, Post Workflow Manager, 
Job Monitors and etc) and several other services (Database, Kafka, RabbitMQ, 
Keycloak, Zookeeper, Apache Helix). As it is a deployment with multiple 
components, when it comes to an issue,  it is time consuming to find which 
component is having the problem. So we need an Administration Dashboard which 
can visualize the system health and provide some handle to administrators to 
control those services like stopping or restarting each component through the 
dashboard.

Additionally, this dashboard should be able to authenticate users through 
Keycloak which is the identity provider for Airavata and  only system 
administrators should be given access to those operations.

  was:
Typical Apache Airavata deployment consists of multiple microservices (API 
Server, Participant, Controller, Pre Workflow Manager, Post Workflow Manager, 
Job Monitors and etc) and several other services (Database, Kafka, RabbitMQ, 
Keycloak, Zookeeper, Apache Helix). As it is a deployment with multiple 
components, when it comes to an issue,  it is time consuming to find which 
component is having the problem. So we need an Administration Dashboard which 
can visualize the system health and provide some handle to Administrators to 
control those services like stopping or restarting each component through the 
dashboard.

Additionally, this dashboard should be able to Authenticate users through 
Keycloak which is the identity provider for Airavata and  only system 
administrators should be given access to those operations.

Summary: [GSoC] Administration Dashboard for Airavata Services  (was: 
[GSoC] Administration ashboard for Airavata Services)

> [GSoC] Administration Dashboard for Airavata Services
> -
>
> Key: AIRAVATA-2999
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2999
> Project: Airavata
>  Issue Type: New Feature
>Reporter: Dimuthu Upeksha
>Priority: Major
>
> Typical Apache Airavata deployment consists of multiple microservices (API 
> Server, Participant, Controller, Pre Workflow Manager, Post Workflow Manager, 
> Job Monitors and etc) and several other services (Database, Kafka, RabbitMQ, 
> Keycloak, Zookeeper, Apache Helix). As it is a deployment with multiple 
> components, when it comes to an issue,  it is time consuming to find which 
> component is having the problem. So we need an Administration Dashboard which 
> can visualize the system health and provide some handle to administrators to 
> control those services like stopping or restarting each component through the 
> dashboard.
> Additionally, this dashboard should be able to authenticate users through 
> Keycloak which is the identity provider for Airavata and  only system 
> administrators should be given access to those operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRAVATA-3000) [GSoC] Refactor parser framework into a generic workflow framework

2019-03-21 Thread Dimuthu Upeksha (JIRA)
Dimuthu Upeksha created AIRAVATA-3000:
-

 Summary: [GSoC] Refactor parser framework into a generic workflow 
framework
 Key: AIRAVATA-3000
 URL: https://issues.apache.org/jira/browse/AIRAVATA-3000
 Project: Airavata
  Issue Type: New Feature
Reporter: Dimuthu Upeksha


This is an extension of the work we did in GSoC 2018. Yasas worked on 
developing a workflow framework for Airavata; his mailing list discussions and 
Medium post can be found at [1] [2]. Based on his research and another GSoC 
project on integrating a parser framework into Airavata, we have developed a 
new parser framework [3] [4] [5] [6] which uses the Apache Helix task 
framework [7] as the task execution engine. 

Even though this framework was developed specifically for parsers, we feel 
that the implementation shares many features with the workflow design Yasas 
worked on. So in this project we need to generalize the parser framework into 
a generic workflow framework where we can create and launch workflows with any 
application registered in Airavata.

The student should be comfortable with Java and distributed systems concepts.

Key Expectations
 # Refactor/improve parser data models into workflow data models
 # Refactor/improve parser tables into workflow database tables
 # Implement new Helix tasks to support the workflow operations mentioned in [2]
 # Refactor/improve the parser API into a workflow-level API [5]
 # Demonstrate that we can still support the parser workflows in the newly 
created generic workflow engine (this will be the final demo). 

[1] 
[https://medium.com/@yasgun/gsoc-2018-with-apache-airavata-user-defined-airavata-workflows-39f0e79234ee]

[2] [http://mail-archives.apache.org/mod_mbox/airavata-dev/201806.mbox/browser]

[3] 
[https://github.com/apache/airavata/tree/develop/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/task/parsing]

[4] 
[https://github.com/apache/airavata/blob/develop/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/workflow/ParserWorkflowManager.java]

[5] 
[https://github.com/apache/airavata/blob/develop/thrift-interface-descriptions/airavata-apis/airavata_api.thrift#L3505]

[6] 
https://github.com/apache/airavata/blob/develop/thrift-interface-descriptions/data-models/app-catalog-models/parser_model.thrift

[7] [https://helix.apache.org/0.8.4-docs/tutorial_task_framework.html]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-3051) Output files configured with wildcard is not brought back and displayed in the Django summary page.

2019-06-02 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854145#comment-16854145
 ] 

Dimuthu Upeksha commented on AIRAVATA-3051:
---

Fixed in 
[https://github.com/apache/airavata/commit/d4a209d042dd36fe9f86a706e783c3c3ee8a6ec8]

> Output files configured with wildcard is not brought back and displayed in 
> the Django summary page.
> ---
>
> Key: AIRAVATA-3051
> URL: https://issues.apache.org/jira/browse/AIRAVATA-3051
> Project: Airavata
>  Issue Type: Sub-task
>  Components: Django Portal
>Affects Versions: 0.18
> Environment: https://beta-sciencegateway.brylinski.org
>Reporter: Eroma
>Assignee: Marcus Christie
>Priority: Major
> Attachments: Screen Shot 2019-05-30 at 8.40.01 AM.png, Screen Shot 
> 2019-05-30 at 8.40.08 AM.png, Screen Shot 2019-05-30 at 8.42.46 AM.png
>
>
> # Submitted a job with an application where there are 5 output files.
>  # The defined outputs are STDERR, STDOUT, two specifically defined (full 
> output file name given) tar files and another tar file defined with a 
> wildcard (*.tar).
>  # The job was successfully completed and all the files are available in the 
> working directory.
>  # In the PGA storage only the files defined with the complete name were 
> brought back.
>  # STDOUT, STDERR and the wildcard output file are not in the storage.
>  # Hence they are not displayed or downloadable in the summary page.
>  # Apart from this, in the output section of the summary page the paths to 
> all the .tar files are given as a string.
>  # Please see all the images attached to this.
>  # exp ID: 
> eFindSite_on_May_30,_2019_1:38_AM_85ab2ab9-a1c0-427c-9fb3-ccde9facd8f1
> h5. TODO
> * [ ] Don't display an experiment output entry when the type is URI but the 
> value is not a data product URI



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (AIRAVATA-3051) Output files configured with wildcard is not brought back and displayed in the Django summary page.

2019-06-02 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha reassigned AIRAVATA-3051:
-

Assignee: Dimuthu Upeksha  (was: Marcus Christie)

> Output files configured with wildcard is not brought back and displayed in 
> the Django summary page.
> ---
>
> Key: AIRAVATA-3051
> URL: https://issues.apache.org/jira/browse/AIRAVATA-3051
> Project: Airavata
>  Issue Type: Sub-task
>  Components: Django Portal
>Affects Versions: 0.18
> Environment: https://beta-sciencegateway.brylinski.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Attachments: Screen Shot 2019-05-30 at 8.40.01 AM.png, Screen Shot 
> 2019-05-30 at 8.40.08 AM.png, Screen Shot 2019-05-30 at 8.42.46 AM.png
>
>
> # Submitted a job with an application where there are 5 output files.
>  # The defined outputs are STDERR, STDOUT, two specifically defined (full 
> output file name given) tar files and another tar file defined with a 
> wildcard (*.tar).
>  # The job was successfully completed and all the files are available in the 
> working directory.
>  # In the PGA storage only the files defined with the complete name were 
> brought back.
>  # STDOUT, STDERR and the wildcard output file are not in the storage.
>  # Hence they are not displayed or downloadable in the summary page.
>  # Apart from this, in the output section of the summary page the paths to 
> all the .tar files are given as a string.
>  # Please see all the images attached to this.
>  # exp ID: 
> eFindSite_on_May_30,_2019_1:38_AM_85ab2ab9-a1c0-427c-9fb3-ccde9facd8f1
> h5. TODO
> * [ ] Don't display an experiment output entry when the type is URI but the 
> value is not a data product URI



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2205) Conflicting Loggers

2019-05-02 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2205.
-
Resolution: Fixed

> Conflicting Loggers 
> 
>
> Key: AIRAVATA-2205
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2205
> Project: Airavata
>  Issue Type: Bug
>Reporter: Ajinkya
>Priority: Critical
>
> slf4j-log4j12-1.7.10.jar and log4j-1.2.17.jar need to be removed from the lib 
> directory.
> These jars conflict with the new logback integration.
> These files were removed during the new logging implementation but were added 
> back in later commits. 
> Basically, the server won't start with these two jars in the lib directory.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-1904) ARCHIVE did not happen in recovery

2019-05-02 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831876#comment-16831876
 ] 

Dimuthu Upeksha commented on AIRAVATA-1904:
---

[~eroma_a] Do we still need this ticket? Can you verify the behavior with the 
new stack?

> ARCHIVE did not happen in recovery
> --
>
> Key: AIRAVATA-1904
> URL: https://issues.apache.org/jira/browse/AIRAVATA-1904
> Project: Airavata
>  Issue Type: Bug
>  Components: GFac, PGA PHP Web Gateway
>Affects Versions: 0.16
> Environment: dev.seagrid.org
>Reporter: Eroma
>Assignee: Suresh Marru
>Priority: Critical
> Fix For: 0.18
>
>
> 1. I submitted an Amber job to Comet and, while it was active on Comet, I 
> stopped GFac. 
> 2. I started GFac again while it was in the running state. Now the experiment 
> is completed but ARCHIVE did not happen. 
> 3. ARCHIVE also does not exist in the storage location 
> gateway-user-data/dev-seagrid/Eroma2016/March_14th_2016/SLM3_AmberSander_Comet1457983469/PROCESS_b4b8f7ce-801d-403b-a286-d7755429eb84
> 4. This is the amber_sander application and the exp ID is 
> SLM3-AmberSander-Comet_59a5c095-73ba-4dd9-8d13-19709f6fa474



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2205) Conflicting Loggers

2019-05-02 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831879#comment-16831879
 ] 

Dimuthu Upeksha commented on AIRAVATA-2205:
---

Fixed in the latest distributions, so closing.

> Conflicting Loggers 
> 
>
> Key: AIRAVATA-2205
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2205
> Project: Airavata
>  Issue Type: Bug
>Reporter: Ajinkya
>Priority: Critical
>
> slf4j-log4j12-1.7.10.jar and log4j-1.2.17.jar need to be removed from the lib 
> directory.
> These jars conflict with the new logback integration.
> These files were removed during the new logging implementation but were added 
> back in later commits. 
> Basically, the server won't start with these two jars in the lib directory.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2815) First experiment fails after API server restart

2019-05-02 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2815.
-
Resolution: Fixed

> First experiment fails after API server restart
> ---
>
> Key: AIRAVATA-2815
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2815
> Project: Airavata
>  Issue Type: Bug
>  Components: Airavata API, helix implementation, Registry API
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org/home
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> After an API server restart the connections to the Helix servers drop. As a 
> result the very first experiment doesn't move beyond LAUNCHED and it fails in 
> the back end.
> Helix should have a way of re-establishing the link and having the experiment 
> reprocessed rather than failing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2738) Experiments are not actually LAUNCHED from orchestrator and not in zookeeper queue

2019-05-02 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2738.
-
Resolution: Fixed

> Experiments are not actually LAUNCHED from orchestrator and not in zookeeper 
> queue
> --
>
> Key: AIRAVATA-2738
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2738
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> When experiments are launched, the experiment status is changed to LAUNCHED. 
> But the PROCESS ID of the experiment is not actually added to the zookeeper 
> queue and hence it is not processed further by Helix. The orchestrator was 
> unable to connect to zookeeper and couldn't add the ID to the queue, and 
> there were no errors in the orchestrator log either.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2884) Unusual delay in helix job submission

2019-05-02 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2884.
-
Resolution: Fixed

> Unusual delay in helix job submission
> -
>
> Key: AIRAVATA-2884
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2884
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
> Attachments: Screen Shot 2018-09-25 at 2.03.59 PM.png
>
>
> # Unusual delay in job submission from Helix.
>  # The job submission happened 8, 10, etc. minutes after the experiment 
> creation.
>  # Some of the IDs to check
>  ## SLM001-Gaussian-Carbonate:9_3d5f55c2-c3bf-47f2-939a-5c35585f12bb
>  ## SLM001-NEK5000-BR2:9_51ba9624-0db9-4e07-8efc-8d224a71081e
>  # There are some which were created a long time ago and the job was never submitted
>  ## SLM001-NEK5000-BR2:9_08ccf14f-34d9-468c-ac9f-6ffe434cbef7
>  ## SLM001-NEK5000-BR2:8_cff5e293-4a84-4835-bbce-fce10acaa254



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2884) Unusual delay in helix job submission

2019-05-02 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831956#comment-16831956
 ] 

Dimuthu Upeksha commented on AIRAVATA-2884:
---

Fixed in the new stack: 
[https://github.com/apache/airavata/commit/27eb5129e76dd8d0be7992a8c6b099314d1f5b7e]

This occurred due to a bug in the task creation logic where different tasks 
were getting the same id and eventually getting stacked up in the Helix queues.
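
Since the delay is attributed to duplicate task ids, a hedged sketch of the 
obvious remedy (illustrative names, not the referenced commit): build task ids 
from the process id and task type plus a random suffix so retries never collide 
in the Helix queues.

{code:java}
import java.util.UUID;

// Illustrative task-id builder: the UUID suffix keeps two tasks created for the
// same process and task type from colliding in the Helix queues.
public final class TaskIds {

    public static String newTaskId(String processId, String taskType) {
        return "TASK_" + taskType + "_" + processId + "_" + UUID.randomUUID();
    }

    private TaskIds() { }
}
{code}

For example, newTaskId(processId, "JOB_SUBMISSION") yields a distinct id on 
every retry, so a retried task no longer shadows an earlier one with the same 
id.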

> Unusual delay in helix job submission
> -
>
> Key: AIRAVATA-2884
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2884
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
> Attachments: Screen Shot 2018-09-25 at 2.03.59 PM.png
>
>
> # Unusual delay in job submission from Helix.
>  # The job submission happened 8, 10, etc. minutes after the experiment 
> creation.
>  # Some of the IDs to check
>  ## SLM001-Gaussian-Carbonate:9_3d5f55c2-c3bf-47f2-939a-5c35585f12bb
>  ## SLM001-NEK5000-BR2:9_51ba9624-0db9-4e07-8efc-8d224a71081e
>  # There are some which were created a long time ago and the job was never submitted
>  ## SLM001-NEK5000-BR2:9_08ccf14f-34d9-468c-ac9f-6ffe434cbef7
>  ## SLM001-NEK5000-BR2:8_cff5e293-4a84-4835-bbce-fce10acaa254



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2815) First experiment fails after API server restart

2019-05-02 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831959#comment-16831959
 ] 

Dimuthu Upeksha commented on AIRAVATA-2815:
---

Fixed in

[https://github.com/apache/airavata/commit/26d3f1a668adf9d069a46f434d461cea4eb23490]

[https://github.com/apache/airavata/commit/feea5203dfb4fdd70caa994794b9bbb15b2ccd8d]

[https://github.com/apache/airavata/commit/55b3dd6b9f958871288be7db482a885b41c09503]

[https://github.com/apache/airavata/commit/82c57c7d637be78a16ab4f954a54ce70d56e2f12]

> First experiment fails after API server restart
> ---
>
> Key: AIRAVATA-2815
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2815
> Project: Airavata
>  Issue Type: Bug
>  Components: Airavata API, helix implementation, Registry API
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org/home
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> After an API server restart the connections to the Helix servers drop. As a 
> result the very first experiment doesn't move beyond LAUNCHED and it fails in 
> the back end.
> Helix should have a way of re-establishing the link and having the experiment 
> reprocessed rather than failing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2738) Experiments are not actually LAUNCHED from orchestrator and not in zookeeper queue

2019-05-02 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831962#comment-16831962
 ] 

Dimuthu Upeksha commented on AIRAVATA-2738:
---

Fixed by moving the zookeeper-level metadata storage to the database.

> Experiments are not actually LAUNCHED from orchestrator and not in zookeeper 
> queue
> --
>
> Key: AIRAVATA-2738
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2738
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> When experiments are launched, the experiment status is changed to LAUNCHED. 
> But the PROCESS ID of the experiment is not actually added to the zookeeper 
> queue and hence it is not processed further by Helix. The orchestrator was 
> unable to connect to zookeeper and couldn't add the ID to the queue, and 
> there were no errors in the orchestrator log either.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2807) Helix: use groupResourceProfileId on ProcessModel

2019-05-03 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2807.
-
Resolution: Fixed

> Helix: use groupResourceProfileId on ProcessModel
> -
>
> Key: AIRAVATA-2807
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2807
> Project: Airavata
>  Issue Type: Story
>Reporter: Marcus Christie
>Assignee: Dimuthu Upeksha
>Priority: Major
>
> See AIRAVATA-2696 for details and pull request that added support to GFac.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2829) Job and experiment both completed as expected but STDOUT is not available as an output in the gateway

2019-05-03 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832664#comment-16832664
 ] 

Dimuthu Upeksha commented on AIRAVATA-2829:
---

[~eroma_a] Can you verify this with the latest changes?

> Job and experiment both completed as expected but STDOUT is not available as 
> an output in the gateway
> -
>
> Key: AIRAVATA-2829
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2829
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.seagrid.org/
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Stopped the pre WM (I don't think this has any link to the issue) and 
> launched experiments.
>  # After the experiments were launched, started the pre WM.
>  # All the jobs got launched on the remote clusters and completed 
> successfully, as did the experiments.
>  # On the Slurm machine Comet, stdout is not available as an output.
>  # But it is available and it is in the ARCHIVE directory.
>  # This is a configured output and it should be available as an output in 
> the experiment summary.
>  ## exp ID: SLM001-Gaussian-Comet0_3d801bf7-02d4-4205-a2f8-65b3dda9d6fc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRAVATA-2955) Helix controller does not get stopped when server is stopped. Had to kill the process to stop the server

2019-05-03 Thread Dimuthu Upeksha (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRAVATA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dimuthu Upeksha closed AIRAVATA-2955.
-
Resolution: Fixed

> Helix controller does not get stopped when server is stopped. Had to kill the 
> process to stop the server
> 
>
> Key: AIRAVATA-2955
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2955
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> The experiments were not moving forward from EXECUTING; no tasks were 
> executed and hence no job was submitted. Then the controller was stopped, but 
> when checked, the process was still running and had not stopped correctly. 
>  
> Then I had to do a kill -9 with the process ID to stop it, and started the 
> server again.
> Why the server needed a restart was not very clear.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2955) Helix controller does not get stopped when server is stopped. Had to kill the process to stop the server

2019-05-03 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832641#comment-16832641
 ] 

Dimuthu Upeksha commented on AIRAVATA-2955:
---

Fixed in 
[https://github.com/apache/airavata/commit/8183162556fcfb1ce81257d60f989d4cbbadd911]

> Helix controller does not get stopped when server is stopped. Had to kill the 
> process to stop the server
> 
>
> Key: AIRAVATA-2955
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2955
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> The experiments were not moving forward from EXECUTING; no tasks were 
> executed and hence no job was submitted. Then the controller was stopped, but 
> when checked, the process was still running and had not stopped correctly. 
>  
> Then I had to do a kill -9 with the process ID to stop it, and started the 
> server again.
> Why the server needed a restart was not very clear.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRAVATA-2749) Experiment status not updated, but job is COMPLETED and outputs are staged.

2019-05-03 Thread Dimuthu Upeksha (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRAVATA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832643#comment-16832643
 ] 

Dimuthu Upeksha commented on AIRAVATA-2749:
---

Fixed in the latest improvements. Please reopen if necessary.

> Experiment status not updated, but job is COMPLETED and outputs are staged.
> ---
>
> Key: AIRAVATA-2749
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2749
> Project: Airavata
>  Issue Type: Bug
>  Components: helix implementation
>Affects Versions: 0.18
>Reporter: Eroma
>Assignee: Dimuthu Upeksha
>Priority: Major
> Fix For: 0.18
>
>
> # Experiment launched and job submitted and completed successfully.
>  # Output files are also staged and available in the gateway portal.
>  # Experiment status not changed to COMPLETED; still in EXECUTING.
>  # No errors in the participant or controller logs.
> exp ID: SLM001-Gaussian-Carbonate_8473f6fc-5d24-4101-84db-1b05c46ba882



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

