Hi Junkai,

Thanks for the clarification. Our workflows have a few special properties. All of them are single-branch DAGs, so only one job runs at a time. From the log, I can see that only the task with this error failed. The cleanup agent deleted this workflow after this task failed, so it is clear that no other task triggered this issue (I checked the timestamps).

However, for the time being, I have disabled the cleanup agent. The reason for adding this agent was that Helix became slow to schedule pending jobs under high load, and the participant waited for a few minutes without running anything. We discussed this in the thread "Sporadic delays in task execution". Before implementing this agent, I noticed that there were lots of uncleared znodes related to Completed and Failed workflows, and I thought that was what slowed down the controller / participant. After implementing this agent, things went smoothly until this point.

Now I understand that Helix has its own workflow cleanup logic, but we might need to tune it to our case. Can you point me to the code / documentation where I can get an idea about that?

Also, for my understanding: let's say that for some reason Helix failed to clean up completed workflows and the related resources in zk. Will that affect the performance of the controller / participant? My understanding was that Helix registers zk watchers for all the paths irrespective of the status of the workflow / job / task. Please correct me if I'm wrong.

Thanks
Dimuthu

On Sat, Nov 10, 2018 at 1:49 AM Xue Junkai <junkai....@gmail.com> wrote:

> It is possible. For example, if other jobs caused the workflow to fail, it will trigger the monitoring to clean up the workflow. Then, if this job is still running, you may see the problem. That's what I was trying to ask about: an extra thread deleting/cleaning workflows.
>
> I can understand cleaning up the failed workflow. But I am wondering why not just set an expiry and let the Helix controller do the cleanup for completed workflows.
>
> On Sat, Nov 10, 2018 at 1:30 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
>> Hi Junkai,
>>
>> There is a cleanup agent [1] that monitors the currently available workflows and deletes completed and failed workflows to free up ZooKeeper storage.
>> Do you think that this could be causing this issue?
>>
>> [1] https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
>>
>> Thanks
>> Dimuthu
>>
>> On Fri, Nov 9, 2018 at 11:14 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>
>> > Hi Junkai,
>> >
>> > There is no manual workflow killing logic implemented but, as you have suggested, I need to verify that. Unfortunately, all the Helix log levels on our servers were set to WARN, because Helix prints a whole lot of logs at INFO level, so there is not much valuable information in the logs. Can you specify which class prints the logs associated with workflow termination? I'll enable DEBUG level for that class and observe further.
>> >
>> > Thanks
>> > Dimuthu
>> >
>> > On Fri, Nov 9, 2018 at 9:18 PM Xue Junkai <junkai....@gmail.com> wrote:
>> >
>> >> Hmm, that's very strange. The user content store znode is only deleted when the workflow is gone. From the log, it shows the znode is gone. Could you please try to dig through the log to find whether the workflow has been manually killed? If that's the case, then it is possible you have the problem.
>> >>
>> >> On Fri, Nov 9, 2018 at 12:13 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>> >>
>> >> > Hi Junkai,
>> >> >
>> >> > Thanks for your suggestion. You have captured most of the parts correctly. There are two jobs, job1 and job2, and job2 depends on job1. Until job1 is completed, job2 should not be scheduled. Task 1 in job 1 is calling that method and it is not updating anyone's content. It's just putting a value at the workflow level. What do you mean by keeping a key-value store at the workflow level?
>> >> > I already use the key-value store provided by Helix by calling the putUserContent method.
>> >> >
>> >> > public void sendNextJob(String jobId) {
>> >> >     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
>> >> >     if (jobId != null) {
>> >> >         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
>> >> >     }
>> >> > }
>> >> >
>> >> > Dimuthu
>> >> >
>> >> > On Fri, Nov 9, 2018 at 2:48 PM Xue Junkai <junkai....@gmail.com> wrote:
>> >> >
>> >> > > In my understanding, it could be that you have job1 and job2. The task running in job1 tries to update content for job2. Then there could be a race condition where job2 is not yet scheduled.
>> >> > >
>> >> > > If that's the case, I suggest you put the key-value store at the workflow level, since this is a cross-job operation.
>> >> > >
>> >> > > Best,
>> >> > >
>> >> > > Junkai
>> >> > >
>> >> > > On Fri, Nov 9, 2018 at 11:45 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>> >> > >
>> >> > > > Hi Junkai,
>> >> > > >
>> >> > > > This method is being called inside a running task, and it works most of the time. I have only seen this on 2 occasions in the last few months, and they happened today and yesterday.
>> >> > > >
>> >> > > > Thanks
>> >> > > > Dimuthu
>> >> > > >
>> >> > > > On Fri, Nov 9, 2018 at 2:40 PM Xue Junkai <junkai....@gmail.com> wrote:
>> >> > > >
>> >> > > > > The user content store node will be created once the job has been scheduled. In your case, I think the job is not scheduled. This method is usually used in a running task.
>> >> > > > >
>> >> > > > > Best,
>> >> > > > >
>> >> > > > > Junkai
>> >> > > > >
>> >> > > > > On Fri, Nov 9, 2018 at 8:19 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Hi Helix Folks,
>> >> > > > > >
>> >> > > > > > I'm having this sporadic issue in some tasks of our workflows when we try to store a value in the workflow context and I have added both code section and error message below. Do you have an idea what's causing this? Please let me know if you need further information. We are using Helix 0.8.2
>> >> > > > > >
>> >> > > > > > public void sendNextJob(String jobId) {
>> >> > > > > >     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
>> >> > > > > >     if (jobId != null) {
>> >> > > > > >         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
>> >> > > > > >     }
>> >> > > > > > }
>> >> > > > > >
>> >> > > > > > Failed to setup environment of task TASK_55096de4-2cb6-4b09-84fd-7fdddba93435
>> >> > > > > > java.lang.NullPointerException: null
>> >> > > > > >     at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:358)
>> >> > > > > >     at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:356)
>> >> > > > > >     at org.apache.helix.manager.zk.HelixGroupCommit.commit(HelixGroupCommit.java:126)
>> >> > > > > >     at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.update(ZkCacheBaseDataAccessor.java:306)
>> >> > > > > >     at org.apache.helix.store.zk.AutoFallbackPropertyStore.update(AutoFallbackPropertyStore.java:61)
>> >> > > > > >     at org.apache.helix.task.TaskUtil.addWorkflowJobUserContent(TaskUtil.java:356)
>> >> > > > > >     at org.apache.helix.task.UserContentStore.putUserContent(UserContentStore.java:78)
>> >> > > > > >     at org.apache.airavata.helix.core.AbstractTask.sendNextJob(AbstractTask.java:136)
>> >> > > > > >     at org.apache.airavata.helix.core.OutPort.invoke(OutPort.java:42)
>> >> > > > > >     at org.apache.airavata.helix.core.AbstractTask.onSuccess(AbstractTask.java:123)
>> >> > > > > >     at org.apache.airavata.helix.impl.task.AiravataTask.onSuccess(AiravataTask.java:97)
>> >> > > > > >     at org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:52)
>> >> > > > > >     at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:349)
>> >> > > > > >     at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:92)
>> >> > > > > >     at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>> >> > > > > >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> >> > > > > >     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >> > > > > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>> >> > > > > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> >> > > > > >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> >> > > > > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> >> > > > > >     at java.lang.Thread.run(Thread.java:748)
>> >> > > > > >
>> >> > > > > > Thanks
>> >> > > > > > Dimuthu
>> >> > > > >
>> >> > > > > --
>> >> > > > > Junkai Xue
>> >> > >
>> >> > > --
>> >> > > Junkai Xue
>> >>
>> >> --
>> >> Junkai Xue
>
> --
> Junkai Xue
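
A note on the failure discussed in this thread: the stack trace shows the NullPointerException being raised inside an anonymous updater in TaskUtil, reached from putUserContent, and Junkai's explanation is that the user content znode is gone once the workflow is deleted. The following is a minimal sketch of that kind of race and of one defensive guard. It does not use Helix at all. The class UserContentSketch, the in-memory store map, and the methods update, putUnguarded and putGuarded are hypothetical names made up for illustration, and the assumption that the updater is handed null when the node is missing is a simplification, not a statement about Helix internals.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a ZooKeeper-backed key-value store; not Helix code.
class UserContentSketch {
    // path -> record (simple key/value fields), standing in for znodes
    static final Map<String, Map<String, String>> store = new ConcurrentHashMap<>();

    // Read-modify-write: the updater gets a copy of the current record,
    // or null if the node does not exist (assumed behaviour for this sketch).
    static void update(String path, UnaryOperator<Map<String, String>> updater) {
        Map<String, String> current = store.get(path);
        Map<String, String> updated = updater.apply(current == null ? null : new HashMap<>(current));
        if (updated != null) {
            store.put(path, updated);
        }
    }

    // Unguarded write: throws NullPointerException if the node was deleted.
    static void putUnguarded(String path, String key, String value) {
        update(path, record -> {
            record.put(key, value);
            return record;
        });
    }

    // Guarded write: returns false instead of throwing when the node is gone,
    // and does not recreate the deleted node.
    static boolean putGuarded(String path, String key, String value) {
        boolean[] written = {false};
        update(path, record -> {
            if (record == null) {
                return null; // node removed, e.g. by a cleanup thread
            }
            record.put(key, value);
            written[0] = true;
            return record;
        });
        return written[0];
    }

    public static void main(String[] args) {
        String path = "/workflow1/UserContent";
        store.put(path, new HashMap<>());
        System.out.println("write while node exists: " + putGuarded(path, "NEXT_JOB", "job2"));

        store.remove(path); // simulate a cleanup agent deleting the workflow
        System.out.println("write after deletion: " + putGuarded(path, "NEXT_JOB", "job2"));
        try {
            putUnguarded(path, "NEXT_JOB", "job2");
        } catch (NullPointerException e) {
            System.out.println("unguarded write failed with NullPointerException");
        }
    }
}
```

In a real task, a false return from such a guard could be logged and treated as "workflow already removed" rather than failing the task. Whether that is the right policy depends on the application, and it only hides the symptom: the underlying question in this thread, which component deletes the workflow while a task is still running, remains.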