Yes. You are right.
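For example, setting the expiry when the workflow is built might look
roughly like this (a minimal sketch against the 0.8.x task APIs; the
manager setup, workflow name, and jobs are placeholders):

    import org.apache.helix.HelixManager;
    import org.apache.helix.task.TaskDriver;
    import org.apache.helix.task.Workflow;
    import org.apache.helix.task.WorkflowConfig;

    public static void startWithExpiry(HelixManager helixManager) {
        // Expiry is given in milliseconds and only starts counting once
        // the workflow COMPLETES; failed or never-scheduled workflows
        // are kept around for investigation.
        WorkflowConfig config = new WorkflowConfig.Builder("MyWorkflow")
            .setExpiry(2 * 60 * 60 * 1000L) // 2 hours
            .build();

        Workflow workflow = new Workflow.Builder("MyWorkflow")
            .setWorkflowConfig(config)
            // .addJob(...) job definitions omitted
            .build();

        new TaskDriver(helixManager).start(workflow);
    }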
On Tue, Nov 13, 2018 at 7:32 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi Junkai,
>
> Thanks a lot. I'll try with the expiry time then. Is this [1] the place
> where Helix has implemented this logic? If so, the default expiry time
> should be 24 hours. Am I right?
>
> [1] https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L711
>
> Thanks
> Dimuthu
>
> On Mon, Nov 12, 2018 at 10:17 PM Xue Junkai <junkai....@gmail.com> wrote:
>
>> 1 and 2 are correct. 3 is wrong. The expiry time starts counting only
>> when the workflow is completed. If it is not scheduled (does not have
>> enough resources) or is still running, Helix never deletes it.
>>
>> On Sun, Nov 11, 2018 at 8:01 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>
>>> Hi Junkai,
>>>
>>> Thanks for the clarification. That helped a lot.
>>>
>>> In our case, each task of the workflow depends on the previous task,
>>> so there is no parallel execution. And we are not using Job Queues.
>>>
>>> Regarding the expiry time, what are the rules that you impose on it?
>>> For example, let's say I set an expiry time of 2 hours. I assume the
>>> following situations are covered in Helix:
>>>
>>> 1. Even though the workflow is completed before 2 hours, resources
>>> related to that workflow will not be cleared until the 2 hours have
>>> elapsed, and exactly after 2 hours all the resources will be cleared
>>> by the framework.
>>> 2. If the workflow failed, resources will not be cleared even after 2
>>> hours.
>>> 3. If the workflow wasn't scheduled within 2 hours in a participant,
>>> it will be deleted.
>>>
>>> Is my understanding correct?
>>>
>>> Thanks
>>> Dimuthu
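As a quick way to verify which of these resources actually remain after
the expiry period, something like the sketch below could help (TaskDriver
accessors as in the 0.8.x API; "MyWorkflow" is a placeholder):

    import org.apache.helix.HelixManager;
    import org.apache.helix.task.TaskDriver;
    import org.apache.helix.task.WorkflowConfig;
    import org.apache.helix.task.WorkflowContext;

    public static void checkCleanup(HelixManager manager) {
        TaskDriver driver = new TaskDriver(manager);
        WorkflowConfig config = driver.getWorkflowConfig("MyWorkflow");
        WorkflowContext context = driver.getWorkflowContext("MyWorkflow");
        if (config == null && context == null) {
            System.out.println("Workflow fully cleaned up");
        } else if (config == null) {
            // Config and ideal state removed but the context znode
            // survived: the stale state described in point 3 below.
            System.out.println("Stale workflow context left behind");
        }
    }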
>>> On Sat, Nov 10, 2018 at 4:26 PM Xue Junkai <junkai....@gmail.com> wrote:
>>>
>>>> Hi Dimuthu,
>>>>
>>>> A couple of things here:
>>>> 1. Only a JobQueue in Helix is a single-branch DAG with one job
>>>> running at a time, and only if you define the parallel job number to
>>>> be 1. Otherwise, you may see many jobs running at the same time if
>>>> you set the parallel job number to a different value. For a generic
>>>> workflow, all jobs without dependencies can be dispatched together.
>>>> 2. Helix only cleans up completed generic workflows by deleting all
>>>> the related znodes, not JobQueues. For a JobQueue you have to set up
>>>> a periodic purge time. As Helix defines it, a JobQueue never
>>>> finishes: it can only be terminated by a manual kill, and it can
>>>> keep accepting dynamic jobs. Thus you have to know whether your
>>>> workflow is a generic workflow or a JobQueue. For a failed generic
>>>> workflow, even if you set the expiry time, Helix will not clean it
>>>> up, as Helix would like to keep it for further user investigation.
>>>> 3. For the Helix controller, if Helix failed to clean up a workflow,
>>>> what you will see is a workflow with a context but no resource
>>>> config or ideal state. This happens when the ZK write fails to clean
>>>> up the last piece, the context node, and without an ideal state
>>>> there is nothing to trigger the cleanup again for this workflow.
>>>>
>>>> Please take a look at this task framework tutorial for the detailed
>>>> configurations:
>>>> https://helix.apache.org/0.8.2-docs/tutorial_task_framework.html
>>>>
>>>> Best,
>>>>
>>>> Junkai
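A rough sketch of a queue configured along those lines (names are
placeholders; setJobPurgeInterval is my assumption for the periodic purge
setting, so please check that your Helix version exposes it):

    import org.apache.helix.task.JobQueue;
    import org.apache.helix.task.TaskDriver;
    import org.apache.helix.task.WorkflowConfig;

    public static void createQueue(TaskDriver taskDriver) throws Exception {
        // One job running at a time; completed jobs purged periodically.
        WorkflowConfig queueConfig = new WorkflowConfig.Builder("MyQueue")
            .setParallelJobs(1)
            .setJobPurgeInterval(60 * 60 * 1000L) // hourly purge (assumed API)
            .build();

        JobQueue queue = new JobQueue.Builder("MyQueue")
            .setWorkflowConfig(queueConfig)
            .build();

        taskDriver.createQueue(queue);
    }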
>>>> On Sat, Nov 10, 2018 at 8:29 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>
>>>>> Hi Junkai,
>>>>>
>>>>> Thanks for the clarification. There are a few special properties in
>>>>> our workflows. All the workflows are single-branch DAGs, so there
>>>>> will be only one job running at a time. By looking at the log, I
>>>>> could see that only the task with this error has failed. The cleanup
>>>>> agent deleted this workflow after this task failed, so it is clear
>>>>> that no other task is triggering this issue (I checked the
>>>>> timestamps).
>>>>>
>>>>> However, for the time being, I have disabled the cleanup agent. The
>>>>> reason for adding this agent is that Helix became slow to schedule
>>>>> pending jobs when the load was high, and the participant was waiting
>>>>> for a few minutes without running anything. We discussed this on the
>>>>> thread "Sporadic delays in task execution". Before implementing this
>>>>> agent, I noticed that there were lots of uncleared znodes related to
>>>>> Completed and Failed workflows, and I thought that was what slowed
>>>>> down the controller / participant. After implementing this agent,
>>>>> things went smoothly until this point.
>>>>>
>>>>> Now I understand that you have your own workflow cleanup logic
>>>>> implemented in Helix, but we might need to tune it to our case. Can
>>>>> you point me to the code / documentation where I can get an idea
>>>>> about that?
>>>>>
>>>>> And this is for my understanding: let's say that for some reason
>>>>> Helix failed to clean up completed workflows and related resources
>>>>> in ZK. Will that affect the performance of the controller /
>>>>> participant? My understanding was that Helix registers ZK watchers
>>>>> for all the paths irrespective of the status of the workflow / job /
>>>>> task. Please correct me if I'm wrong.
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>> On Sat, Nov 10, 2018 at 1:49 AM Xue Junkai <junkai....@gmail.com> wrote:
>>>>>
>>>>>> It is possible. For example, if other jobs caused the workflow to
>>>>>> fail, that will trigger the monitoring to clean up the workflow.
>>>>>> Then, if this job is still running, you may see the problem. That's
>>>>>> what I was trying to ask about: an extra thread deleting/cleaning
>>>>>> workflows.
>>>>>>
>>>>>> I can understand it cleaning up the failed workflows. But I am
>>>>>> wondering why not just set the expiry and let the Helix controller
>>>>>> do the cleanup for completed workflows.
>>>>>>
>>>>>> On Sat, Nov 10, 2018 at 1:30 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Junkai,
>>>>>>>
>>>>>>> There is a cleanup agent [1] that monitors the currently available
>>>>>>> workflows and deletes completed and failed ones to clear up
>>>>>>> ZooKeeper storage. Do you think this could be causing the issue?
>>>>>>>
>>>>>>> [1] https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>> On Fri, Nov 9, 2018 at 11:14 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Junkai,
>>>>>>>>
>>>>>>>> There is no manual workflow killing logic implemented, but as you
>>>>>>>> have suggested, I need to verify that. Unfortunately, all the
>>>>>>>> Helix log levels on our servers were set to WARN, as Helix prints
>>>>>>>> a whole lot of logs at INFO level, so there is not much valuable
>>>>>>>> information in the logs. Can you specify which class prints the
>>>>>>>> logs associated with workflow termination, and I'll enable DEBUG
>>>>>>>> level for that class and observe further.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>> On Fri, Nov 9, 2018 at 9:18 PM Xue Junkai <junkai....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hmm, that's very strange. The user content store znode is only
>>>>>>>>> deleted when the workflow is gone. From the log, it shows the
>>>>>>>>> znode is gone. Could you please try to dig through the log to
>>>>>>>>> find out whether the workflow has been manually killed? If
>>>>>>>>> that's the case, then it is possible you have this problem.
>>>>>>>>>
>>>>>>>>> On Fri, Nov 9, 2018 at 12:13 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Junkai,
>>>>>>>>>>
>>>>>>>>>> Thanks for your suggestion. You have captured most of the parts
>>>>>>>>>> correctly. There are two jobs, job1 and job2, and there is a
>>>>>>>>>> dependency: job2 depends on job1. Until job1 is completed, job2
>>>>>>>>>> should not be scheduled. And task 1 in job 1 is calling that
>>>>>>>>>> method, and it is not updating anyone else's content. It's just
>>>>>>>>>> putting a value at the workflow level. What do you mean by
>>>>>>>>>> keeping a key-value store at the workflow level? I already use
>>>>>>>>>> the key-value store given by Helix by calling the
>>>>>>>>>> putUserContent method.
>>>>>>>>>>
>>>>>>>>>> public void sendNextJob(String jobId) {
>>>>>>>>>>     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
>>>>>>>>>>     if (jobId != null) {
>>>>>>>>>>         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Dimuthu
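A hypothetical defensive variant of that method (not Airavata's actual
code; WORKFLOW_STARTED, NEXT_JOB, and logger are assumed to exist in the
surrounding class) that tolerates the workflow being deleted underneath a
still-running task:

    public void sendNextJob(String jobId) {
        try {
            putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
            if (jobId != null) {
                putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
            }
        } catch (Exception e) {
            // putUserContent fails with an NPE if the workflow znodes
            // are already gone; log it instead of failing the whole task.
            logger.warn("Skipping user content update; workflow may already be deleted", e);
        }
    }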
>>>>>>>>>> On Fri, Nov 9, 2018 at 2:48 PM Xue Junkai <junkai....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> In my understanding, it could be that you have job1 and job2,
>>>>>>>>>>> and the task running in job1 tries to update content for job2.
>>>>>>>>>>> Then there could be a race condition happening here, where
>>>>>>>>>>> job2 is not scheduled.
>>>>>>>>>>>
>>>>>>>>>>> If that's the case, I suggest you put the key-value store at
>>>>>>>>>>> the workflow level, since this is a cross-job operation.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Junkai
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 9, 2018 at 11:45 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Junkai,
>>>>>>>>>>>>
>>>>>>>>>>>> This method is being called inside a running task. And it
>>>>>>>>>>>> works most of the time. I have only seen this on 2 occasions
>>>>>>>>>>>> over the last few months, and both of them happened today and
>>>>>>>>>>>> yesterday.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 9, 2018 at 2:40 PM Xue Junkai <junkai....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The user content store node is created once the job has been
>>>>>>>>>>>>> scheduled. In your case, I think the job is not scheduled.
>>>>>>>>>>>>> This method is usually used in a running task.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Junkai
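For reference, the user content store can be written at three scopes from
inside a running task (keys and values here are placeholders):

    // Inside a class extending org.apache.helix.task.UserContentStore,
    // e.g. within a running Task's run() method:
    putUserContent("WORKFLOW_STARTED", "TRUE", Scope.WORKFLOW); // shared across the whole workflow
    putUserContent("jobKey", "value", Scope.JOB);               // scoped to the current job
    putUserContent("taskKey", "value", Scope.TASK);             // scoped to this task only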
>>>>>>>>>>>>> On Fri, Nov 9, 2018 at 8:19 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Helix Folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm having this sporadic issue in some tasks of our
>>>>>>>>>>>>>> workflows when we try to store a value in the workflow
>>>>>>>>>>>>>> context. I have added both the code section and the error
>>>>>>>>>>>>>> message below. Do you have an idea what's causing this?
>>>>>>>>>>>>>> Please let me know if you need further information. We are
>>>>>>>>>>>>>> using Helix 0.8.2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public void sendNextJob(String jobId) {
>>>>>>>>>>>>>>     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
>>>>>>>>>>>>>>     if (jobId != null) {
>>>>>>>>>>>>>>         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Failed to setup environment of task TASK_55096de4-2cb6-4b09-84fd-7fdddba93435
>>>>>>>>>>>>>> java.lang.NullPointerException: null
>>>>>>>>>>>>>>     at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:358)
>>>>>>>>>>>>>>     at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:356)
>>>>>>>>>>>>>>     at org.apache.helix.manager.zk.HelixGroupCommit.commit(HelixGroupCommit.java:126)
>>>>>>>>>>>>>>     at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.update(ZkCacheBaseDataAccessor.java:306)
>>>>>>>>>>>>>>     at org.apache.helix.store.zk.AutoFallbackPropertyStore.update(AutoFallbackPropertyStore.java:61)
>>>>>>>>>>>>>>     at org.apache.helix.task.TaskUtil.addWorkflowJobUserContent(TaskUtil.java:356)
>>>>>>>>>>>>>>     at org.apache.helix.task.UserContentStore.putUserContent(UserContentStore.java:78)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.core.AbstractTask.sendNextJob(AbstractTask.java:136)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.core.OutPort.invoke(OutPort.java:42)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.core.AbstractTask.onSuccess(AbstractTask.java:123)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.impl.task.AiravataTask.onSuccess(AiravataTask.java:97)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:52)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:349)
>>>>>>>>>>>>>>     at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:92)
>>>>>>>>>>>>>>     at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>>>>>>>>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>>>>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>>>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>>>>>>>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>>>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
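Conceptually, the failing code path looks like the sketch below (not the
actual Helix source): TaskUtil's anonymous DataUpdater is handed the
znode's current record, which is null once the workflow's property store
node has been deleted, so the field access throws the
NullPointerException seen above.

    import org.I0Itec.zkclient.DataUpdater;
    import org.apache.helix.ZNRecord;

    DataUpdater<ZNRecord> updater = new DataUpdater<ZNRecord>() {
        @Override
        public ZNRecord update(ZNRecord current) {
            // NPE here when the znode no longer exists and current == null
            current.setSimpleField("NEXT_JOB", "job2");
            return current;
        }
    };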
--
Junkai Xue