Yes. You are right.

On Tue, Nov 13, 2018 at 7:32 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com>
wrote:

> Hi Junkai,
>
> Thanks a lot. I'll try the expiry time then. Is this [1] the place where
> Helix implements this logic? If so, the default expiry time should be
> 24 hours. Am I right?
>
> [1]
>
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L711
>
> Thanks
> Dimuthu
>
> On Mon, Nov 12, 2018 at 10:17 PM Xue Junkai <junkai....@gmail.com> wrote:
>
> > 1 and 2 are correct; 3 is wrong. The expiry time starts counting only when
> > the workflow is completed. If it is not scheduled (e.g., it doesn't have
> > enough resources) or is still running, Helix never deletes it.
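> >
> > For reference, a rough sketch of setting the expiry when building a
> > workflow (a minimal sketch assuming the 0.8.2 builder APIs; "Workflow1"
> > and taskDriver are placeholders, and setExpiry() takes milliseconds):
> >
> >     import java.util.concurrent.TimeUnit;
> >     import org.apache.helix.task.TaskDriver;
> >     import org.apache.helix.task.Workflow;
> >     import org.apache.helix.task.WorkflowConfig;
> >
> >     // Clean up the workflow's znodes 2 hours after it completes.
> >     // Note: failed workflows are kept regardless of the expiry.
> >     WorkflowConfig.Builder configBuilder = new WorkflowConfig.Builder()
> >         .setExpiry(TimeUnit.HOURS.toMillis(2));
> >     Workflow.Builder workflowBuilder = new Workflow.Builder("Workflow1")
> >         .setWorkflowConfig(configBuilder.build());
> >     taskDriver.start(workflowBuilder.build()); // taskDriver: an existing TaskDriver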
> >
> >
> >
> > On Sun, Nov 11, 2018 at 8:01 PM DImuthu Upeksha <
> > dimuthu.upeks...@gmail.com> wrote:
> >
> >> Hi Junkai,
> >>
> >> Thanks for the clarification. That helped a lot.
> >>
> >> In our case, each task of the workflow depends on the previous task,
> >> so there is no parallel execution. And we are not using Job Queues.
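> >>
> >> For context, here is a simplified sketch of how such a chained workflow
> >> is built (job names and the JobConfig builders are placeholders, not our
> >> actual code):
> >>
> >>     import org.apache.helix.task.JobConfig;
> >>     import org.apache.helix.task.Workflow;
> >>
> >>     // job2 is dispatched only after job1 completes, so the DAG is a
> >>     // single branch with one job running at a time.
> >>     Workflow.Builder builder = new Workflow.Builder("Workflow1");
> >>     builder.addJob("job1", job1ConfigBuilder); // JobConfig.Builder instances
> >>     builder.addJob("job2", job2ConfigBuilder);
> >>     builder.addParentChildDependency("job1", "job2");
> >>     Workflow workflow = builder.build();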
> >>
> >> Regarding the expiry time, what are the rules Helix imposes on it? For
> >> example, let's say I set an expiry time of 2 hours. I assume the
> >> following situations are covered in Helix:
> >>
> >> 1. Even if the workflow is completed before 2 hours, resources related
> >> to that workflow will not be cleared until the 2 hours have elapsed, and
> >> exactly after 2 hours all the resources will be cleared by the framework.
> >> 2. If the workflow failed, resources will not be cleared even after 2
> >> hours.
> >> 3. If the workflow wasn't scheduled on a participant within 2 hours, it
> >> will be deleted.
> >>
> >> Is my understanding correct?
> >>
> >> Thanks
> >> Dimuthu
> >>
> >>
> >> On Sat, Nov 10, 2018 at 4:26 PM Xue Junkai <junkai....@gmail.com> wrote:
> >>
> >> > Hi Dimuthu,
> >> >
> >> > A couple of things here:
> >> > 1. In Helix, only a JobQueue is a single-branch DAG with one job
> >> > running at a time, and only when its parallel job count is set to 1.
> >> > Otherwise, you may see many jobs running at the same time if you set
> >> > the parallel job count to a different number. For a generic workflow,
> >> > all jobs without dependencies can be dispatched together.
> >> > 2. Helix only cleans up completed generic workflows, by deleting all
> >> > the related znodes; it does not do this for a JobQueue. For a JobQueue
> >> > you have to set up a periodical purge time (see the sketch below). As
> >> > Helix defines it, a JobQueue never finishes: it can only be terminated
> >> > by a manual kill, and it can keep accepting dynamic jobs. So you have
> >> > to know whether your workflow is a generic workflow or a JobQueue. For
> >> > a failed generic workflow, even if you set the expiry time, Helix will
> >> > not clean it up, since Helix keeps it for further user investigation.
> >> > 3. On the Helix controller side, if Helix fails to clean up a
> >> > workflow, what you will see is a workflow with a context but no
> >> > resource config or ideal state. This happens when the ZK write that
> >> > removes the last piece, the context node, fails; and with no ideal
> >> > state left, nothing can trigger the cleanup again for that workflow.
> >> >
> >> > Please take a look at the task framework tutorial for the detailed
> >> > configurations:
> >> > https://helix.apache.org/0.8.2-docs/tutorial_task_framework.html
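> >> >
> >> > For a JobQueue, the setup looks roughly like this. This is only a
> >> > sketch: I am assuming the purge-interval setter name here, so please
> >> > verify it against the tutorial above.
> >> >
> >> >     import org.apache.helix.task.JobQueue;
> >> >     import org.apache.helix.task.WorkflowConfig;
> >> >
> >> >     // A queue never "completes", so expiry alone will not remove it;
> >> >     // a periodical purge removes its finished jobs instead.
> >> >     WorkflowConfig.Builder cfg = new WorkflowConfig.Builder()
> >> >         .setJobPurgeInterval(60 * 60 * 1000L); // assumed setter name, in ms
> >> >     JobQueue.Builder queueBuilder = new JobQueue.Builder("Queue1");
> >> >     queueBuilder.setWorkflowConfig(cfg.build());
> >> >     JobQueue queue = queueBuilder.build();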
> >> >
> >> > Best,
> >> >
> >> > Junkai
> >> >
> >> > On Sat, Nov 10, 2018 at 8:29 AM DImuthu Upeksha <
> >> > dimuthu.upeks...@gmail.com> wrote:
> >> >
> >> >> Hi Junkai,
> >> >>
> >> >> Thanks for the clarification. There are a few special properties in
> >> >> our workflows. All the workflows are single-branch DAGs, so there will
> >> >> be only one job running at a time. By looking at the log, I could see
> >> >> that only the task with this error failed. The cleanup agent deleted
> >> >> this workflow after the task failed, so it is clear that no other task
> >> >> is triggering this issue (I checked the timestamps).
> >> >>
> >> >> For now, I have disabled the cleanup agent. The reason for adding this
> >> >> agent was that Helix became slow to schedule pending jobs when the
> >> >> load was high, and the participant was waiting for a few minutes
> >> >> without running anything. We discussed this on the thread "Sporadic
> >> >> delays in task execution". Before implementing the agent, I noticed
> >> >> that there were lots of uncleared znodes related to Completed and
> >> >> Failed workflows, and I thought that was the reason the controller /
> >> >> participant slowed down. After implementing the agent, things went
> >> >> smoothly until this point.
> >> >>
> >> >> Now I understand that Helix has its own workflow cleanup logic, but we
> >> >> might need to tune it to our case. Can you point me to code /
> >> >> documentation where I can get an idea about that?
> >> >>
> >> >> And just for my understanding: let's say that for some reason Helix
> >> >> failed to clean up completed workflows and related resources in ZK.
> >> >> Will that affect the performance of the controller / participant? My
> >> >> understanding was that Helix registers ZK watchers for all the paths
> >> >> irrespective of the status of the workflow / job / task. Please
> >> >> correct me if I'm wrong.
> >> >>
> >> >> Thanks
> >> >> Dimuthu
> >> >>
> >> >> On Sat, Nov 10, 2018 at 1:49 AM Xue Junkai <junkai....@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > It is possible. For example, if other jobs caused the workflow to
> >> >> > fail, that would trigger your monitoring to clean up the workflow;
> >> >> > then, if this job is still running, you may see the problem. That is
> >> >> > what I was trying to ask about: an extra thread deleting/cleaning
> >> >> > workflows.
> >> >> >
> >> >> > I can understand it cleaning up failed workflows. But I am wondering
> >> >> > why you don't just set the expiry and let the Helix controller do
> >> >> > the cleanup of completed workflows.
> >> >> >
> >> >> > On Sat, Nov 10, 2018 at 1:30 PM DImuthu Upeksha <
> >> >> > dimuthu.upeks...@gmail.com> wrote:
> >> >> >
> >> >> >> Hi Junkai,
> >> >> >>
> >> >> >> There is a cleanup agent [1] that monitors the currently available
> >> >> >> workflows and deletes completed and failed workflows to clear up
> >> >> >> ZooKeeper storage. Do you think this could be causing the issue?
> >> >> >> The core of what it does is sketched below.
> >> >> >>
> >> >> >> [1]
> >> >> >> https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
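> >> >> >>
> >> >> >> A simplified sketch of the agent's logic (not the actual code; see
> >> >> >> [1] for the real implementation):
> >> >> >>
> >> >> >>     import java.util.Map;
> >> >> >>     import org.apache.helix.task.TaskDriver;
> >> >> >>     import org.apache.helix.task.TaskState;
> >> >> >>     import org.apache.helix.task.WorkflowConfig;
> >> >> >>     import org.apache.helix.task.WorkflowContext;
> >> >> >>
> >> >> >>     // Periodically scan all workflows and delete the ones that have
> >> >> >>     // reached a terminal state (taskDriver is an existing TaskDriver).
> >> >> >>     Map<String, WorkflowConfig> workflows = taskDriver.getWorkflows();
> >> >> >>     for (String name : workflows.keySet()) {
> >> >> >>         WorkflowContext context = taskDriver.getWorkflowContext(name);
> >> >> >>         if (context != null
> >> >> >>                 && (context.getWorkflowState() == TaskState.COMPLETED
> >> >> >>                     || context.getWorkflowState() == TaskState.FAILED)) {
> >> >> >>             taskDriver.delete(name);
> >> >> >>         }
> >> >> >>     }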
> >> >> >>
> >> >> >> Thanks
> >> >> >> Dimuthu
> >> >> >>
> >> >> >> On Fri, Nov 9, 2018 at 11:14 PM DImuthu Upeksha <
> >> >> >> dimuthu.upeks...@gmail.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >> > Hi Junkai,
> >> >> >> >
> >> >> >> > There is no manual workflow-killing logic implemented, but as you
> >> >> >> > suggested, I need to verify that. Unfortunately, all the Helix log
> >> >> >> > levels on our servers were set to WARN, since Helix prints a whole
> >> >> >> > lot of logs at INFO level, so there is not much valuable
> >> >> >> > information in the logs. Can you specify which class prints the
> >> >> >> > logs associated with workflow termination? I'll enable DEBUG level
> >> >> >> > for that class and observe further.
> >> >> >> >
> >> >> >> > Thanks
> >> >> >> > Dimuthu
> >> >> >> >
> >> >> >> > On Fri, Nov 9, 2018 at 9:18 PM Xue Junkai <junkai....@gmail.com>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> Hmm, that's very strange. The user content store znode is only
> >> >> >> >> deleted when the workflow is gone, and from the log it appears
> >> >> >> >> the znode is gone. Could you please dig through the log to find
> >> >> >> >> out whether the workflow was manually killed? If that's the case,
> >> >> >> >> that could explain the problem.
> >> >> >> >>
> >> >> >> >> On Fri, Nov 9, 2018 at 12:13 PM DImuthu Upeksha <
> >> >> >> >> dimuthu.upeks...@gmail.com>
> >> >> >> >> wrote:
> >> >> >> >>
> >> >> >> >> > Hi Junkai,
> >> >> >> >> >
> >> >> >> >> > Thanks for your suggestion. You have captured most of it
> >> >> >> >> > correctly. There are two jobs, job1 and job2, and job2 depends
> >> >> >> >> > on job1: until job1 is completed, job2 should not be scheduled.
> >> >> >> >> > Task 1 in job1 is calling that method, and it is not updating
> >> >> >> >> > any other job's content; it's just putting a value at the
> >> >> >> >> > workflow level. What do you mean by keeping a key-value store
> >> >> >> >> > at the workflow level? I already use the key-value store Helix
> >> >> >> >> > provides, by calling the putUserContent method:
> >> >> >> >> >
> >> >> >> >> > public void sendNextJob(String jobId) {
> >> >> >> >> >     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
> >> >> >> >> >     if (jobId != null) {
> >> >> >> >> >         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
> >> >> >> >> >     }
> >> >> >> >> > }
> >> >> >> >> >
> >> >> >> >> > Dimuthu
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Fri, Nov 9, 2018 at 2:48 PM Xue Junkai <junkai....@gmail.com>
> >> >> >> >> > wrote:
> >> >> >> >> >
> >> >> >> >> > > My understanding was that you have job1 and job2, and the
> >> >> >> >> > > task running in job1 tries to update content for job2. In
> >> >> >> >> > > that case there could be a race condition, because job2 is
> >> >> >> >> > > not scheduled yet.
> >> >> >> >> > >
> >> >> >> >> > > If that's the case, I suggest you put the key-value store at
> >> >> >> >> > > the workflow level, since this is a cross-job operation.
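> >> >> >> >> > >
> >> >> >> >> > > For example, the task in job2 can read back what job1 wrote,
> >> >> >> >> > > with both sides using Scope.WORKFLOW (a minimal sketch; the
> >> >> >> >> > > NEXT_JOB key name is taken from your snippet):
> >> >> >> >> > >
> >> >> >> >> > >     // Inside a task of job2 (a subclass of UserContentStore),
> >> >> >> >> > >     // read the value that job1's task stored at workflow scope.
> >> >> >> >> > >     String nextJob = getUserContent(NEXT_JOB, Scope.WORKFLOW);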
> >> >> >> >> > >
> >> >> >> >> > > Best,
> >> >> >> >> > >
> >> >> >> >> > > Junkai
> >> >> >> >> > >
> >> >> >> >> > > On Fri, Nov 9, 2018 at 11:45 AM DImuthu Upeksha <
> >> >> >> >> > > dimuthu.upeks...@gmail.com>
> >> >> >> >> > > wrote:
> >> >> >> >> > >
> >> >> >> >> > > > Hi Junkai,
> >> >> >> >> > > >
> >> >> >> >> > > > This method is being called inside a running task, and it
> >> >> >> >> > > > works most of the time. I have only seen this on two
> >> >> >> >> > > > occasions in the last few months, and both happened today
> >> >> >> >> > > > and yesterday.
> >> >> >> >> > > >
> >> >> >> >> > > > Thanks
> >> >> >> >> > > > Dimuthu
> >> >> >> >> > > >
> >> >> >> >> > > > On Fri, Nov 9, 2018 at 2:40 PM Xue Junkai <junkai....@gmail.com>
> >> >> >> >> > > > wrote:
> >> >> >> >> > > >
> >> >> >> >> > > > > The user content store node is created once the job has
> >> >> >> >> > > > > been scheduled. In your case, I think the job is not
> >> >> >> >> > > > > scheduled yet. This method is usually used inside a
> >> >> >> >> > > > > running task.
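> >> >> >> >> > > > >
> >> >> >> >> > > > > In other words, the usual pattern looks like this (a
> >> >> >> >> > > > > minimal sketch; MyTask and the key name are placeholders):
> >> >> >> >> > > > >
> >> >> >> >> > > > >     import org.apache.helix.task.Task;
> >> >> >> >> > > > >     import org.apache.helix.task.TaskResult;
> >> >> >> >> > > > >     import org.apache.helix.task.UserContentStore;
> >> >> >> >> > > > >
> >> >> >> >> > > > >     // putUserContent() is only safe once the task is
> >> >> >> >> > > > >     // actually running, i.e. its job has been scheduled
> >> >> >> >> > > > >     // and the user content znodes exist.
> >> >> >> >> > > > >     public class MyTask extends UserContentStore implements Task {
> >> >> >> >> > > > >       @Override
> >> >> >> >> > > > >       public TaskResult run() {
> >> >> >> >> > > > >         putUserContent("WORKFLOW_STARTED", "TRUE", Scope.WORKFLOW);
> >> >> >> >> > > > >         return new TaskResult(TaskResult.Status.COMPLETED, null);
> >> >> >> >> > > > >       }
> >> >> >> >> > > > >
> >> >> >> >> > > > >       @Override
> >> >> >> >> > > > >       public void cancel() {
> >> >> >> >> > > > >       }
> >> >> >> >> > > > >     }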
> >> >> >> >> > > > >
> >> >> >> >> > > > > Best,
> >> >> >> >> > > > >
> >> >> >> >> > > > > Junkai
> >> >> >> >> > > > >
> >> >> >> >> > > > > On Fri, Nov 9, 2018 at 8:19 AM DImuthu Upeksha <
> >> >> >> >> > > > > dimuthu.upeks...@gmail.com> wrote:
> >> >> >> >> > > > >
> >> >> >> >> > > > > > Hi Helix Folks,
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > I'm having a sporadic issue in some tasks of our
> >> >> >> >> > > > > > workflows when we try to store a value in the workflow
> >> >> >> >> > > > > > context. I have added both the code section and the
> >> >> >> >> > > > > > error message below. Do you have an idea what's causing
> >> >> >> >> > > > > > this? Please let me know if you need further
> >> >> >> >> > > > > > information. We are using Helix 0.8.2.
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > public void sendNextJob(String jobId) {
> >> >> >> >> > > > > >     putUserContent(WORKFLOW_STARTED, "TRUE",
> >> >> Scope.WORKFLOW);
> >> >> >> >> > > > > >     if (jobId != null) {
> >> >> >> >> > > > > >         putUserContent(NEXT_JOB, jobId,
> >> Scope.WORKFLOW);
> >> >> >> >> > > > > >     }
> >> >> >> >> > > > > > }
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > Failed to setup environment of task
> >> >> >> >> > > > > > TASK_55096de4-2cb6-4b09-84fd-7fdddba93435
> >> >> >> >> > > > > > java.lang.NullPointerException: null
> >> >> >> >> > > > > >         at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:358)
> >> >> >> >> > > > > >         at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:356)
> >> >> >> >> > > > > >         at org.apache.helix.manager.zk.HelixGroupCommit.commit(HelixGroupCommit.java:126)
> >> >> >> >> > > > > >         at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.update(ZkCacheBaseDataAccessor.java:306)
> >> >> >> >> > > > > >         at org.apache.helix.store.zk.AutoFallbackPropertyStore.update(AutoFallbackPropertyStore.java:61)
> >> >> >> >> > > > > >         at org.apache.helix.task.TaskUtil.addWorkflowJobUserContent(TaskUtil.java:356)
> >> >> >> >> > > > > >         at org.apache.helix.task.UserContentStore.putUserContent(UserContentStore.java:78)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.core.AbstractTask.sendNextJob(AbstractTask.java:136)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.core.OutPort.invoke(OutPort.java:42)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.core.AbstractTask.onSuccess(AbstractTask.java:123)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.impl.task.AiravataTask.onSuccess(AiravataTask.java:97)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:52)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:349)
> >> >> >> >> > > > > >         at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:92)
> >> >> >> >> > > > > >         at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
> >> >> >> >> > > > > >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >> >> >> >> > > > > >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> >> >> >> > > > > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> >> >> >> >> > > > > >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> >> >> >> >> > > > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> >> >> >> > > > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> >> >> >> > > > > >         at java.lang.Thread.run(Thread.java:748)
> >> >> >> >> > > > > >
> >> >> >> >> > > > > > Thanks
> >> >> >> >> > > > > > Dimuthu
> >> >> >> >> > > > > >
> >> >> >> >> > > > >
> >> >> >> >> > > > >
> >> >> >> >> > > > > --
> >> >> >> >> > > > > Junkai Xue
> >> >> >> >> > > > >
> >> >> >> >> > > >
> >> >> >> >> > >
> >> >> >> >> > >
> >> >> >> >> > > --
> >> >> >> >> > > Junkai Xue
> >> >> >> >> > >
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Junkai Xue
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Junkai Xue
> >> >> >
> >> >>
> >> >
> >> >
> >> > --
> >> > Junkai Xue
> >> >
> >>
> >
> >
> > --
> > Junkai Xue
> >
>


-- 
Junkai Xue
