Hi Junkai,

Thanks a lot. I'll try with the expiry time then. Is this [1] the place where Helix has implemented this logic? If so, the default expiry time should be 24 hours. Am I right?
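For reference, this is roughly what I am planning to try. It is only a sketch with illustrative names, and it assumes Workflow.Builder#setExpiry takes milliseconds and delegates to the WorkflowConfig:

import java.util.concurrent.TimeUnit;

import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

public class ExpirySketch {
    // Submits a two-job, single-branch workflow that the controller may purge
    // two hours after it completes (instead of the 24h default, if my reading
    // of TaskUtil above is right).
    public static void submit(TaskDriver driver, JobConfig.Builder job1, JobConfig.Builder job2) {
        Workflow workflow = new Workflow.Builder("my-workflow")
                .addJob("job1", job1)
                .addJob("job2", job2)
                .addParentChildDependency("job1", "job2") // job2 runs only after job1
                .setExpiry(TimeUnit.HOURS.toMillis(2))    // expiry in milliseconds
                .build();
        driver.start(workflow);
    }
}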
[1]
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L711

Thanks
Dimuthu

On Mon, Nov 12, 2018 at 10:17 PM Xue Junkai <junkai....@gmail.com> wrote:

> 1 and 2 are correct. 3 is wrong. The expiry time starts counting only when
> the workflow is completed. If it is not scheduled (it doesn't have enough
> resources) or is still running, Helix never deletes it.
>
> On Sun, Nov 11, 2018 at 8:01 PM DImuthu Upeksha <
> dimuthu.upeks...@gmail.com> wrote:
>
>> Hi Junkai,
>>
>> Thanks for the clarification. That helped a lot.
>>
>> In our case, each task of the workflow depends on the previous task, so
>> there is no parallel execution. And we are not using Job Queues.
>>
>> Regarding the expiry time, what are the rules that you impose on it? For
>> example, let's say I set an expiry time of 2 hours. I assume the following
>> situations are covered in Helix:
>>
>> 1. Even though the workflow is completed before 2 hours, resources related
>> to that workflow will not be cleared until the 2 hours have elapsed, and
>> exactly after 2 hours, all the resources will be cleared by the framework.
>> 2. If the workflow failed, resources will not be cleared even after 2 hours.
>> 3. If the workflow wasn't scheduled within 2 hours on a participant, it
>> will be deleted.
>>
>> Is my understanding correct?
>>
>> Thanks
>> Dimuthu
>>
>> On Sat, Nov 10, 2018 at 4:26 PM Xue Junkai <junkai....@gmail.com> wrote:
>>
>> > Hi Dimuthu,
>> >
>> > A couple of things here:
>> > 1. Only a JobQueue in Helix is a single-branch DAG with one job running
>> > at a time, and that is when the parallel job count is set to 1.
>> > Otherwise, you may see many jobs running at the same time if you set the
>> > parallel job count to a different number. For a generic workflow, all
>> > jobs without dependencies can be dispatched together.
>> > 2. Helix only cleans up completed generic workflows by deleting all the
>> > related znodes, not JobQueues. For a JobQueue you have to set up a
>> > periodical purge time. As Helix defines it, a JobQueue never finishes; it
>> > can only be terminated by a manual kill, and it keeps accepting dynamic
>> > jobs. Thus you have to know whether your workflow is a generic workflow
>> > or a JobQueue. For a failed generic workflow, even if you set up the
>> > expiry time, Helix will not clean it up, as Helix would like to keep it
>> > for further user investigation.
>> > 3. For the Helix controller, if Helix failed to clean up a workflow, the
>> > only thing you can see is the workflow having a context but no resource
>> > config and no idealstate. This happens when the ZK write fails to clean
>> > the last piece, the context node. And there is no ideal state that can
>> > trigger the cleanup again for this workflow.
>> >
>> > Please take a look at this task framework tutorial for detailed
>> > configurations:
>> > https://helix.apache.org/0.8.2-docs/tutorial_task_framework.html
>> >
>> > Best,
>> >
>> > Junkai
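A rough sketch of the periodical purge setup for a JobQueue mentioned above, for comparison with the expiry-based cleanup of generic workflows. The queue name is illustrative, and it assumes WorkflowConfig.Builder#setJobPurgeInterval is available in this Helix version (the tutorial linked above is the authoritative reference):

import java.util.concurrent.TimeUnit;

import org.apache.helix.task.JobQueue;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.WorkflowConfig;

public class QueuePurgeSketch {
    // Creates a queue whose COMPLETED/FAILED jobs are purged periodically,
    // since a JobQueue itself never reaches a terminal state.
    public static void createQueue(TaskDriver driver) {
        WorkflowConfig config = new WorkflowConfig.Builder("my-queue")
                .setJobPurgeInterval(TimeUnit.MINUTES.toMillis(30)) // purge every 30 min
                .build();
        JobQueue queue = new JobQueue.Builder("my-queue")
                .setWorkflowConfig(config)
                .build();
        driver.createQueue(queue);
    }
}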
>> > On Sat, Nov 10, 2018 at 8:29 AM DImuthu Upeksha <
>> > dimuthu.upeks...@gmail.com> wrote:
>> >
>> >> Hi Junkai,
>> >>
>> >> Thanks for the clarification. There are a few special properties in our
>> >> workflows. All the workflows are single-branch DAGs, so there will be
>> >> only one job running at a time. By looking at the log, I could see that
>> >> only the task with this error had failed. The cleanup agent deleted this
>> >> workflow after this task failed, so it is clear that no other task is
>> >> triggering this issue (I checked the timestamps).
>> >>
>> >> However, for the time being, I have disabled the cleanup agent. The
>> >> reason for adding this agent is that Helix became slow to schedule
>> >> pending jobs when the load was high, and the participant was waiting
>> >> without running anything for a few minutes. We discussed this on the
>> >> thread "Sporadic delays in task execution". Before implementing this
>> >> agent, I noticed that there were lots of uncleared znodes related to
>> >> Completed and Failed workflows, and I thought that was the reason the
>> >> controller / participant slowed down. After implementing this agent,
>> >> things went smoothly until this point.
>> >>
>> >> Now I understand that you have your own workflow cleanup logic
>> >> implemented in Helix, but we might need to tune it to our case. Can you
>> >> point me to the code / documentation where I can get an idea about that?
>> >>
>> >> And this is just for my understanding: let's say that for some reason
>> >> Helix failed to clean up completed workflows and related resources in
>> >> ZK. Will that affect the performance of the controller / participant?
>> >> My understanding was that Helix registers ZK watchers for all the paths
>> >> irrespective of the status of the workflow / job / task. Please correct
>> >> me if I'm wrong.
>> >>
>> >> Thanks
>> >> Dimuthu
>> >>
>> >> On Sat, Nov 10, 2018 at 1:49 AM Xue Junkai <junkai....@gmail.com> wrote:
>> >>
>> >> > It is possible. For example, if other jobs caused the workflow to
>> >> > fail, that will trigger the monitoring to clean up the workflow. Then,
>> >> > if this job is still running, you may see the problem. That's what I
>> >> > was trying to ask about: an extra thread deleting/cleaning workflows.
>> >> >
>> >> > I can understand it cleaning up failed workflows. But I am wondering
>> >> > why not just set the expiry and let the Helix controller do the
>> >> > cleanup for completed workflows.
>> >> >
>> >> > On Sat, Nov 10, 2018 at 1:30 PM DImuthu Upeksha <
>> >> > dimuthu.upeks...@gmail.com> wrote:
>> >> >
>> >> >> Hi Junkai,
>> >> >>
>> >> >> There is a cleanup agent [1] that monitors the currently available
>> >> >> workflows and deletes completed and failed workflows to clear up
>> >> >> ZooKeeper storage. Do you think that this could be causing this issue?
>> >> >>
>> >> >> [1]
>> >> >> https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
>> >> >>
>> >> >> Thanks
>> >> >> Dimuthu
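An agent along those lines can be written against the public TaskDriver API. The sketch below only illustrates the idea and is not necessarily what the Airavata WorkflowCleanupAgent linked above does:

import java.util.Map;

import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.TaskState;
import org.apache.helix.task.WorkflowConfig;
import org.apache.helix.task.WorkflowContext;

public class CleanupSketch {
    // Deletes every workflow that has reached a terminal state, together
    // with its config, idealstate and context znodes.
    public static void purgeTerminalWorkflows(TaskDriver driver) {
        for (Map.Entry<String, WorkflowConfig> entry : driver.getWorkflows().entrySet()) {
            WorkflowContext context = driver.getWorkflowContext(entry.getKey());
            if (context == null) {
                continue; // never scheduled yet, nothing to clean
            }
            TaskState state = context.getWorkflowState();
            if (state == TaskState.COMPLETED || state == TaskState.FAILED) {
                driver.delete(entry.getKey());
            }
        }
    }
}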
>> >> >>
>> >> >> On Fri, Nov 9, 2018 at 11:14 PM DImuthu Upeksha <
>> >> >> dimuthu.upeks...@gmail.com> wrote:
>> >> >>
>> >> >> > Hi Junkai,
>> >> >> >
>> >> >> > There is no manual workflow killing logic implemented, but as you
>> >> >> > have suggested, I need to verify that. Unfortunately, all the Helix
>> >> >> > log levels on our servers were set to WARN, as Helix prints a whole
>> >> >> > lot of logs at INFO level, so there is not much valuable information
>> >> >> > in the logs. Can you specify which class prints the logs associated
>> >> >> > with workflow termination, and I'll enable DEBUG level for that
>> >> >> > class and observe further.
>> >> >> >
>> >> >> > Thanks
>> >> >> > Dimuthu
>> >> >> >
>> >> >> > On Fri, Nov 9, 2018 at 9:18 PM Xue Junkai <junkai....@gmail.com> wrote:
>> >> >> >
>> >> >> >> Hmm, that's very strange. The user content store znode is only
>> >> >> >> deleted when the workflow is gone. From the log, it shows the
>> >> >> >> znode is gone. Could you please try to dig through the logs to
>> >> >> >> find out whether the workflow has been manually killed? If that's
>> >> >> >> the case, then it is possible you have this problem.
>> >> >> >>
>> >> >> >> On Fri, Nov 9, 2018 at 12:13 PM DImuthu Upeksha <
>> >> >> >> dimuthu.upeks...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> > Hi Junkai,
>> >> >> >> >
>> >> >> >> > Thanks for your suggestion. You have captured most of the parts
>> >> >> >> > correctly. There are two jobs, job1 and job2, and there is a
>> >> >> >> > dependency: job2 depends on job1. Until job1 is completed, job2
>> >> >> >> > should not be scheduled. And task 1 in job 1 is calling that
>> >> >> >> > method, and it is not updating anyone else's content. It's just
>> >> >> >> > putting a value at the workflow level. What do you mean by
>> >> >> >> > keeping a key-value store at the workflow level? I already use
>> >> >> >> > the key-value store provided by Helix by calling the
>> >> >> >> > putUserContent method.
>> >> >> >> >
>> >> >> >> > public void sendNextJob(String jobId) {
>> >> >> >> >     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
>> >> >> >> >     if (jobId != null) {
>> >> >> >> >         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
>> >> >> >> >     }
>> >> >> >> > }
>> >> >> >> >
>> >> >> >> > Dimuthu
>> >> >> >> >
>> >> >> >> > On Fri, Nov 9, 2018 at 2:48 PM Xue Junkai <junkai....@gmail.com> wrote:
>> >> >> >> >
>> >> >> >> > > In my understanding, it could be that you have job1 and job2,
>> >> >> >> > > and the task running in job1 tries to update content for job2.
>> >> >> >> > > Then there could be a race condition here, where job2 is not
>> >> >> >> > > scheduled yet.
>> >> >> >> > >
>> >> >> >> > > If that's the case, I suggest you put the key-value store at
>> >> >> >> > > the workflow level, since this is a cross-job operation.
>> >> >> >> > >
>> >> >> >> > > Best,
>> >> >> >> > >
>> >> >> >> > > Junkai
>> >> >> >> > >
>> >> >> >> > > On Fri, Nov 9, 2018 at 11:45 AM DImuthu Upeksha <
>> >> >> >> > > dimuthu.upeks...@gmail.com> wrote:
>> >> >> >> > >
>> >> >> >> > > > Hi Junkai,
>> >> >> >> > > >
>> >> >> >> > > > This method is being called inside a running task. And it
>> >> >> >> > > > works most of the time. I have only seen this on two
>> >> >> >> > > > occasions in the last few months, and both of them happened
>> >> >> >> > > > today and yesterday.
>> >> >> >> > > >
>> >> >> >> > > > Thanks
>> >> >> >> > > > Dimuthu
>> >> >> >> > > >
>> >> >> >> > > > On Fri, Nov 9, 2018 at 2:40 PM Xue Junkai <junkai....@gmail.com> wrote:
>> >> >> >> > > >
>> >> >> >> > > > > The user content store node will only be created once the
>> >> >> >> > > > > job has been scheduled. In your case, I think the job is
>> >> >> >> > > > > not scheduled. This method is usually used in a running
>> >> >> >> > > > > task.
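Since the content store znode can disappear while a task is still running (for example, when an external agent deletes the workflow), one possible caller-side mitigation is to treat a failed workflow-level put as a sign that the workflow is gone instead of letting the NullPointerException escape. A sketch of such a variant of sendNextJob, assuming a logger field exists in AbstractTask (this is not the actual Airavata code):

public void sendNextJob(String jobId) {
    try {
        putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
        if (jobId != null) {
            putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
        }
    } catch (Exception e) {
        // The NPE surfaces synchronously from TaskUtil when the workflow
        // znode is already gone; log it instead of failing the task's
        // environment setup.
        logger.warn("Workflow user content is not writable; workflow may be deleted", e);
    }
}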
>> >> >> >> > > > >
>> >> >> >> > > > > Best,
>> >> >> >> > > > >
>> >> >> >> > > > > Junkai
>> >> >> >> > > > >
>> >> >> >> > > > > On Fri, Nov 9, 2018 at 8:19 AM DImuthu Upeksha <
>> >> >> >> > > > > dimuthu.upeks...@gmail.com> wrote:
>> >> >> >> > > > >
>> >> >> >> > > > > > Hi Helix Folks,
>> >> >> >> > > > > >
>> >> >> >> > > > > > I'm having a sporadic issue in some tasks of our
>> >> >> >> > > > > > workflows when we try to store a value in the workflow
>> >> >> >> > > > > > context. I have added both the code section and the
>> >> >> >> > > > > > error message below. Do you have an idea of what's
>> >> >> >> > > > > > causing this? Please let me know if you need further
>> >> >> >> > > > > > information. We are using Helix 0.8.2.
>> >> >> >> > > > > >
>> >> >> >> > > > > > public void sendNextJob(String jobId) {
>> >> >> >> > > > > >     putUserContent(WORKFLOW_STARTED, "TRUE", Scope.WORKFLOW);
>> >> >> >> > > > > >     if (jobId != null) {
>> >> >> >> > > > > >         putUserContent(NEXT_JOB, jobId, Scope.WORKFLOW);
>> >> >> >> > > > > >     }
>> >> >> >> > > > > > }
>> >> >> >> > > > > >
>> >> >> >> > > > > > Failed to setup environment of task TASK_55096de4-2cb6-4b09-84fd-7fdddba93435
>> >> >> >> > > > > > java.lang.NullPointerException: null
>> >> >> >> > > > > > at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:358)
>> >> >> >> > > > > > at org.apache.helix.task.TaskUtil$1.update(TaskUtil.java:356)
>> >> >> >> > > > > > at org.apache.helix.manager.zk.HelixGroupCommit.commit(HelixGroupCommit.java:126)
>> >> >> >> > > > > > at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.update(ZkCacheBaseDataAccessor.java:306)
>> >> >> >> > > > > > at org.apache.helix.store.zk.AutoFallbackPropertyStore.update(AutoFallbackPropertyStore.java:61)
>> >> >> >> > > > > > at org.apache.helix.task.TaskUtil.addWorkflowJobUserContent(TaskUtil.java:356)
>> >> >> >> > > > > > at org.apache.helix.task.UserContentStore.putUserContent(UserContentStore.java:78)
>> >> >> >> > > > > > at org.apache.airavata.helix.core.AbstractTask.sendNextJob(AbstractTask.java:136)
>> >> >> >> > > > > > at org.apache.airavata.helix.core.OutPort.invoke(OutPort.java:42)
>> >> >> >> > > > > > at org.apache.airavata.helix.core.AbstractTask.onSuccess(AbstractTask.java:123)
>> >> >> >> > > > > > at org.apache.airavata.helix.impl.task.AiravataTask.onSuccess(AiravataTask.java:97)
>> >> >> >> > > > > > at org.apache.airavata.helix.impl.task.env.EnvSetupTask.onRun(EnvSetupTask.java:52)
>> >> >> >> > > > > > at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:349)
>> >> >> >> > > > > > at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:92)
>> >> >> >> > > > > > at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
>> >> >> >> > > > > > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> >> >> >> > > > > > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> >> >> >> > > > > > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>> >> >> >> > > > > > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> >> >> >> > > > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> >> >> >> > > > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> >> >> >> > > > > > at java.lang.Thread.run(Thread.java:748)
>> >> >> >> > > > > >
>> >> >> >> > > > > > Thanks
>> >> >> >> > > > > > Dimuthu
>> >> >> >> > > > >
>> >> >> >> > > > > --
>> >> >> >> > > > > Junkai Xue
>> >> >> >> > >
>> >> >> >> > > --
>> >> >> >> > > Junkai Xue
>> >> >> >>
>> >> >> >> --
>> >> >> >> Junkai Xue
>> >> >
>> >> > --
>> >> > Junkai Xue
>> >
>> > --
>> > Junkai Xue
>
> --
> Junkai Xue