No problem. If you have further questions, let us know what kind of load you're putting on Helix as well. The newest version of Helix contains Task Framework 2.0, and has greater scalability in scheduling tasks, so you might want to consider using the newest version as well.
Hunter

On Fri, Mar 22, 2019 at 8:59 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi Lee,
>
> Thanks for the trick. I didn't know that we can poke the controller like
> that :) However, we can now see that tasks are moving smoothly in our
> staging setup. This behavior is seen from time to time and gets resolved
> automatically in a few hours. I can't find a particular pattern; however, my
> best guess is that this happens when the load is high. I will put some load
> on the testing setup, see if I can reproduce this issue, try your
> instructions, and then get back to you.
>
> Thanks
> Dimuthu
>
> On Thu, Mar 21, 2019 at 5:27 PM Hunter Lee <naren...@gmail.com> wrote:
>
> > Hi Dimuthu,
> >
> > What Junkai meant by touching the IdealState is this:
> >
> > 1) Use Zooinspector to log into ZK
> > 2) Locate the IDEALSTATES/ path
> > 3) Grab any ZNode under that path, modify it (just add a
> > whitespace), and save
> > 4) This will trigger a ZK callback, which should tell the Helix Controller to
> > rebalance/schedule things
> >
> > On Thu, Mar 21, 2019 at 11:30 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >
> >> Hi Junkai,
> >>
> >> What do you mean by touching ideal state to trigger an event? I didn't
> >> quite get what you said. Is that like creating some path in zookeeper?
> >> Workflows are eventually scheduled but the problem is, it is very slow due
> >> to that 30s freeze.
> >>
> >> Thanks
> >> Dimuthu
> >>
> >> On Thu, Mar 21, 2019 at 2:26 PM Xue Junkai <junkai....@gmail.com> wrote:
> >>
> >> > Can you try one thing? Touch the ideal state to trigger an event. If
> >> > workflows are not scheduled, it should mean scheduling has a problem.
> >> >
> >> > Best,
> >> >
> >> > Junkai
> >> >
> >> > On Wed, Mar 20, 2019 at 10:31 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >> >
> >> >> Hi Junkai,
> >> >>
> >> >> We are using 0.8.1
> >> >>
> >> >> Dimuthu
> >> >>
> >> >> On Thu, Mar 21, 2019 at 12:14 AM Xue Junkai <junkai....@gmail.com> wrote:
> >> >>
> >> >> > Hi Dimuthu,
> >> >> >
> >> >> > What's the version of Helix you are using?
> >> >> >
> >> >> > Best,
> >> >> >
> >> >> > Junkai
> >> >> >
> >> >> > On Wed, Mar 20, 2019 at 8:54 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >> >> >
> >> >> > > Hi Helix Dev,
> >> >> > >
> >> >> > > We are again seeing this delay in task execution. Please have a look at the
> >> >> > > screencast [1] of logs printed in participant (top shell) and controller
> >> >> > > (bottom shell). When I recorded this, there were about 90 - 100 workflows
> >> >> > > pending to be executed. As you can see, some tasks were suddenly executed
> >> >> > > and then the participant froze for about 30 seconds before executing the
> >> >> > > next set of tasks. I can see some WARN logs in the controller log. I feel
> >> >> > > like this 30 second delay is some sort of a pattern. What do you think is
> >> >> > > the reason for this? I can provide you more information by turning on
> >> >> > > verbose logs on controller if you want.
> >> >> > >
> >> >> > > [1] https://youtu.be/3EUdSxnIxVw
> >> >> > >
> >> >> > > Thanks
> >> >> > > Dimuthu
> >> >> > >
> >> >> > > On Thu, Oct 4, 2018 at 4:46 PM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >> >> > >
> >> >> > > > Hi Junkai,
> >> >> > > >
> >> >> > > > I'm CCing Airavata dev list as this is directly related to the project.
> >> >> > > >
> >> >> > > > I just went through the zookeeper paths like /<Cluster Name>/EXTERNALVIEW and
> >> >> > > > /<Cluster Name>/CONFIGS/RESOURCE, as I have noticed that the helix controller
> >> >> > > > is periodically monitoring the children of those paths even though all the
> >> >> > > > Workflows have moved into a saturated state like COMPLETED and STOPPED. In
> >> >> > > > our case, we have a lot of completed workflows piled up in those paths. I
> >> >> > > > believe that helix is clearing up those resources after some TTL. What I
> >> >> > > > did was write an external spectator [1] that continuously monitors for
> >> >> > > > saturated workflows and clears up resources before the controller does that
> >> >> > > > after a TTL. After that, we didn't see such delays in workflow execution
> >> >> > > > and everything seems to be running smoothly. However, we are continuously
> >> >> > > > monitoring our deployments for any form of adverse effect introduced by
> >> >> > > > that improvement.
> >> >> > > >
> >> >> > > > Please let us know if we are doing something wrong in this improvement, or
> >> >> > > > if there is any better way to achieve this directly through the helix task
> >> >> > > > framework.
> >> >> > > >
> >> >> > > > [1]
> >> >> > > > https://github.com/apache/airavata/blob/staging/modules/airavata-helix/helix-spectator/src/main/java/org/apache/airavata/helix/impl/controller/WorkflowCleanupAgent.java
> >> >> > > >
> >> >> > > > Thanks
> >> >> > > > Dimuthu
> >> >> > > >
> >> >> > > > On Tue, Oct 2, 2018 at 1:12 PM Xue Junkai <junkai....@gmail.com> wrote:
> >> >> > > >
> >> >> > > >> Could you please check the log for how long each pipeline stage takes?
> >> >> > > >>
> >> >> > > >> Also, did you set expiry for workflows? Are they piled up for a long time?
> >> >> > > >> How long does each workflow take to complete?
> >> >> > > >>
> >> >> > > >> best,
> >> >> > > >>
> >> >> > > >> Junkai
> >> >> > > >>
> >> >> > > >> On Wed, Sep 26, 2018 at 8:52 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >> >> > > >>
> >> >> > > >> > Hi Junkai,
> >> >> > > >> >
> >> >> > > >> > Average load is like 10 - 20 workflows per minute. In some cases it's less
> >> >> > > >> > than that. However, based on the observations, I feel like it does not
> >> >> > > >> > depend on the load and it is sporadic. Are there particular log lines that
> >> >> > > >> > I can filter in controller and participant to capture the timeline of a
> >> >> > > >> > workflow so that I can figure out which component is malfunctioning? We use
> >> >> > > >> > helix v 0.8.1.
> >> >> > > >> >
> >> >> > > >> > Thanks
> >> >> > > >> > Dimuthu
> >> >> > > >> >
> >> >> > > >> > On Tue, Sep 25, 2018 at 5:19 PM Xue Junkai <junkai....@gmail.com> wrote:
> >> >> > > >> >
> >> >> > > >> > > Hi Dimuthu,
> >> >> > > >> > >
> >> >> > > >> > > At what rate are you submitting workflows? Usually, Workflow
> >> >> > > >> > > scheduling is very fast. And which version of Helix are you using?
> >> >> > > >> > >
> >> >> > > >> > > Best,
> >> >> > > >> > >
> >> >> > > >> > > Junkai
> >> >> > > >> > >
> >> >> > > >> > > On Tue, Sep 25, 2018 at 8:58 AM DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >> >> > > >> > >
> >> >> > > >> > > > Hi Folks,
> >> >> > > >> > > >
> >> >> > > >> > > > We have noticed some delays between workflow submission and actual
> >> >> > > >> > > > picking up by participants, and it seems like that delay is somewhat
> >> >> > > >> > > > constant, around 2-3 minutes. We used to continuously submit workflows
> >> >> > > >> > > > and after 2-3 minutes, a bulk of workflows are picked up by the
> >> >> > > >> > > > participant and executed. Then it remains silent for the next 2-3
> >> >> > > >> > > > minutes even if we submit more workflows. It's like the participant is
> >> >> > > >> > > > picking up workflows in discrete time intervals. I'm not sure whether
> >> >> > > >> > > > this is an issue of the controller or the participant. Do you have
> >> >> > > >> > > > any experience with this sort of behavior?
> >> >> > > >> > > >
> >> >> > > >> > > > Thanks
> >> >> > > >> > > > Dimuthu
> >> >> > > >> > >
> >> >> > > >> > > --
> >> >> > > >> > > Junkai Xue
> >> >> > > >>
> >> >> > > >> --
> >> >> > > >> Junkai Xue
> >> >> >
> >> >> > --
> >> >> > Junkai Xue
> >> >
> >> > --
> >> > Junkai Xue
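
For readers following the quoted thread: the cleanup approach described in the Oct 4, 2018 message (a spectator that finds workflows in a finished state such as COMPLETED or STOPPED and removes them before the controller's TTL does) can be sketched roughly as below. This is only a sketch of the selection step under stated assumptions. It is not the code of the linked WorkflowCleanupAgent and it calls no Helix API; the class `WorkflowCleanupSketch`, the `State` enum, the `WorkflowInfo` holder, and the grace period are all hypothetical names introduced for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: picks finished workflows that a cleanup agent could delete. */
class WorkflowCleanupSketch {

    /** Hypothetical states; the thread names COMPLETED and STOPPED as the finished ones. */
    enum State { IN_PROGRESS, COMPLETED, STOPPED }

    /** Hypothetical snapshot of one workflow, as a spectator might record it. */
    static final class WorkflowInfo {
        final State state;
        final Instant finishedAt; // null while the workflow is still running

        WorkflowInfo(State state, Instant finishedAt) {
            this.state = state;
            this.finishedAt = finishedAt;
        }
    }

    private static final Set<State> FINISHED = EnumSet.of(State.COMPLETED, State.STOPPED);

    /**
     * Returns the sorted names of workflows that are finished and whose finish
     * time lies at least {@code grace} before {@code now}.
     */
    static List<String> selectForCleanup(Map<String, WorkflowInfo> workflows,
                                         Instant now, Duration grace) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, WorkflowInfo> entry : workflows.entrySet()) {
            WorkflowInfo info = entry.getValue();
            boolean finished = FINISHED.contains(info.state) && info.finishedAt != null;
            if (finished && !info.finishedAt.plus(grace).isAfter(now)) {
                selected.add(entry.getKey());
            }
        }
        Collections.sort(selected);
        return selected;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2018-10-04T16:00:00Z");
        Map<String, WorkflowInfo> workflows = new LinkedHashMap<>();
        workflows.put("wf-old-completed", new WorkflowInfo(State.COMPLETED, now.minusSeconds(600)));
        workflows.put("wf-fresh-stopped", new WorkflowInfo(State.STOPPED, now.minusSeconds(10)));
        workflows.put("wf-running", new WorkflowInfo(State.IN_PROGRESS, null));

        // A real agent would hand each selected name to the framework's delete call here.
        for (String name : selectForCleanup(workflows, now, Duration.ofMinutes(5))) {
            System.out.println("would delete " + name);
        }
    }
}
```

The grace period is an assumption added here so that a workflow is not removed the instant it finishes; the thread itself does not say whether the agent waits. With the sample data in `main`, only the workflow that finished ten minutes ago is selected.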