[ https://issues.apache.org/jira/browse/HELIX-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298818#comment-16298818 ]
Hudson commented on HELIX-655:
------------------------------

FAILURE: Integrated in Jenkins build helix #1383 (See [https://builds.apache.org/job/helix/1383/])
[HELIX-655] Helix per-participant concurrent task throttling (jxue: rev 63553f3f64efc60dfa0c28e317977bfb66ea0813)
* (edit) helix-core/src/test/java/org/apache/helix/integration/task/TestJobFailureTaskNotStarted.java
* (edit) helix-core/src/main/java/org/apache/helix/model/ClusterConfig.java
* (add) helix-core/src/test/java/org/apache/helix/integration/task/TestTaskThrottling.java
* (edit) helix-core/src/test/java/org/apache/helix/integration/task/TestJobTimeoutTaskNotStarted.java
* (edit) helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java
* (edit) helix-core/src/main/java/org/apache/helix/controller/stages/ClusterDataCache.java
* (edit) helix-core/src/main/java/org/apache/helix/controller/stages/CurrentStateOutput.java
* (edit) helix-core/src/main/java/org/apache/helix/model/InstanceConfig.java
* (edit) helix-core/src/main/java/org/apache/helix/task/JobRebalancer.java

> Helix per-participant concurrent task throttling
> ------------------------------------------------
>
> Key: HELIX-655
> URL: https://issues.apache.org/jira/browse/HELIX-655
> Project: Apache Helix
> Issue Type: New Feature
> Components: helix-core
> Affects Versions: 0.6.x
> Reporter: Jiajun Wang
> Assignee: Junkai Xue
>
> h1. Overview
> Currently, all runnable jobs/tasks in Helix are treated equally. They are all
> scheduled according to the rebalancer algorithm. This means their assignments
> might differ, but they will all be in the RUNNING state.
> This may cause an issue if there are too many concurrently runnable jobs.
> When the Helix controller starts all these jobs, the instances may be
> overloaded as they assign resources and execute all the tasks. As a result,
> the jobs won't be able to finish within a reasonable time window.
> The issue is even more critical for long-running jobs.
> According to our meeting with the Gobblin team, when a job is scheduled, they
> allocate resources for the job. So in the situation described above, more and
> more resources will be reserved for the pending jobs. The cluster will soon be
> exhausted.
> To avoid the problem, an application needs to schedule jobs at a
> relatively low frequency (what Gobblin is doing now). This may cause low
> utilization.
> A better way to fix this issue, at the framework level, is to throttle the
> jobs/tasks that are running concurrently, and to allow setting priorities for
> different jobs to control total execution time.
> So given the same number of jobs, the cluster is in a better condition. As a
> result, jobs running in that cluster have a more controllable execution time.
> Existing related control mechanisms are:
> * ConcurrentTasksPerInstance for each job
> * ParallelJobs for each workflow
> * Threadpool limitation on the participant if the user customizes
> TaskStateModelFactory.
> But none of them can directly help when the number of concurrent workflows or
> jobs is large. If an application keeps scheduling jobs/jobQueues, Helix will
> start any runnable jobs without considering the workload on the participants.
> The application may be able to carefully configure these items to achieve
> the goal. But it won't be easy to find the sweet spot, especially since
> the cluster might be changing (scaling out, etc.).
> h2. Problem summary
> # All runnable tasks will start executing, which may overload the participant.
> # The application needs a mechanism to prioritize important jobs (or workflows).
> Otherwise, important tasks may be blocked by less important ones, and
> allocated resources are wasted.
> h2. Feature proposed
> Based on our discussion, we propose 2 features that can help resolve the
> issue.
> # Running task throttling on each participant. This is for avoiding overload.
> # Job priority control that ensures high priority jobs are scheduled earlier.
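
The two proposed features can be illustrated together with a minimal sketch in plain Java. It does not use any Helix API and is not the code from the HELIX-655 change. Every name in it (ThrottleSketch, PendingTask, selectTasksToStart, maxTasksPerParticipant) is hypothetical, and so is the convention that a lower priority number means a more important task. The sketch assumes each pending task already has a target participant, and it starts tasks in priority order until that participant reaches its limit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these types exist in Helix. */
class ThrottleSketch {

    /** A task waiting to start on a given participant. */
    static final class PendingTask {
        final String id;
        final String participant;
        final int priority; // assumption: lower value = more important

        PendingTask(String id, String participant, int priority) {
            this.id = id;
            this.participant = participant;
            this.priority = priority;
        }
    }

    /**
     * Returns the ids of the tasks that may start now, in priority order,
     * so that no participant runs more than maxTasksPerParticipant tasks.
     *
     * @param pending tasks that are runnable but not yet started
     * @param running number of tasks already running on each participant
     * @param maxTasksPerParticipant assumed per-participant limit
     */
    static List<String> selectTasksToStart(List<PendingTask> pending,
                                           Map<String, Integer> running,
                                           int maxTasksPerParticipant) {
        // Copy the inputs so the caller's collections are not modified.
        List<PendingTask> sorted = new ArrayList<>(pending);
        sorted.sort(Comparator.comparingInt(t -> t.priority));
        Map<String, Integer> load = new HashMap<>(running);

        List<String> started = new ArrayList<>();
        for (PendingTask task : sorted) {
            int current = load.getOrDefault(task.participant, 0);
            if (current < maxTasksPerParticipant) {
                // There is room on this participant: start the task.
                load.put(task.participant, current + 1);
                started.add(task.id);
            }
            // Otherwise the task stays pending until a slot frees up.
        }
        return started;
    }
}
```

With a limit of 2 and one task already running on a participant, only the most important pending task for that participant is started. The rest wait, which is the overload protection described in the first proposed feature.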
> In addition, applications can leverage workflow/job monitor items as feedback
> from Helix to adjust their strategy.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)