GitHub user narendly opened a pull request:
https://github.com/apache/helix/pull/243
Add ThreadCountBasedAssignmentCalculator and integrate with
Workflow/JobRebalancer and fix rebalancing logic
For quota-based scheduling of tasks, we have added the TaskAssigner
interface, which takes AssignableInstances into account by way of
AssignableInstanceManager. To use it in the existing pipeline prior to
Task Framework 2.0, GenericTaskAssignmentCalculator was replaced with
ThreadCountBasedAssignmentCalculator, a wrapper around TaskAssigner, and the
necessary adjustments were made in Workflow/JobRebalancer. The rebalance
logic in Workflow/JobRebalancer was also reviewed and fixed, and
TestQuotaBasedScheduling was added to test quota-based task scheduling. Note
that quotas will apply to both generic and targeted jobs.
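As a rough illustration of the quota idea described above, the sketch below assigns a task to an instance only while that instance still has a free thread for the task's quota type. Every name in it (QuotaSketch, Instance, tryAssign, release, assign) is hypothetical and is not the Helix API; the actual logic lives in ThreadCountBasedTaskAssigner and AssignableInstance and may differ in how it picks instances.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Helix API: thread-count quotas per instance.
class QuotaSketch {
  static class Instance {
    final String name;
    final Map<String, Integer> capacity = new HashMap<>(); // quota type -> max threads
    final Map<String, Integer> used = new HashMap<>();     // quota type -> threads in use

    Instance(String name) {
      this.name = name;
    }

    // Reserves one thread for the quota type, or returns false if none is free.
    boolean tryAssign(String quotaType) {
      int cap = capacity.getOrDefault(quotaType, 0);
      int inUse = used.getOrDefault(quotaType, 0);
      if (inUse >= cap) {
        return false;
      }
      used.put(quotaType, inUse + 1);
      return true;
    }

    // Returns one thread to the pool, e.g. when a task finishes or its job goes inactive.
    void release(String quotaType) {
      used.computeIfPresent(quotaType, (k, v) -> v > 0 ? v - 1 : 0);
    }
  }

  // Assigns each task to the first instance with a free thread for the quota type.
  // Tasks that do not fit are left out of the result and stay unassigned.
  static Map<String, String> assign(List<String> tasks, String quotaType,
      List<Instance> instances) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String task : tasks) {
      for (Instance inst : instances) {
        if (inst.tryAssign(quotaType)) {
          result.put(task, inst.name);
          break;
        }
      }
    }
    return result;
  }
}
```

In this sketch a task that finds no free thread is simply skipped for the current round; releasing a thread makes room for it on a later pass.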
A few bugs were uncovered during this process, such as faulty retry
logic that never actually restarted tasks. For more details, see the
changelist below:
Changelist:
1. Add ThreadCountBasedAssignmentCalculator, a wrapper around
ThreadCountBasedTaskAssigner
2. Make logic changes in JobRebalancer to enable the use of
ThreadCountBasedAssignmentCalculator
3. Fix the failing test by using a thread-safe map and rename
TestGenericTaskAssignmentCalculator to TestTaskAssignmentCalculator to better
reflect what its tests are doing
4. Add retry logic that was previously absent for INIT and DROPPED
tasks in JobRebalancer
5. Add TestQuotaBasedScheduling to test that jobs and tasks are
assigned and scheduled per the quota config set in ClusterConfig
6. Add more log messages to aid with task-scheduling debugging in
AssignableInstance
7. In AbstractTaskDispatcher, implement retry logic for tasks that are
STOPPED, TIMED_OUT, or TASK_ERROR so that they are restarted correctly
8. In AbstractTaskDispatcher, when enforcing overlapAssign for jobs
with isAllowOverlapAssignment(), consider only jobs whose state is
IN_PROGRESS
9. In AbstractTaskDispatcher, modify isWorkflowFinished() so that
non-active jobs have their tasks' resources freed from AssignableInstances
to prevent resource leaks
10. In markJobFailed() and markJobCompleted(), free the tasks' resources
of non-active jobs from AssignableInstances to prevent resource leaks
11. Fix the logic so that quotas do not apply to targeted jobs
12. Fix TestTaskRebalancer (assumes Consistent Hashing, which is no
longer used)
13. Fix TestIndependentTaskRebalancer (assumes Consistent Hashing, no
longer used)
14. Improve assignment logic so that incomplete tasks whose assigned
participants are no longer live are re-assigned
15. Fix TestTaskRebalanceFailover (tasks on non-live instances will be
re-assigned promptly)
16. Fix TestRebalanceRunningTask (targeted jobs will get tasks
reassigned upon liveInstance and currentState change)
17. Fix a bug in FixedAssignmentCalculator and assignment logic for
targeted jobs such that a task index will no longer be assigned multiple times
18. Fix TestJobFailureTaskNotStarted (tasks were not being assigned at
all due to having reached maximum capacity for quota)
19. Add targetedTaskConfigMap field in JobConfig to cache TaskConfig
objects for targeted tasks to reduce object creation and GC overhead
20. Fix JobConfig so that it doesn't write quotaType to ZooKeeper when
quotaType is null or not set
21. Fix deleteWorkflow() in TaskUtil so that the first delete failure
fails the entire method (it returns early so that an incomplete deletion
does not break other ZNodes)
22. Fix TestDeleteWorkflow by adding another removeProperty() clause to
lower its failure rate
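The fail-fast behavior described for deleteWorkflow() in the changelist can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in TaskUtil: FailFastDelete, deleteAll, and the remover callback are hypothetical names, and the callback merely stands in for whatever call removes a single ZNode.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the Helix TaskUtil code: stop at the first failed delete.
class FailFastDelete {
  // Tries to remove each path in order. On the first failure it returns false
  // immediately, leaving the remaining paths untouched. Successfully removed
  // paths are recorded in `removed` so the caller can see how far it got.
  static boolean deleteAll(List<String> paths, Predicate<String> remover,
      List<String> removed) {
    for (String path : paths) {
      if (!remover.test(path)) {
        return false;
      }
      removed.add(path);
    }
    return true;
  }
}
```

Returning on the first failure keeps later nodes intact, so a retry starts from a consistent state rather than from a partially deleted workflow.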
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/narendly/helix 1324062
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/helix/pull/243.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #243
----
commit eb8e1d8d2560b28b6b1cb120c4a54e3f70356a3e
Author: Hunter Lee <narendly@...>
Date: 2018-07-13T21:45:41Z
[HELIX-730] Add ThreadCountBasedAssignmentCalculator and integrate with
Workflow/JobRebalancer and fix rebalancing logic
----
---