Hunter L created HELIX-730:
------------------------------

             Summary: [TASK] Add ThreadCountBasedAssignmentCalculator and 
integrate with Workflow/JobRebalancer and fix rebalancing logic
                 Key: HELIX-730
                 URL: https://issues.apache.org/jira/browse/HELIX-730
             Project: Apache Helix
          Issue Type: Improvement
            Reporter: Hunter L


For quota-based scheduling of tasks, we have added the TaskAssigner interface, 
which takes AssignableInstances into account by way of 
AssignableInstanceManager. To use it in the existing pipeline prior to Task 
Framework 2.0, GenericTaskAssignmentCalculator was replaced with 
ThreadCountBasedAssignmentCalculator, a wrapper around TaskAssigner. The 
necessary adjustments were made in Workflow/JobRebalancer for this 
replacement, and the rebalance logic in Workflow/JobRebalancer was reviewed 
and fixed. Additionally, TestQuotaBasedScheduling was added to test 
quota-based task scheduling. Note that quotas apply to both generic and 
targeted jobs.
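
To illustrate the idea of thread-count-based quota assignment described above, 
here is a minimal, self-contained sketch. It is not the Helix implementation: 
the class QuotaAssignSketch, the method assign(), and the "most free threads 
first" selection rule are assumptions made for illustration only. Each 
instance exposes a number of free threads for one quota type, and a task is 
assigned only while some instance still has capacity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are Helix classes or methods.
class QuotaAssignSketch {
    /**
     * Assigns tasks to instances for a single quota type.
     * freeThreads maps instance name -> free thread count (assumed input).
     * Returns task id -> instance name; tasks that do not fit are left out.
     */
    static Map<String, String> assign(List<String> taskIds,
                                      Map<String, Integer> freeThreads) {
        // Copy so the caller's view of capacity is not mutated.
        Map<String, Integer> remaining = new LinkedHashMap<>(freeThreads);
        Map<String, String> assignment = new LinkedHashMap<>();
        for (String task : taskIds) {
            // Assumed policy: pick the instance with the most free threads;
            // ties go to the instance seen first.
            String best = null;
            for (Map.Entry<String, Integer> e : remaining.entrySet()) {
                if (e.getValue() > 0
                        && (best == null || e.getValue() > remaining.get(best))) {
                    best = e.getKey();
                }
            }
            if (best == null) {
                break; // quota exhausted: remaining tasks stay unassigned
            }
            assignment.put(task, best);
            remaining.merge(best, -1, Integer::sum); // consume one thread
        }
        return assignment;
    }
}
```

With two free threads on one instance and one on another, three of four tasks 
would be assigned and the fourth would wait until capacity is released.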

A few bugs were uncovered during this process, such as faulty retry logic 
that never actually restarted tasks. For more details, see the changelist 
below:

Changelist:
    1. Add ThreadCountBasedAssignmentCalculator, a wrapper around 
ThreadCountBasedTaskAssigner
    2. Make logic changes in JobRebalancer to enable the use of 
ThreadCountBasedAssignmentCalculator
    3. Fix the failing test by using a thread-safe map and rename 
TestGenericTaskAssignmentCalculator to TestTaskAssignmentCalculator to better 
reflect what its tests are doing
    4. Add retry logic that was previously absent for INIT and DROPPED tasks in 
JobRebalancer
    5. Add TestQuotaBasedScheduling to test that jobs and tasks are assigned 
and scheduled per the quota config set in ClusterConfig
    6. Add more log messages to aid with task-scheduling debugging in 
AssignableInstance
    7. In AbstractTaskDispatcher, implement retry logic for tasks that are 
STOPPED, TIMED_OUT, or TASK_ERROR so that they are restarted correctly
    8. In AbstractTaskDispatcher, when enforcing overlapAssign for jobs with 
isAllowOverlapAssignment(), a fix was implemented so that only jobs whose state 
is IN_PROGRESS are considered
    9. In AbstractTaskDispatcher, the isWorkflowFinished() method was modified 
so that non-active jobs have their tasks' resources freed from 
AssignableInstances to prevent resource leaks
   10. In markJobFailed() and markJobCompleted(), non-active jobs now have 
their tasks' resources freed from AssignableInstances to prevent resource leaks
   11. Fix the logic so that quotas also apply to targeted jobs
   12. Fix TestTaskRebalancer (assumes Consistent Hashing, which is no longer 
used)
   13. Fix TestIndependentTaskRebalancer (assumes Consistent Hashing, no longer 
used)
   14. Assignment logic was improved so that incomplete tasks whose assigned 
participants are no longer live will be re-assigned accordingly
   15. Fix TestTaskRebalanceFailover (tasks on non-live instances will be 
re-assigned promptly)
   16. Fix TestRebalanceRunningTask (targeted jobs will get tasks reassigned 
upon liveInstance and currentState change)
   17. Fix a bug in FixedAssignmentCalculator and assignment logic for targeted 
jobs such that a task index will no longer be assigned multiple times
   18. Fix TestJobFailureTaskNotStarted (tasks were not being assigned at all 
due to having reached maximum capacity for quota)
   19. Add targetedTaskConfigMap field in JobConfig to cache TaskConfig objects 
for targeted tasks to reduce object creation and GC overhead
   20. Fix JobConfig so that it doesn't write quotaType to ZooKeeper when 
quotaType is null or not set
   21. Fix deleteWorkflow() in TaskUtil so that the first delete failure 
causes the entire method to fail (returning early so that an incomplete 
deletion does not break other ZNodes)
   22. Fix TestDeleteWorkflow by adding another removeProperty() clause to 
lower failure rate
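
The fail-fast deletion described in item 21 can be sketched as follows. This 
is only an illustration under stated assumptions, not the TaskUtil code: the 
class DeleteSketch, the method deleteAll(), and the remover callback are 
hypothetical stand-ins for whatever actually removes each ZNode.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; DeleteSketch and deleteAll are not Helix APIs.
class DeleteSketch {
    /**
     * Removes the given paths in order using the supplied remover, which
     * returns true on success. Stops at the first failure so that later
     * paths are left untouched, and records what was actually removed.
     */
    static boolean deleteAll(List<String> paths,
                             Predicate<String> remover,
                             List<String> removed) {
        for (String path : paths) {
            if (!remover.test(path)) {
                return false; // first failure fails the whole operation
            }
            removed.add(path);
        }
        return true;
    }
}
```

Returning at the first failure keeps the remaining nodes intact, so a caller 
can retry the whole deletion later instead of working around a half-deleted 
state.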



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
