GitHub user narendly opened a pull request:

    https://github.com/apache/helix/pull/243

    Add ThreadCountBasedAssignmentCalculator and integrate with 
Workflow/JobRebalancer and fix rebalancing logic

    For quota-based scheduling of tasks, we have added the TaskAssigner 
interface, which takes AssignableInstances into account by way of 
AssignableInstanceManager. To use it in the existing pipeline prior to 
Task Framework 2.0, GenericTaskAssignmentCalculator was replaced with 
ThreadCountBasedAssignmentCalculator, a wrapper around TaskAssigner. 
Workflow/JobRebalancer was adjusted for this replacement, and its rebalance 
logic was reviewed and fixed. Additionally, TestQuotaBasedScheduling was 
added to test quota-based task scheduling. Note that quotas will apply to 
both generic and targeted jobs.
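
To make the idea of thread-count-based quota assignment concrete, here is a minimal sketch. It is not the code in this pull request: the class QuotaAssignmentSketch, its InstanceCapacity helper, and the "DEFAULT" quota type name are all hypothetical and only illustrate the general technique of reserving a free thread per quota type before placing a task on an instance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these names are illustrative and are not Helix classes.
class QuotaAssignmentSketch {

  // Tracks how many task threads one instance still has free per quota type.
  static final class InstanceCapacity {
    final String name;
    final Map<String, Integer> freeThreads = new LinkedHashMap<>();

    InstanceCapacity(String name, Map<String, Integer> threadsPerQuotaType) {
      this.name = name;
      this.freeThreads.putAll(threadsPerQuotaType);
    }

    // Reserves one thread for the quota type; returns false when none is left.
    boolean tryAssign(String quotaType) {
      int free = freeThreads.getOrDefault(quotaType, 0);
      if (free <= 0) {
        return false;
      }
      freeThreads.put(quotaType, free - 1);
      return true;
    }

    // Returns one thread to the pool, e.g. when a task finishes.
    void release(String quotaType) {
      freeThreads.merge(quotaType, 1, Integer::sum);
    }
  }

  // Assigns each task to the first instance with a free thread for the quota type.
  // Tasks that do not fit stay unassigned and are absent from the result.
  static Map<String, String> assign(List<String> tasks, String quotaType,
      List<InstanceCapacity> instances) {
    Map<String, String> assignment = new LinkedHashMap<>();
    for (String task : tasks) {
      for (InstanceCapacity instance : instances) {
        if (instance.tryAssign(quotaType)) {
          assignment.put(task, instance.name);
          break;
        }
      }
    }
    return assignment;
  }

  public static void main(String[] args) {
    List<InstanceCapacity> instances = new ArrayList<>();
    instances.add(new InstanceCapacity("instance_0", Map.of("DEFAULT", 1)));
    instances.add(new InstanceCapacity("instance_1", Map.of("DEFAULT", 1)));
    // Three tasks but only two free threads: the third task is left unassigned.
    System.out.println(assign(List.of("task_0", "task_1", "task_2"), "DEFAULT", instances));
  }
}
```

In this sketch a task that finds no free thread is simply skipped, so a later pass can pick it up once release() has returned capacity.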
    
    A few bugs were uncovered during this process, such as faulty retry 
logic that never actually restarted tasks. For more details, see the 
changelist below:
    
    Changelist:
        1. Add ThreadCountBasedAssignmentCalculator, a wrapper around 
ThreadCountBasedTaskAssigner
        2. Make logic changes in JobRebalancer to enable the use of 
ThreadCountBasedAssignmentCalculator
        3. Fix the failing test by using a thread-safe map and rename 
TestGenericTaskAssignmentCalculator to TestTaskAssignmentCalculator to better 
reflect what its tests are doing
        4. Add retry logic that was previously absent for INIT and DROPPED 
tasks in JobRebalancer
        5. Add TestQuotaBasedScheduling to test that jobs and tasks are 
assigned and scheduled per the quota config set in ClusterConfig
        6. Add more log messages to aid with task-scheduling debugging in 
AssignableInstance
        7. In AbstractTaskDispatcher, implement retry logic for tasks in the 
STOPPED, TIMED_OUT, or TASK_ERROR state so that they are restarted correctly
        8. In AbstractTaskDispatcher, when enforcing overlapAssign for jobs 
with isAllowOverlapAssignment(), a fix was implemented so that only jobs whose 
state is IN_PROGRESS are considered
        9. In AbstractTaskDispatcher, isWorkflowFinished() was modified so 
that non-active jobs have their tasks' resources freed from 
AssignableInstances to prevent resource leaks
       10. In markJobFailed() and markJobCompleted(), non-active jobs have 
their tasks' resources freed from AssignableInstances to prevent resource leaks
       11. Fix the logic so that quotas do not apply to targeted jobs
       12. Fix TestTaskRebalancer (assumes Consistent Hashing, which is no 
longer used)
       13. Fix TestIndependentTaskRebalancer (assumes Consistent Hashing, no 
longer used)
       14. Assignment logic was improved so that incomplete tasks whose 
assigned participants are no longer live will be re-assigned accordingly
       15. Fix TestTaskRebalanceFailover (tasks on non-live instances will be 
re-assigned promptly)
       16. Fix TestRebalanceRunningTask (targeted jobs will get tasks 
reassigned upon liveInstance and currentState change)
       17. Fix a bug in FixedAssignmentCalculator and assignment logic for 
targeted jobs such that a task index will no longer be assigned multiple times
       18. Fix TestJobFailureTaskNotStarted (tasks were not being assigned at 
all due to having reached maximum capacity for quota)
       19. Add targetedTaskConfigMap field in JobConfig to cache TaskConfig 
objects for targeted tasks to reduce object creation and GC overhead
       20. Fix JobConfig so that it doesn't write quotaType to ZooKeeper when 
quotaType is null or not set
       21. Fix deleteWorkflow() in TaskUtil so that the first delete failure 
fails the entire method, which returns early so that an incomplete deletion 
does not leave other ZNodes broken
       22. Fix TestDeleteWorkflow by adding another removeProperty() clause to 
lower failure rate
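
The fail-fast behavior described in item 21 can be sketched as follows. This is an illustration only, not the actual deleteWorkflow() in TaskUtil: FailFastDeleteSketch, deleteAll(), the remover callback, and the paths are hypothetical stand-ins for whatever calls remove the individual ZNodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; the real deleteWorkflow() in TaskUtil is not reproduced here.
class FailFastDeleteSketch {

  // Removes paths in order and stops at the first failure.
  // remover stands in for a call that deletes one ZNode and reports success.
  static boolean deleteAll(List<String> paths, Predicate<String> remover) {
    for (String path : paths) {
      if (!remover.test(path)) {
        // Return early so later paths are not touched after a failed step.
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> attempted = new ArrayList<>();
    boolean ok = deleteAll(
        List.of("/hypothetical/idealstate", "/hypothetical/config", "/hypothetical/context"),
        path -> {
          attempted.add(path);
          return !path.endsWith("config"); // simulate a failure on the second path
        });
    System.out.println(ok + " " + attempted);
  }
}
```

Because the loop returns on the first failure, the caller sees a single failed result and can retry the whole deletion later.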

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/narendly/helix 1324062

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/helix/pull/243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #243
    
----
commit eb8e1d8d2560b28b6b1cb120c4a54e3f70356a3e
Author: Hunter Lee <narendly@...>
Date:   2018-07-13T21:45:41Z

    [HELIX-730] Add ThreadCountBasedAssignmentCalculator and integrate with 
Workflow/JobRebalancer and fix rebalancing logic
    

----

