GitHub user narendly opened a pull request:
https://github.com/apache/helix/pull/275
PR
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/narendly/helix master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/helix/pull/275.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #275
----
commit e7b960c22896c08337292d20f674f20a7f1391d0
Author: Hunter Lee <hulee@...>
Date: 2018-10-27T01:32:16Z
[HELIX-762] TASK: Change LOG mode from info to debug
In production, some users were running thousands of tasks. Because
AssignableInstance writes one log line for each task assigned or released,
the volume of log output at info level was excessive.
Changelist:
1. Change the logging mode from info to debug in AssignableInstance and
AssignableInstanceManager
commit e492d9f663d8edad0f344208cc8affc6828708a3
Author: Hunter Lee <hulee@...>
Date: 2018-10-27T01:49:52Z
[HELIX-763] TASK: Ignore tasks whose workflow and job are inactive
Manual testing showed that some tasks remained in INIT or RUNNING state and
kept occupying a thread count even though their parent job or workflow was in
an inactive state (terminal or stopped). This happened when the capacities
were rebuilt from scratch, and it could have caused a thread leak.
Changelist:
1. Add a check in buildAssignableInstances() so that it ignores workflows
and jobs whose states are inactive states (that is, their tasks cannot be
occupying a thread on Participants)
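The check described above can be sketched roughly as follows. This is only an
illustration, not the code in this pull request: ParentState, TaskRunState,
TaskInfo and CapacityRebuildSketch are hypothetical stand-ins, and the real
buildAssignableInstances() works on Helix's own workflow, job and task context
types.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for workflow/job states and task states.
enum ParentState { IN_PROGRESS, STOPPED, COMPLETED, FAILED }

enum TaskRunState { INIT, RUNNING, COMPLETED }

// Hypothetical view of one task together with the states of its parents.
final class TaskInfo {
  final ParentState workflowState;
  final ParentState jobState;
  final TaskRunState taskState;

  TaskInfo(ParentState workflowState, ParentState jobState, TaskRunState taskState) {
    this.workflowState = workflowState;
    this.jobState = jobState;
    this.taskState = taskState;
  }
}

final class CapacityRebuildSketch {
  // Assumption: "inactive" means stopped or terminal, as in the commit message.
  private static final Set<ParentState> INACTIVE =
      EnumSet.of(ParentState.STOPPED, ParentState.COMPLETED, ParentState.FAILED);

  static boolean isActive(ParentState state) {
    return !INACTIVE.contains(state);
  }

  // Counts the threads that should be treated as occupied when rebuilding capacities.
  static int countOccupiedThreads(List<TaskInfo> tasks) {
    int occupied = 0;
    for (TaskInfo task : tasks) {
      // Skip tasks whose workflow or job is inactive: they cannot hold a thread.
      if (!isActive(task.workflowState) || !isActive(task.jobState)) {
        continue;
      }
      if (task.taskState == TaskRunState.INIT || task.taskState == TaskRunState.RUNNING) {
        occupied++;
      }
    }
    return occupied;
  }
}
```

In this sketch a task in RUNNING state under a stopped workflow contributes
nothing to the count, which is the behavior the changelist item describes.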
commit d33d9efea25fe9d2bbbb9e84a4ce7614b544ef2d
Author: Hunter Lee <hulee@...>
Date: 2018-10-27T02:03:47Z
[HELIX-764] TASK: Fix LiveInstanceCurrentState change flag
Previously, existsLiveInstanceOrCurrentStateChange was reset in
ClusterDataCache whenever its getter was called. This was problematic because,
with multiple jobs or workflows, only the first caller of the getter would
get the correct flag value; subsequent callers would get false because
the flag had already been reset. This RB fixes that bug by resetting the flag
at the beginning of the refresh() call in ClusterDataCache, so that all
callers during that pipeline run get the same, correct value.
Changelist:
1. Change the getter so that it does not reset the flag; instead, reset the
flag in the beginning of refresh()
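A minimal sketch of the pattern, assuming a simplified cache: the class name
ChangeFlagCacheSketch and the changeDetected parameter are hypothetical, and
the real ClusterDataCache.refresh() detects changes from its own data rather
than taking a boolean.

```java
// Hypothetical, simplified stand-in for the flag handling in ClusterDataCache.
final class ChangeFlagCacheSketch {
  private boolean existsLiveInstanceOrCurrentStateChange;

  // Runs once at the start of each pipeline run.
  void refresh(boolean changeDetected) {
    // Reset here, not in the getter, so the value is stable for the whole run.
    existsLiveInstanceOrCurrentStateChange = false;
    if (changeDetected) {
      existsLiveInstanceOrCurrentStateChange = true;
    }
  }

  // Pure read: calling this any number of times returns the same value
  // until the next refresh().
  boolean getExistsLiveInstanceOrCurrentStateChange() {
    return existsLiveInstanceOrCurrentStateChange;
  }
}
```

With the getter free of side effects, the second and later callers in the same
pipeline run see the same value as the first caller.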
commit 930a4b7ae7eb63be0a751a593ba630ae55fb2cfb
Author: Hunter Lee <hulee@...>
Date: 2018-10-27T02:06:42Z
[HELIX-765] TASK: Build quota profile from scratch every rebalance
It has been reported that instances have a full quota despite no tasks
existing in their CURRENTSTATES. The cause of this is not clear, so making
ClusterDataCache trigger a refresh of all AssignableInstances will ensure that
there aren't situations where it looks like there has been a thread leak.
Optimizations will be implemented if necessary.
Changelist:
1. Make AssignableInstanceManager build all AssignableInstances from
scratch every rebalance
commit 5033785c231af363953367f65f77513911b753f5
Author: Hunter Lee <hulee@...>
Date: 2018-10-27T02:08:02Z
[HELIX-766] TASK: Add logging functionality in AssignableInstanceManager
To debug task-related inquiries and issues, we realized that it would be
very helpful to have a log recording the current quota capacity of all
AssignableInstances. When we see jobs whose tasks are not getting assigned,
this log lets us quickly rule out bugs in quota-based scheduling.
Changelist:
1. Add a method that logs the current quota profile in JSON format, with
an optional flag to log it only when there are quota types whose capacities
are full
2. Add info logs in AssignableInstanceManager
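For illustration, such a method could look roughly like the sketch below. The
names QuotaProfileSketch, describe and onlyIfFull, the map layout and the
"used/total" rendering are all assumptions made for this example, not the
format produced by AssignableInstanceManager. The JSON is built by hand and
does no escaping, so it assumes instance and quota type names contain no
quotes or backslashes.

```java
import java.util.Map;
import java.util.Optional;
import java.util.StringJoiner;

// Hypothetical sketch: profile maps instance name -> quota type -> {used, total}.
final class QuotaProfileSketch {

  // True if any quota type on any instance has used up its whole capacity.
  static boolean anyFull(Map<String, Map<String, int[]>> profile) {
    for (Map<String, int[]> types : profile.values()) {
      for (int[] usedAndTotal : types.values()) {
        if (usedAndTotal[0] >= usedAndTotal[1]) {
          return true;
        }
      }
    }
    return false;
  }

  // Returns the profile as a JSON string, or empty when onlyIfFull is set
  // and no quota type is at full capacity.
  static Optional<String> describe(Map<String, Map<String, int[]>> profile, boolean onlyIfFull) {
    if (onlyIfFull && !anyFull(profile)) {
      return Optional.empty();
    }
    StringJoiner instances = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, Map<String, int[]>> instance : profile.entrySet()) {
      StringJoiner types = new StringJoiner(",", "{", "}");
      for (Map.Entry<String, int[]> type : instance.getValue().entrySet()) {
        int[] usedAndTotal = type.getValue();
        types.add("\"" + type.getKey() + "\":\"" + usedAndTotal[0] + "/" + usedAndTotal[1] + "\"");
      }
      instances.add("\"" + instance.getKey() + "\":" + types);
    }
    return Optional.of(instances.toString());
  }
}
```

A caller would pass the returned string to its logger only when the Optional
is present, which keeps the log quiet unless some quota type is exhausted.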
----
---