[
https://issues.apache.org/jira/browse/STORM-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406303#comment-15406303
]
Robert Joseph Evans commented on STORM-2018:
--------------------------------------------
I propose that we move to a model with 3 main types of threads for the
supervisor itself. Threads in the localizer are separate and not covered here.
1) A single heartbeat (HB) thread (very much like it is now).
2) A single scheduling sync thread that would do the following:
{code}
while (!done) {
    read scheduling from ZK && sanity check with retry like today;
    for (int port: set.union(scheduling.ports, slots.keys)) {
        Slot s = slots.get(port);
        if (s == null) {
            s = new Slot();
            slots.put(port, s);
        }
        s.setNewAssignment(scheduling.get(port));
    }
    sleep(...);
}
{code}
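To make the loop body above concrete, here is a minimal Java sketch of a single sync pass. It is only an illustration under assumptions: `SketchSlot`, `SchedulingSync`, and `syncOnce` are hypothetical names, an assignment is reduced to a plain `String`, and the ZK read with retry is left out entirely. The point it shows is that every port that is either scheduled or already has a slot gets visited, so a port that drops out of the scheduling receives a null assignment.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a slot; it only records the latest requested assignment.
class SketchSlot {
    private volatile String newAssignment; // null means "nothing scheduled on this port"

    void setNewAssignment(String assignment) { this.newAssignment = assignment; }
    String getNewAssignment() { return newAssignment; }
}

// Hypothetical holder for the slots map used by the sync thread.
class SchedulingSync {
    private final Map<Integer, SketchSlot> slots = new ConcurrentHashMap<>();

    // One pass of the loop body; reading the scheduling from ZK is assumed to
    // have happened already and is passed in as a port -> assignment map.
    void syncOnce(Map<Integer, String> scheduling) {
        // Union of scheduled ports and ports that already have a slot.
        Set<Integer> ports = new HashSet<>(scheduling.keySet());
        ports.addAll(slots.keySet());
        for (int port : ports) {
            SketchSlot s = slots.computeIfAbsent(port, p -> new SketchSlot());
            // A port missing from the scheduling yields null here.
            s.setNewAssignment(scheduling.get(port));
        }
    }

    SketchSlot slot(int port) { return slots.get(port); }
}
```

Iterating over the union rather than only the scheduled ports is what lets an existing slot learn that its assignment went away.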
3) A Slot thread per slot. This thread would more or less do the following:
{code}
while (!done) {
    Assignment newAssignment = this.newAssignment;
    StateMachine.transitionIfNeeded(newAssignment, ...);
}
{code}
The state machine itself is described in
[Slot.dot|https://issues.apache.org/jira/secure/attachment/12821873/Slot.dot]
and you can see a visualization in Slot.svg
!Slot.svg!
Slot would have just a few methods to set things asynchronously:
{code}
public void setNewAssignment(Assignment...);
public void informWorkerDied(String workerId...);
{code}
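As a rough illustration of how the per-slot loop and the two asynchronous setters could fit together, here is a small Java sketch. Everything in it is an assumption made for the example: the class name `SlotLoop`, the two-state `State` enum (a placeholder, not the state machine from Slot.dot), the `String` assignment type, the `launches` counter, and the 100 ms pause. It also ignores which worker died and simply treats any death as a reason to relaunch.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; the real state machine is the one described in Slot.dot.
class SlotLoop implements Runnable {
    enum State { EMPTY, RUNNING } // placeholder states for illustration only

    // Written by other threads, read by the slot thread.
    private volatile String newAssignment;
    private final AtomicBoolean workerDied = new AtomicBoolean(false);
    private volatile boolean done = false;

    // Touched only by the slot thread (or by a single caller of step()).
    private String currentAssignment;
    private State state = State.EMPTY;
    private int launches = 0;

    public void setNewAssignment(String assignment) { this.newAssignment = assignment; }
    public void informWorkerDied(String workerId) { workerDied.set(true); } // workerId unused in this sketch
    public void shutdown() { done = true; }
    int launches() { return launches; }

    // One loop iteration: snapshot the inputs, then transition only if needed.
    State step() {
        String requested = this.newAssignment;
        if (workerDied.getAndSet(false)) {
            currentAssignment = null; // forget the dead worker so it is relaunched below
            state = State.EMPTY;
        }
        if (!Objects.equals(requested, currentAssignment)) {
            currentAssignment = requested;
            if (requested == null) {
                state = State.EMPTY;
            } else {
                state = State.RUNNING;
                launches++;
            }
        }
        return state;
    }

    @Override
    public void run() {
        while (!done) {
            step();
            try {
                Thread.sleep(100); // assumed pause so the sketch does not busy-spin
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Because the setters only write a field or flag, callers never block on the slot thread, and all state transitions stay on a single thread.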
Every time the current assignment is updated, it would also be written out
to disk, so that we can recover if we crash.
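One possible way to persist the current assignment so that a crash never leaves a half-written file is to write to a temporary file and then rename it into place. The sketch below assumes a hypothetical `AssignmentStore` class, a `String` payload, and a `.tmp` sibling file; none of these come from the proposal itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the file layout and payload format are assumptions.
class AssignmentStore {
    private final Path file;

    AssignmentStore(Path file) { this.file = file; }

    // Write the new value next to the target, then rename it into place.
    void save(String assignment) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, assignment.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE asks for an atomic rename; whether an existing target is
        // replaced is implementation specific (POSIX file systems replace it).
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // Returns null when nothing has been saved yet.
    String load() throws IOException {
        if (!Files.exists(file)) {
            return null;
        }
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

On restart, a slot could call `load()` to find out what it was running before the crash and resume from there.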
> Simplify Threading Model of the Supervisor
> ------------------------------------------
>
> Key: STORM-2018
> URL: https://issues.apache.org/jira/browse/STORM-2018
> Project: Apache Storm
> Issue Type: New Feature
> Components: storm-core
> Affects Versions: 1.0.0, 2.0.0
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
> Attachments: Slot.dot, Slot.svg
>
>
> We have been trying to roll out CGROUP enforcement and are currently running
> into a number of race conditions in the supervisor. When using CGROUPS, the
> timing of some operations is different, which exposes issues that we would
> not see otherwise.
> In order to make progress with testing/deploying CGROUP and RAS we are going
> to try and refactor the supervisor to have a simpler threading model, but
> likely with more threads. We will base the code off of the java code
> currently in master, and may replace that in the 2.0 release, but plan on
> having it be a part of 1.x too, if it truly is more stable.
> I will try to keep this JIRA up to date with what we are doing and the
> architecture to keep the community informed. We need to move quickly to meet
> some of our company goals but will not just shove this in. We welcome any
> feedback on the design and code before it goes into the community.