[ 
https://issues.apache.org/jira/browse/STORM-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406303#comment-15406303
 ] 

Robert Joseph Evans commented on STORM-2018:
--------------------------------------------

I would propose that we move to a model where we have 3 main types of threads 
for the supervisor itself. Threads in the localizer are different.

1) A single HB (heartbeat) thread (very much like it is now)
2) A single scheduling sync thread that would do the following
{code}
while (!done) {
  // read scheduling from ZK and sanity check it, with retry like today
  for (int port: set.union(scheduling.ports, slots.keys)) {
    Slot s = slots.get(port);
    if (s == null) {
      s = new Slot();
      slots.put(port, s);
    }
    // null when nothing is scheduled on this port any more
    s.setNewAssignment(scheduling.get(port));
  }
  sleep(...);
}
{code}
3) A Slot thread per slot.  This thread would more or less do the following
{code}
while(!done) {
  Assignment newAssignment = this.newAssignment;
  StateMachine.transitionIfNeeded(newAssignment,...);
}
{code}
The state machine itself is described in 
[Slot.dot|https://issues.apache.org/jira/secure/attachment/12821873/Slot.dot] 
and you can see a visualization in Slot.svg
!Slot.svg!
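
As a rough illustration of one pass of the scheduling sync loop in 2), here is 
a minimal Java sketch. The names SketchSlot, SchedulingSync, syncOnce and 
slotFor are made up for this example, and an assignment is simplified to a 
String; reading from ZK and the retry logic are left out entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a slot. The real Slot would own a thread and a state machine.
class SketchSlot {
    // null means "nothing is scheduled on this port"
    private volatile String newAssignment;

    void setNewAssignment(String assignment) {
        this.newAssignment = assignment;
    }

    String getNewAssignment() {
        return newAssignment;
    }
}

// Hypothetical holder for the port -> slot map owned by the sync thread.
class SchedulingSync {
    private final Map<Integer, SketchSlot> slots = new HashMap<>();

    // One pass of the loop. "scheduling" maps port -> assignment (assumed already read and checked).
    void syncOnce(Map<Integer, String> scheduling) {
        // Union of scheduled ports and known slots, so that a slot that is no
        // longer scheduled is told about it (it receives a null assignment).
        Set<Integer> ports = new HashSet<>(scheduling.keySet());
        ports.addAll(slots.keySet());
        for (int port : ports) {
            SketchSlot s = slots.computeIfAbsent(port, p -> new SketchSlot());
            s.setNewAssignment(scheduling.get(port));
        }
    }

    SketchSlot slotFor(int port) {
        return slots.get(port);
    }
}
```

The union is the important part: iterating only over the scheduled ports would 
never tell an existing slot that its assignment went away.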

Slot would have just a few methods to set things asynchronously
{code}
public void setNewAssignment(Assignment...);
public void informWorkerDied(String workerId...);
{code}
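
To make the "set things asynchronously" idea concrete, here is a minimal Java 
sketch of how the two setters could hand data to the slot thread without 
blocking. This is only an assumption about one possible shape: the class name 
AsyncSlot, the step() and shutdown() methods, the String-typed assignment and 
the 100 ms sleep are all invented for the example, the elided parameters are 
dropped, and the state machine is reduced to "adopt the new assignment".

```java
import java.util.Objects;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the proposed Slot class itself.
class AsyncSlot implements Runnable {
    private final AtomicReference<String> newAssignment = new AtomicReference<>();
    private final Queue<String> deadWorkers = new ConcurrentLinkedQueue<>();
    private volatile boolean done = false;

    // The fields below are only touched by the slot thread.
    private String currentAssignment;
    private int deadWorkersSeen;

    // Called by the scheduling sync thread. Only records the value.
    public void setNewAssignment(String assignment) {
        newAssignment.set(assignment);
    }

    // Called by whatever notices the worker exit. Only records the id.
    public void informWorkerDied(String workerId) {
        deadWorkers.add(workerId);
    }

    public void shutdown() {
        done = true;
    }

    // One iteration of the slot loop. Returns true if the current assignment changed.
    boolean step() {
        String wanted = newAssignment.get();
        while (deadWorkers.poll() != null) {
            deadWorkersSeen++; // a real state machine would react to the death here
        }
        if (Objects.equals(wanted, currentAssignment)) {
            return false;
        }
        currentAssignment = wanted; // stand-in for the real state transition
        return true;
    }

    String getCurrentAssignment() {
        return currentAssignment;
    }

    int getDeadWorkersSeen() {
        return deadWorkersSeen;
    }

    @Override
    public void run() {
        while (!done) {
            step();
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Because only the slot thread acts on the recorded values, callers never wait 
on a slot that is busy launching or killing a worker.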

Every time the current assignment is written, it would also be written out to 
disk, so that we can recover if we crash.
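
One way the write to disk could be done, as a sketch only: write to a 
temporary file and then rename it over the real one, so a crash in the middle 
of a write does not leave a half-written file. The class AssignmentStore, its 
save() and load() methods, the ".tmp" suffix and the String-typed assignment 
are assumptions made for this example, not part of the proposal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper that keeps the current assignment of one slot on disk.
class AssignmentStore {
    private final Path file;

    AssignmentStore(Path file) {
        this.file = file;
    }

    // Persist the assignment. A null assignment removes the file.
    void save(String assignment) throws IOException {
        if (assignment == null) {
            Files.deleteIfExists(file);
            return;
        }
        // Write next to the target, then rename over it.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, assignment.getBytes(StandardCharsets.UTF_8));
        try {
            Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fall back to a plain replace where atomic moves are not available.
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Read the assignment back after a restart. Returns null if nothing was saved.
    String load() throws IOException {
        if (!Files.exists(file)) {
            return null;
        }
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

Note that this sketch does not force the data to the device before the rename, 
so it guards against a process crash rather than a power loss. Whether an 
atomic move replaces an existing target is also platform specific.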

> Simplify Threading Model of the Supervisor
> ------------------------------------------
>
>                 Key: STORM-2018
>                 URL: https://issues.apache.org/jira/browse/STORM-2018
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>    Affects Versions: 1.0.0, 2.0.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>         Attachments: Slot.dot, Slot.svg
>
>
> We have been trying to roll out CGROUP enforcement and right now are running 
> into a number of race conditions in the supervisor.  When using CGROUPS the 
> timing of some operations is different, which exposes issues that we would 
> not see otherwise.
> In order to make progress with testing/deploying CGROUP and RAS we are going 
> to try to refactor the supervisor to have a simpler threading model, but 
> likely with more threads.  We will base the code on the Java code 
> currently in master, and may replace that in the 2.0 release, but plan on 
> having it be a part of 1.x too, if it truly is more stable.
> I will try to keep this JIRA up to date with what we are doing and the 
> architecture to keep the community informed.  We need to move quickly to meet 
> some of our company goals but will not just shove this in.  We welcome any 
> feedback on the design and code before it goes into the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
