%
% Licensed to the Apache Software Foundation (ASF) under one
% or more contributor license agreements. See the NOTICE file
% distributed with this work for additional information
% regarding copyright ownership. The ASF licenses this file
% to you under the Apache License, Version 2.0 (the
% "License"); you may not use this file except in compliance
% with the License. You may obtain a copy of the License at
%
% http://www.apache.org/licenses/LICENSE-2.0
%
% Unless required by applicable law or agreed to in writing,
% software distributed under the License is distributed on an
% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
% KIND, either express or implied. See the License for the
% specific language governing permissions and limitations
% under the License.
%


This chapter provides architectural and implementation details for the DUCC
Resource Manager, referred to as the ``RM''.

\section{Introduction}

The DUCC Resource Manager is responsible for apportioning cluster resources to the
collection of ``work'' to be run by users. Work is classified into several categories. As
exposed in the public interface, these categories are:

\begin{description}
    \item[Fair-Share Job] This is a UIMA-AS job, consisting of a minimum of two processes and
      a potential maximum of as many processes as physically fit on a cluster. The work
      executed by the processes is parallel, enabling the RM to expand or contract
      the job by allocating or deallocating processes as needed to balance the load.

      Load is balanced using a weighted fair-share policy in which all users are apportioned an
      equitable amount of the cluster resources, where ``cluster resources'' refers only to real
      memory on the cluster nodes.

    \item[Service Instance] This is any arbitrary process that DUCC manages as a ``service''.
      Services are registered with the Service Manager and may comprise multiple physical
      processes. (See the DuccBook for details of DUCC Service Management.) The RM schedules these
      processes as singletons, using a non-preemptive policy (FIXED\_SHARE or RESERVE).

    \item[Arbitrary Process or ``Managed Reservation''] These are singleton processes of any type, scheduled
      using a FIXED\_SHARE policy.

    \item[Fixed-Share Job] This is a UIMA-AS job scheduled with a non-preemptable, i.e., FIXED\_SHARE,
      policy.

    \item[Reservation] This is a request for a dedicated full machine.
\end{description}

The RM is a memory scheduler only. The use case which justifies DUCC is UIMA-AS jobs, each of
which consists of a variable number of parallel processes, each of which requires a large amount of memory, usually
on the order of 16GB or more. Memory requirements completely overwhelm other resource
requirements, so that jobs scheduled by their declared memory sizes usually get sufficient
other resources such as CPU.
\section{Vocabulary}
   In order to understand RM it is necessary to understand some of the language used in RM.

   \begin{description}
      \item[quantum] This is the smallest memory size of an allocation, defined in multiples of GB. It
        is defined globally in {\em ducc.properties} and may be overridden in {\em ducc.classes} for
        top-level nodepools. See the DuccBook for more details.

        Note that although DUCC defines a quantum, most of the RM does not
        use the quantum at all; instead it generalizes quantum into {\em qshare}, {\em nshare},
        and {\em order} as defined below. When a schedule is returned to the Orchestrator, the
        allocations, in terms of quanta, are translated back to memory allocations using this
        configured {\em quantum}.

      \item[qshare] This is an abstract memory allocation representing exactly one {\em quantum}. Memory
        allocations are made in terms of some multiple of {\em qshares}.

      \item[nshare] This is an allocation which consists of one or more co-located {\em qshares}. When
        exposed outside of RM this is usually thought of as a ``process''. It means, literally,
        ``n qshares''.

        Be careful: an {\em nshare} is NOT a process; it is an allocation that can be put to
        any use, or to no use if desired. The RM does not care what an {\em nshare} is used for.

      \item[order] This is a number which refers to the number of {\em qshares} associated with some
        entity such as a machine, a job, a process, or an {\em nshare}. An ``order 3'' machine is a
        machine whose memory can be counted as ``three qshares''. An ``order 3'' job is a job whose
        processes must be allocated as ``three qshares'' each, or one ``order three'' nshare.

        All {\em qshares} are of order 1, but we don't know how much that is without knowing
        the {\em quantum}.

        {\em Order} is an abstraction of the {\em quantum}. Knowing the {\em order}
        of any entity does not tell one how large that entity is. It does tell one
        how big that entity is in relation to other entities.

        Note that {\em order} is NOT an allocation; it is not a {\em qshare} or an {\em nshare}.
        It is a number which describes the size of things, without saying how big that size
        is or what that thing is.

        Order is used throughout the RM to index arrays and is one of the most fundamental
        concepts in the RM architecture.
   \end{description}

\section{Architectural Overview}
   Throughout this section, please refer to the diagram in \hyperref[fig:rm-structure]{Figure~\ref{fig:rm-structure}}.
   The diagram shows the normal flow through the scheduler, from the time an Orchestrator
   publication arrives to the time the RM publishes its schedule.

\subsection{The Onion}
   At the highest level of abstraction, the RM consists of three layers as seen in
   \hyperref[fig:rm-structure]{Figure~\ref{fig:rm-structure}} below. It can be thought of as a three-layer onion:
   \begin{description}
      \item[JobManagerConverter] This is the outer layer. This layer communicates with the
        ``outside'' world: the DUCC Orchestrator and DUCC agents. It is conceived of as an
        ``impedance matcher'' that converts incoming messages to the RM's internal structures, and
        converts the RM's internal structures into structures recognized by the outside world (the
        DUCC Orchestrator). It is possible to replace this layer without affecting the RM proper; for
        example, to create an isolated simulator for RM development.
      \item[Scheduler] This is the middle layer.
       It communicates on one hand with the {\em JobManagerConverter},
        to receive and send data outside of RM, and on the other with the inner layer, the {\em NodepoolScheduler}.
      \item[NodepoolScheduler] This is the inner layer and is the ``scheduler proper''. Its input is
        a set of allocation requests and its output is a set of node assignments for each request. These are
        passed up through the {\em Scheduler} and again up to the {\em JobManagerConverter} for publication.
   \end{description}

\subsection{Nodes, Machines, and Node Management}
   The resources scheduled against are nodes. On each physical node is a DUCC Agent which
   publishes a regular ``heartbeat''. The ``heartbeat'' contains data describing the
   characteristics of the node used for scheduling: the memory size, the node name, the
   node IP.

   If a node fails, its ``heartbeat'' stops arriving. After some number of missed
   heartbeats, the RM considers the node to be unusable; it will stop scheduling work
   to that node and attempt to evict existing work from the node. Work that can
   be rescheduled (fair-share jobs and services) gets rescheduled on working nodes.

   It is possible to remove a node from scheduling without the node failing, using the
   {\em vary\_off} utility. This causes the RM to stop scheduling work to the
   specified node, and it causes fair-share work to be evicted and rescheduled elsewhere.

   The RM component responsible for managing nodes is {\em NodePool.java}. As each
   node heartbeat arrives, the {\em NodePool} is notified and, if it is the first
   such notification, creates an object the RM calls a {\em Machine} to represent the
   remote resource.

   The RM implements multiple nodepools in a nested or tree structure. There is one NodePool
   object for each configured nodepool. The NodePools are structured in a self-organizing
   tree (that is, none of the Java collection classes are used to organize multiple nodepools).
   Most of the methods inside the NodePool module are recursive, aware of their parent
   NodePools and child NodePools.

\subsection{Jobs}
   The RM is mostly unaware of the purpose for which allocations are made. It uses a
   single structure, the {\em RmJob}, to represent all work. There are a small number
   of minor switches in the code in deference to specific types of work (UIMA-AS jobs
   vs reservations, for example) but these are not sufficient to justify a more elaborate
   object structure.

\subsection{Shares and share-order}
   All RM allocations are in terms of {\em shares}. A share represents some portion
   of the real memory on a node. The smallest {\em share} that can be allocated is
   called a {\em quantum share}. The size of the {\em share quantum} is declared
   in {\em ducc.properties} and can be overridden for {\em top-level nodepools} in
   {\em ducc.classes} (see the DuccBook for the details of quantum configuration).

   In the RM, a single quantum share is called a {\em qshare}. Multiple co-located {\em qshares}
   may be allocated for any given request. A multi-qshare allocation is called an {\em nshare}. An
   {\em nshare} always corresponds to
   \begin{itemize}
      \item A single process of a FAIR\_SHARE job,
      \item A single process of a FIXED\_SHARE job,
      \item A single service instance process,
      \item A single AP (or ``managed reservation''),
      \item A single (unmanaged) reservation.
   \end{itemize}

   Thus a job may be allocated multiple {\em nshares}. {\em Nshares} are usually exposed outside
   of RM under the term {\em process}.
   (There is a difference: an {\em nshare} refers to an
   abstract allocation; a {\em process} refers to some physical process running on a node. RM only
   needs to deal in abstractions.)

   The term {\em share-order} is used to refer to the number of {\em qshares} associated with
   an entity. Entities with an associated {\em share-order} include
   \begin{itemize}
      \item Jobs. The {\em share-order} of a job is the number of {\em qshares} required
        to make a single {\em nshare}. It is a translation of the job's declared memory
        requirements into the number of share quanta required per process.
      \item Nodes (Machines). The {\em share-order} of a machine is the number of
        {\em qshares} that can be allocated on that machine. It is a translation of the
        machine's real memory into the number of share quanta that can be allocated on the machine.
      \item Shares. The order of a {\em share} is the number of {\em qshares} represented
        by the {\em Share} object. Note this implies that a {\em Share} object always
        represents a single {\em nshare}.
   \end{itemize}

   All of the important algorithms in the RM involve managing incoming work and
   resources by their ``order''.

   A job's memory specification is converted to {\em share-order} thus:
\begin{verbatim}
   share_order = job_declared_memory / share_quantum
   if ( job_declared_memory % share_quantum ) > 0
       share_order = share_order + 1
\end{verbatim}
   Note that a job's share order is always rounded UP.

   A machine's announced memory is converted to {\em share-order} thus:
\begin{verbatim}
   share_order = (int) (allocatable_mem / share_quantum);
\end{verbatim}
   Note that a machine's share order is always rounded DOWN.
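   To make the rounding behavior concrete, here is a minimal, self-contained Java sketch of the two
   conversions. The class and method names are illustrative only and do not appear in the RM source;
   only the arithmetic mirrors the formulas above.
\begin{verbatim}
   // Hypothetical illustration of share-order arithmetic; not RM source code.
   public class ShareOrderMath
   {
       // A job's order is rounded UP: a job declaring 17GB against a
       // 15GB quantum needs 2 co-located qshares per process.
       static int jobOrder(long declaredMemGB, long quantumGB)
       {
           int order = (int) (declaredMemGB / quantumGB);
           if ( declaredMemGB % quantumGB > 0 ) order = order + 1;
           return order;
       }

       // A machine's order is rounded DOWN: a 47GB machine against a
       // 15GB quantum can host only 3 qshares; the remainder is unusable.
       static int machineOrder(long allocatableMemGB, long quantumGB)
       {
           return (int) (allocatableMemGB / quantumGB);
       }

       public static void main(String[] args)
       {
           System.out.println(jobOrder(17, 15));      // prints 2
           System.out.println(machineOrder(47, 15));  // prints 3
       }
   }
\end{verbatim}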
\subsection{Users}
   Every job that arrives has an associated user. The RM maintains a map of all users and
   a two-way map of jobs-to-users. Note that every job has exactly one {\em user},
   but every user may have an arbitrary number of {\em jobs}. Thus, a user may
   be associated with work of different {\em order}s, running in different {\em classes}
   under differing policies. The structure defined in {\em User.java} maintains all
   necessary records as used by the scheduler.

\subsection{RM Structure Schematic}
   \begin{figure}[H]
     \centering
     \includegraphics[width=5.5in]{images/ducc-internals/rm-structure.png}
     \caption{RM Structure}
     \label{fig:rm-structure}
   \end{figure}

\section{Outer Layer: JobManagerConverter}

   The {\em JobManagerConverter} is the outermost layer of the RM. It is intended as a
   buffer or ``impedance matcher'' to protect the RM from the ``outside world''. It is
   also intended to be replaceable as needed. It communicates with the middle layer through
   an interface. Any entity that (correctly) uses this interface may act as the outer layer.

   This section describes the most important
   functions of the {\em JobManagerConverter} in detail. We refer to it as the
   JMC for brevity.

   The primary function of the JMC is to receive incoming work in the form of Orchestrator
   publications and convert it into a set of discrete scheduling events to be passed to the inner
   layers of the RM.

\subsection{Incoming Work}
   Key methods involved in receiving work and passing it to the next layer are described here.

   \paragraph{eventArrives()} receives the {\em DuccWorkMap} from the Orchestrator.

   If the RM is not yet initialized the map is ignored.

   If the RM has been recently reconfigured, all structures in the JMC are cleared and
   state is set as if this were the first publication.

   If this is the first publication, we pass the map to the method {\em recoverFromOrchestrator}
   to initialize essential structures for work that has {\em ALREADY} been scheduled and is
   running in other parts of the system. This step is needed in these cases:
   \begin{itemize}
     \item DUCC is being started ``warm''. In this case the Orchestrator map may include
       Reservations, which are permanently scheduled, and must be recovered.
     \item The RM may have been stopped (or crashed) and is being restarted. In this case
       work of all sorts that was already scheduled must be recovered.
     \item The RM may have been dynamically reconfigured. Dynamic reconfiguration requires
       that all internal structures be reset. This is equivalent to stopping and then
       restarting the RM. Work must be recovered.
   \end{itemize}

   The incoming map is now saved for the map-difference code. If this is the first publication,
   RM simply returns.

   All subsequent Orchestrator publications are compared with the previous map and
   all differences are converted to scheduling events.

   There are three types of events:
   \begin{description}
     \item[New work] If the work has never been seen before, it is passed to the method
       {\em jobArrives} for conversion into the RM internal structure {\em RmJob}. The new
       work is passed to the middle layer {\em Scheduler} via {\em signalNewWork()}.
     \item[Completed work] If the work is marked {\em completed} by the Orchestrator it is
       removed from the local map and the {\em Scheduler} is signalled via {\em signalCompletion()}.
     \item[Existing work] The associated {\em process map} for each DuccWork object is differenced against the
       previous map to identify processes which may have
       completed or otherwise changed state. The {\em RmJob} is fetched from the {\em Scheduler} and
       the state of its shares or the job itself is updated. If a process has completed,
       the {\em Scheduler} is signalled via {\em signalCompletion()}, overloaded on share instead of job. If
       at least one process has reached the {\em Running} state the {\em RmJob} is notified so
       the {\em expand-by-doubling} policy can be enacted.
   \end{description}

   Once the incoming events are processed the middle layer is signalled by invoking the method
   {\em schedule()}.
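   The sketch below illustrates the map-difference idea in simplified Java. The types and method
   names here ({\em WorkItem}, {\em Differ}, and the stub bodies) are stand-ins, not the real DUCC
   transport classes; only the differencing pattern is the point.
\begin{verbatim}
   // Simplified illustration of the JMC differencing loop; not RM source code.
   import java.util.HashMap;
   import java.util.Map;

   interface WorkItem { boolean isCompleted(); }

   class Differ
   {
       private Map<Long, WorkItem> previous = new HashMap<>();

       void eventArrives(Map<Long, WorkItem> incoming)
       {
           for ( Map.Entry<Long, WorkItem> e : incoming.entrySet() ) {
               WorkItem old = previous.get(e.getKey());
               if ( old == null )                     signalNewWork(e.getValue());       // new work
               else if ( e.getValue().isCompleted() ) signalCompletion(e.getValue());    // completed work
               else                                   updateExisting(old, e.getValue()); // state change
           }
           previous = new HashMap<>(incoming);  // save for the next differencing pass
           schedule();                          // kick the middle layer
       }

       void signalNewWork(WorkItem w)              { /* hand to Scheduler */ }
       void signalCompletion(WorkItem w)           { /* hand to Scheduler */ }
       void updateExisting(WorkItem o, WorkItem n) { /* diff the process maps */ }
       void schedule()                             { /* run the scheduler */ }
   }
\end{verbatim}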
\subsection{Outgoing Events}
   A schedule is returned from the {\em Scheduler} in the form of a {\em JobManagerUpdate} object. This
   object must be translated into an outgoing publication of the form expected by the Orchestrator. The
   {\em JobManagerUpdate} is passed to the {\em JobManagerConverter.createState()} method for conversion.

   The final publication is returned in the form of an {\em RmStateDuccEvent} which is then passed to the
   Camel framework for publication.

\section{Middle Layer: Scheduler}

   The ``middle layer'' is implemented in {\em Scheduler.java}. This entity must conform to the
   interface {\em ISchedulerMain} to maintain the layered ``onion'' architecture. The ``outer layer''
   does all its interactions with the scheduler proper through this interface.

   The middle layer is relatively straightforward. It is the middle-man between the {\em JobManagerConverter}
   and the scheduler proper, responsible for initialization, global bookkeeping, and dispatching of events
   to the correct objects. We'll simply list the important functions
   and how to find them:

   \begin{description}
     \item[Initialization] {\em Scheduler.init()} is called from the DUCC infrastructure for RM,
       {\em ResourceManagerComponent}. The RM configuration from {\em ducc.properties} is loaded, the
       class configuration from {\em ducc.classes} is loaded, the RM configuration is announced to the
       log, the database of dynamic RM data is cleared, and the ``initialized'' flag is set to ``true''.
     \item[Class configuration] This ({\em initClasses()}) is invoked from {\em init()}. The class configuration is loaded
       into the common {\em NodeConfiguration} object and a set of {\em ResourceClass} objects is
       created. The {\em NodePool} objects are instantiated.
     \item[Re-configuration] This is implemented in the {\em reconfigure()} method. Most internal structures
       are cleared and released, and {\em init()} is invoked as described above.
     \item[Node Publications] All node publications are passed to the method {\em nodeArrives()}. This
       method does a bit of bookkeeping, works out the {\em order} of the node, records the {\em node heartbeat},
       and passes the node to its NodePool for future scheduling.
     \item[Run Scheduler] The {\em schedule()} method is invoked from the outer layer as described
       in the previous section. This method drains incoming events, ``recovers'' any
       previously-scheduled work, updates state to reflect processes which have exited, and enters
       new jobs and users into the system. It then invokes the ``inner layer'', the
       {\em NodepoolScheduler}, on each top-level nodepool. This results in the creation of the new
       schedule, which is passed back to the outer layer for publication by means of the
       {\em dispatch()} method.
     \item[Dispatch] This method ({\em dispatch()}) records the current schedule in the log
       and converts the schedule into a form usable by the {\em JobManagerConverter} for publication. The
       object created here, {\em JobManagerUpdate}, is passed up and published.
     \item[CLI] All CLI requests are handled here, passed in from the outer layer via {\em ResourceManagerComponent}.
   \end{description}

\section{Inner Layer: NodepoolScheduler and NodePool}
   The {\em NodepoolScheduler} and its helper {\em NodePool} comprise the ``scheduler proper''. They are
   both relatively complex. This section discusses their architecture and the general flow of data
   through them. Readers are advised to have code listings handy if the goal is to fully understand
   the DUCC scheduler.

   The {\em NodepoolScheduler} is the ``main'' scheduler. An analogy would be that it is the ``frontal cortex''
   of the brain, doing most of the abstract reasoning required to form a schedule.

   The {\em NodePool} is a helper class, responsible for managing the physical layout of processes ({\em ``nshares''})
   over physical nodes ({\em Machines}). It can be thought of as the ``cerebellum'', controlling the ``arms and legs''
   of the schedule.

   The scheduling rule {\em ``priority''} is implemented by executing the {\em How Much} and {\em
   What Of} phases once for each priority, starting with the ``best'' priority, down to the
   ``worst'' priority. At each stage the scheduler attempts to give away all of its
   resources. Each subsequent cycle through this loop will generally have fewer resources to
   allocate, until either all work is scheduled or all resources are exhausted, whichever comes
   first.
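   The overall control flow can be sketched as follows. This is a schematic paraphrase, not the real
   {\em NodepoolScheduler} main loop: the phase method names are taken from the text below, but their
   signatures, the {\em Policy} enum, and the {\em ResourceClass} interface here are invented.
\begin{verbatim}
   // Schematic of the per-priority scheduling loop; all names are illustrative.
   import java.util.List;

   class SchedulerSketch
   {
       enum Policy { FAIR_SHARE, FIXED_SHARE, RESERVE }
       interface ResourceClass { Policy getPolicy(); }

       void schedule(List<ResourceClass> classesByPriority)  // best priority first
       {
           for ( ResourceClass rc : classesByPriority ) {
               switch ( rc.getPolicy() ) {
                   case FAIR_SHARE:  howMuchFairShare(rc); whatOfFairShare(rc); break;
                   case FIXED_SHARE: howMuchFixed(rc);     whatOfFixed(rc);     break;
                   case RESERVE:     howMuchReserve(rc);   whatOfReserve(rc);   break;
               }
           }
           doEvictions();   // preempt fair-share overages (see below)
           defragment();    // rescue jobs stranded by fragmentation
       }

       // Stubs standing in for the real phase implementations.
       void howMuchFairShare(ResourceClass rc) {}
       void whatOfFairShare(ResourceClass rc)  {}
       void howMuchFixed(ResourceClass rc)     {}
       void whatOfFixed(ResourceClass rc)      {}
       void howMuchReserve(ResourceClass rc)   {}
       void whatOfReserve(ResourceClass rc)    {}
       void doEvictions() {}
       void defragment()  {}
   }
\end{verbatim}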
   After the first two phases are complete in all passes, all fair-share jobs are iterated and any job whose
   physical allocation exceeds the number of resources counted in the ``How Much'' phase has its surplus
   processes preempted. (These preempted resources are NOT added back to the resource pools until the Orchestrator
   confirms they have exited; hence they aren't accounted for in the ``What Of'' phase AT THIS POINT. They
   will be used once they are known to be free.)

   Finally, the {\em defragmentation} phase is executed.

\subsection{NodepoolScheduler}

   We will use a ``divide and conquer'' approach to describe the {\em NodepoolScheduler}. This component consists of
   three primary parts:
   \begin{description}
      \item[How Much.] This phase performs the FAIR\_SHARE calculations as well as working out the
        allotments for FIXED\_SHARE and RESERVE requests. It assumes an ``ideal'' configuration of
        nodes with no conflicts and no fragmentation. There is one {\em How Much} method for
        each of the three scheduling policies ({\em howMuchFairShare(), howMuchFixed(),} and
        {\em howMuchReserve()}).

      \item[What Of.] This phase works closely with the {\em NodePool} to try to find available
        space for the abstract allocations produced by the ``How Much'' phase. It is responsible for initiating
        preemptions, but it never preempts a job below the counts from the ``How Much'' phase. It preserves
        a stable layout by never preempting work that is already allocated unless that work exceeds
        some user's fair share as determined by the ``counts'' from ``How Much''.

        Note that because it is constrained by the existing layout it may not always succeed in
        laying out all work. If this occurs we must perform ``Defragmentation''.

        The three relevant methods are {\em whatOfFairShare(), whatOfFixed(),} and {\em whatOfReserve()}.

      \item[Defragmentation.] After ``What Of'', a pass is made to ensure that every job is allocated
        its fair share. If not, defragmentation is performed to ensure that at least ``some minimum''
        number of processes is allocated for every job. This may involve preemption
        of job processes even for users whose allocations are at or below their fair share.
   \end{description}

   We now describe these three actions in detail.

\subsubsection{How Much}

   For non-preemptive work this is straightforward: the work is assigned whatever is asked for, UP TO
   the configured user allotment (see the DuccBook for details of allotment). Non-preemptive work
   belonging to users who are at or over their allotment is deferred and not passed to further scheduling stages.

   The FAIR\_SHARE algorithm is performed on each of three {\em entities}: the ResourceClass, the User, and
   the RmJob. Throughout the rest of the discussion the term {\em entity} is used to refer to any
   of these when the specific type of object is not relevant. (Each of these entities implements the
   generalized {\em IEntity} interface.)

   \begin{description}
      \item[ResourceClass] Every {\em ResourceClass} is asked to provide a summary of how many {\em nshares}
        of each size it could use, assuming unbounded resources (but constrained by RM rules such
        as the initialization cap and expand-by-doubling). The classes produce an
        array, indexed by {\em share order}, of the number of processes of each order they want allocated.
       To produce this array, the {\em ResourceClass} iterates all jobs ({\em RmJob} structures) assigned to the class and asks
        the same question of the RmJobs: ``In an unbounded world, what is the maximum number of processes you require?''. The
        responding method, {\em RmJob.getJobCaps()}, examines the number of work items not yet completed, the number
        of threads per process, and the job's {\em process\_deployments\_max} to produce an initial guess. It then
        takes into account ``doubling'' to revise the estimate down. It then uses the process initialization
        time and the average execution time per work item to again revise the estimate down if it appears
        new allocations would not be used by the time they were made available. (This process is described in greater
        detail below.)

        The short description of what {\em getJobCaps()} does is this: start with the largest reasonable request
        and whittle it down, using the constraints of the RM rules, to the smallest number of processes that
        is guaranteed to be used, RIGHT NOW.

        The sum of all job caps by ResourceClass, indexed by {\em share\_order}, is used to create
        the scheduling {\em demand}.

        The NodePools are then interrogated to produce a similar array, indexed by {\em share
        order}, of the number of processes they can provide, accounting only for existing
        committed allocations. This produces an idealized view of the {\em resources}.

        The algorithm implemented in {\em apportionQShares} then performs a FAIR\_SHARE allocation of
        {\em nshares} to every job by matching {\em demand} with {\em resources}. We'll describe this allocation in greater detail below.

        At the end of this phase, every {\em ResourceClass} contains a table called {\em given\_by\_order} which is
        the number of {\em nshares}, indexed by {\em share order}, to be assigned to the jobs in the ResourceClass
        according to weighted fair-share. At this stage
        it is not known if it is physically possible to actually fulfill these allocations.

      \item[User] Next, for each resource class, all the users owning jobs in that class are
        gathered. The same weighted FAIR\_SHARE code is executed against the users, counting only jobs
        in the current class, but using the hard-coded weight of ``1'' (one). This results in an
        equitable distribution of the weighted FAIR\_SHARE allocations from the current ResourceClass among
        the users of that class.

        At the end of this phase, every {\em User} contains a table {\em given\_by\_order} which is the total
        shares allocated to this user, for all jobs in this class.

      \item[Job] After allocating jobs among users for each resource class, each {\em User} with
        jobs in the class has the shares apportioned by the previous steps divided equally among all
        their jobs in that class, again using the same weighted FAIR\_SHARE routine with the hard-coded
        weight of ``1'' (one).

        At the end of this phase, as before, all affected {\em RmJob}s have a table {\em
        given\_by\_order} which contains the number of {\em nshares} assigned to that
        job.
   \end{description}

\subsubsection{apportionQShares()}
   The method {\em apportionQShares()} performs the FAIR\_SHARE allocation for the
   ``How Much'' phase.

   The {\em apportionQShares()} method is much more involved than simply performing a weighted apportionment
   of some abstract number of qshares among the various entities (ResourceClass, User, RmJob).
   Because
   every resource may be of a different {\em share order}, and the set of jobs being scheduled to a ResourceClass
   is generally of mixed {\em share order}s, this method must perform an equitable distribution of {\em qshares}, but it
   must assign them as {\em nshares} that can also be physically allocated. We must perform weighted fair-share
   against the ``demand'' in terms of the basic scheduling unit, {\em qshares}, but we must produce a tentative schedule in terms of
   {\em nshares} which can be mapped to real, known, physical machines.

   Stated simply, it is useless to allocate shares on a machine of order $n$ to a job of order $>n$: the
   job won't ``fit''.

   In {\em apportionQShares()} we perform a series of iterations by decreasing {\em share\_order},
   each iteration performing a ``fair share'' allocation of resources among the entities of that order,
   but using the TOTAL demand in {\em qshares} of each entity, ignoring for the moment whether it
   will ``fit''.

   At the end of each iteration, entities which have their ``demand'' satisfied at the current
   order are removed, and the iteration is repeated with the next smaller order, until either all
   ``demand'' is satisfied or all resources are exhausted.

   This produces an imperfect schedule that is ``pretty close'' and is computationally simple to
   produce. The defragmentation step at the end of scheduling provides additional correction.

   The general mechanism is as follows (a code sketch appears after the list):
   \begin{itemize}
     \item Initialize the {\em wanted\_by\_order} structure (the number of {\em nshares} of each {\em share order}
       wanted by all entities in the current pass).
     \item Starting at the largest share order, called the ``current order'',
       \begin{itemize}
         \item Calculate the weighted FAIR\_SHARE for only the entities of ``current order'', against all resources of
           the current order or greater, using the total unfulfilled {\em demand} for each entity.
         \item Assign new {\em nshares} to the entities, incrementing their {\em given\_by\_order} for the current order.
         \item Decrement the entities' {\em wanted\_by\_order} (i.e., their {\em demand}).
         \item Remove all entities whose total allocation has been satisfied.
         \item Decrement the ``current order'' and repeat.
       \end{itemize}
     \item If any {\em wanted\_by\_order} has non-zero entries, repeat this entire procedure until either all of {\em wanted\_by\_order}
       becomes zero, or until no more resources are assigned (meaning they have been exhausted).
   \end{itemize}
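   The following condensed Java sketch captures the shape of the descending-order loop. It is NOT the
   real {\em apportionQShares()}: it omits the outer repeat pass and most of the weighted-fair-share
   arithmetic, and the {\em Entity} type and field names are invented for illustration.
\begin{verbatim}
   // Condensed illustration of the descending-order apportionment loop.
   import java.util.ArrayList;
   import java.util.List;

   class ApportionSketch
   {
       static class Entity {
           int order;       // share order of this entity's nshares
           int wantedQ;     // unfulfilled demand, in qshares
           int weight = 1;  // fair-share weight
           int givenN = 0;  // nshares granted so far
       }

       // freeQByOrder[o] is the number of qshares usable at order o or greater.
       void apportion(List<Entity> entities, int[] freeQByOrder, int maxOrder)
       {
           for ( int order = maxOrder; order > 0; order-- ) {
               List<Entity> here = new ArrayList<>();
               int totalWeight = 0;
               for ( Entity e : entities ) {
                   if ( e.order == order && e.wantedQ > 0 ) { here.add(e); totalWeight += e.weight; }
               }
               if ( here.isEmpty() ) continue;

               for ( Entity e : here ) {
                   // Weighted slice of the qshares usable at this order ...
                   int fairQ = freeQByOrder[order] * e.weight / totalWeight;
                   // ... granted as whole nshares, never exceeding remaining demand.
                   int grantN = Math.min(fairQ / order, e.wantedQ / order);
                   e.givenN  += grantN;
                   e.wantedQ -= grantN * order;
                   freeQByOrder[order] -= grantN * order;
               }
           }
       }
   }
\end{verbatim}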
   After this phase has been executed for every entity, every {\em RmJob} has a table of
   ``counts'' which indicates the number of processes to be allocated to it.

\subsubsection{What Of}

   The {\em What Of} phase attempts to find physical resources to match the ``counts'' from ``How Much''. Note
   that we no longer deal with Users. We use ResourceClasses in this phase ONLY to find the correct
   NodePool. The RmJob is the focal point of ``What Of''.

   The general mechanism is the same for all types of allocations at this point: collect all the jobs
   for each resource class, and ask the NodePool assigned to that class to find ``open'' spots
   of the right size for every counted {\em nshare}.

   If the job is already fully allocated (its ``counts'' are less than or equal to the number of
   processes it owns), this phase is done for that job. If not, the NodePool begins a search
   among its resources for machines with sufficient space for the job.

   Note that pending preemptions are treated the same as non-preempted allocations. Until the Orchestrator
   has confirmed that a process is completed, the RM assumes the space is still occupied.

   The nodepool search may be a recursive search, starting at the nodepool that is directly assigned to the
   current job. If the job is non-preemptable, there is no recursion: the search occurs ONLY in the job's assigned nodepool. Otherwise
   the search proceeds as follows, for each job (a recursive sketch appears after the list):

   \begin{itemize}
     \item Set the ``current nodepool'' to the nodepool of the job's declared class.
     \item Collect all machines with sufficient capacity for the current job.
     \item If a process for the current job can be allocated, do so.
     \item If no process can be allocated and there are ``child'' nodepools, set the
       ``current nodepool'' to the next ``child''.
     \item Repeat this iteration, descending through the child nodepools, until
       a process is allocated or all descendants are exhausted.
   \end{itemize}
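   A minimal recursive rendering of this descent follows. The names are invented ({\em findShare},
   {\em allocateLocally}, and so on); the real NodePool methods carry considerably more state.
\begin{verbatim}
   // Schematic of the recursive nodepool descent; names are illustrative.
   import java.util.ArrayList;
   import java.util.List;

   class NodePoolSketch
   {
       List<NodePoolSketch> children = new ArrayList<>();

       // Try to allocate one nshare of the given order for the job, starting
       // here and descending through child nodepools if this pool is full.
       boolean findShare(JobSketch job, int order)
       {
           if ( allocateLocally(job, order) ) return true;   // space found here
           for ( NodePoolSketch child : children ) {
               if ( child.findShare(job, order) ) return true;
           }
           return false;                                     // pool subtree exhausted
       }

       boolean allocateLocally(JobSketch job, int order)
       {
           /* bin-pack one nshare onto a machine in this pool, if possible */
           return false;
       }

       static class JobSketch {}
   }
\end{verbatim}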
\subsubsection{Interlude: Preemption}
   After {\em What Of}, we must initiate preemptions. This is relatively straightforward and is performed
   BEFORE {\em defragmentation}. It is performed by the method {\em doEvictions()}.

   The {\em NodepoolScheduler} iterates all FAIR\_SHARE jobs and checks their {\em given\_by\_order}
   array against the number of processes actually allocated to the job. If they do not match it is
   because
   \begin{enumerate}
     \item The job is expanding but the {\em What Of} phase could not find resources, or
     \item The job should shrink because the {\em How Much} phase reduced its fair share
       to make space for other work.
   \end{enumerate}

   If the job must shrink, the RmJob's {\em shrinkBy()} method is called with the number of {\em nshares} it
   must shed. The {\em RmJob} sorts its shares by investment and preempts the requisite number
   of processes.

   The investment sort is actually a ``multiple'' sort, comparing data provided by the Orchestrator about
   the {\em processes} assigned to the job. The
   shares (representing physical {\em processes} here)
   are sorted {\em least investment} first, as follows (a comparator sketch appears after the list):
   \begin{enumerate}
     \item A share that has not completed initialization is ``less than'' any share that has completed
       initialization.
     \item If neither share has completed initialization, the share with the least initialization time is ``less than''
       the other share.
     \item If both shares have completed initialization, the share with the lowest investment is ``less than'' the other share.
     \item If both shares have the same investment, the share in the ``most deeply nested nodepool'' is ``less than'' the other
       share.
     \item Otherwise, the share with the lowest ID is ``less than'' the other share (the newest share has the lower ID).
   \end{enumerate}
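   Expressed as a Java {\em Comparator}, the ordering looks roughly like the following. The accessor
   names here are invented for illustration; the real comparison lives in the RmJob's share sort.
\begin{verbatim}
   // Least-investment-first ordering, as a schematic Comparator.
   import java.util.Comparator;

   class ByInvestment implements Comparator<ShareSketch>
   {
       public int compare(ShareSketch a, ShareSketch b)
       {
           if ( a.isInitialized() != b.isInitialized() )          // rule 1
               return a.isInitialized() ? 1 : -1;
           if ( !a.isInitialized() ) {                            // rule 2
               if ( a.getInitTime() != b.getInitTime() )
                   return Long.compare(a.getInitTime(), b.getInitTime());
           } else if ( a.getInvestment() != b.getInvestment() ) { // rule 3
               return Long.compare(a.getInvestment(), b.getInvestment());
           }
           if ( a.getNodepoolDepth() != b.getNodepoolDepth() )    // rule 4: deeper is "less"
               return Integer.compare(b.getNodepoolDepth(), a.getNodepoolDepth());
           return Long.compare(a.getId(), b.getId());             // rule 5: lowest ID is "less"
       }
   }

   interface ShareSketch {
       boolean isInitialized();
       long getInitTime();
       long getInvestment();
       int  getNodepoolDepth();
       long getId();
   }
\end{verbatim}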
   NOTE: This is a significant simplification over the original eviction code. The original code is still
   in the source files under {\em shrinkByInvestment()}, for reference, but it is no longer used.

   All preempted shares remain attached to their jobs and are NOT deallocated until the Orchestrator
   confirms their exit. They are marked ``pending removal'', however, so the existing bookkeeping is able
   to account for them during future preemption stages and defragmentation.

   NOTE: Once a share is published as ``preempted'' to the Orchestrator, it cannot be retrieved. Thus, if
   the preemption takes a long time to complete, and the state of the system changes so the job can
   re-expand, the preemption is NOT canceled. This can be observed occasionally in the logs as
   jobs that are both shrinking and expanding simultaneously.

\subsubsection{Defragmentation}

   Once preemption is complete the {\em defragmentation} phase begins.

   Because the ``counts'' from ``How Much'' are {\em abstract} counts, derived from an idealized set
   of resources representing the real, physical machines as presented by the NodePools, the ``What Of''
   phase will ALWAYS succeed in finding allocations IF no preemptions are required and there is no
   fragmentation in the system. The ``What Of'' phase always attempts to minimize fragmentation by
   using a simple bin-packing scheme that packs the largest allocations first and the smaller
   allocations into the ``holes''.

   Here is a very simple example of fragmentation.

   Figure~\ref{fig:rm-fragmentation-1} shows an ideal allocation of two jobs of different sizes. Job A has been
   assigned 5 {\em qshares} for 5 order-1 {\em nshares} (processes). Job B is assigned
   4 {\em qshares} for 2 order-2 {\em nshares}. Both jobs are fully allocated and ``What Of''
   will generally succeed in accomplishing this.

   \begin{figure}[H]
     \centering
     \includegraphics[width=5.5in]{images/ducc-internals/rm-structure-1.png}
     \caption{Unfragmented Layout}
     \label{fig:rm-fragmentation-1}
   \end{figure}

   However, as time proceeds and jobs come and go, it is possible that job A would get
   allocated as in Figure~\ref{fig:rm-fragmentation-2}. Now job B can only get 1 process: exactly HALF its
   ``deserved'' allocation. It would be necessary to preempt one of job A's processes to make space,
   even though job A is not above its fair-share allocation.

   \begin{figure}[H]
     \centering
     \includegraphics[width=5.5in]{images/ducc-internals/rm-structure-2.png}
     \caption{Fragmented Layout}
     \label{fig:rm-fragmentation-2}
   \end{figure}

   Of course this is a simplistic example. In general the situation is significantly more complex.

   The goal of {\em defragmentation} is to reconfigure job A as in Figure~\ref{fig:rm-fragmentation-1} so that
   job B can get its full allocation.

   The general procedure for defragmentation is as follows:

   \paragraph{Detection} This is performed in the method {\em detectFragmentation()}.
   After ``What Of'', all jobs are iterated. Two numbers are derived for each job:
   \begin{enumerate}
     \item The number of ``deserved'' shares. During the ``How Much'' phase,
       we perform a weighted fair-share assignment of resources. Often a job
       cannot use its full ``fair share'' allotment; for example, it may be a
       new job and only need two initial processes. The extra resources are
       apportioned to other jobs, which end up with MORE than their proper
       weighted fair-share allotment.

       The number of ``deserved'' shares is the job's TRUE fair-share allotment,
       calculated BEFORE bonus shares are allocated to it. This number is
       calculated during the ``How Much'' phase and stored in each RmJob as
       the {\em pure\_fair\_share}.

     \item The number of allocated shares. This number is calculated (in {\em RmJob}) as
\begin{verbatim}
   shares_allocated + pending_expansions - pending_preemptions
\end{verbatim}
   \end{enumerate}

   If the number of ``deserved'' shares is greater than the number of allocated
   shares (accounting for expansion and preemption), the job is considered
   ``potentially needy''.

   If there are no ``potentially needy'' jobs, {\em defragmentation} is done and
   we can proceed to broadcast the schedule.
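   In simplified Java, the detection test reduces to something like this (the parameter names are
   illustrative; {\em pure\_fair\_share} is the real RmJob concept described above):
\begin{verbatim}
   // Schematic neediness test from the Detection step; not RM source code.
   class DetectSketch
   {
       static boolean potentiallyNeedy(int pureFairShare,
                                       int sharesAllocated,
                                       int pendingExpansions,
                                       int pendingPreemptions)
       {
           int allocated = sharesAllocated + pendingExpansions - pendingPreemptions;
           return pureFairShare > allocated;  // deserves more than it effectively holds
       }
   }
\end{verbatim}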
   The second goal of defragmentation is to minimize ``churn'' in the system. We
   do NOT attempt to achieve a perfect layout. Instead, there is a threshold
   minimum number of processes we try to guarantee every job. This number is configured
   in {\em ducc.properties} as {\em ducc.rm.fragmentation.threshold}.

   A pass is now made over every ``potentially needy'' job. Every such job with an
   allocation that is greater than the {\em fragmentation threshold} is removed from the
   ``needy'' list. All remaining jobs are considered ``actually needy''.

   If there are no ``actually needy'' jobs, {\em defragmentation} is done and we can
   proceed to broadcast the schedule.

   Otherwise, the method {\em doFinalEvictions()} is called to try to make space for the
   ``actually needy'' jobs. We perform a ``take from the rich and give to the poor''
   procedure to ensure that jobs whose allocations are below both their ``deserved fair share''
   and the ``fragmentation threshold'' are assigned additional resources.

   NOTE: This procedure works for non-preemptable allocations as well. For non-preemptable
   allocations, the ``deserved'' value is exactly 1 {\em nshare} and any such job
   with no allocations is considered ``actually needy''.

   We iterate all users and add up the total {\em qshares} occupied by all their jobs,
   ordering the users by this value, known as their ``wealth''.

   We iterate the ``actually needy'' jobs. For each such job we iterate the ``wealthy'' users,
   starting from the ``wealthiest'', inspecting their jobs to see if any of their processes are
   allocated over resources that could be reassigned to the needy job. Note that the removal of a share
   must NOT result in an otherwise non-needy job becoming ``needy''. If a suitable process is found, the user's wealth is
   decremented and one of two things occurs:
   \begin{enumerate}
     \item If the selected process is a ``pending expansion'' that has not been published,
       it is immediately reassigned to the needy job. (Note that this is an optimization and
       the one exception to the rule that once an allocation is finalized in {\em RmJob}
       it cannot be changed.) If the job is no longer needy it
       is removed from the needy list.
     \item Otherwise, the selected process is preempted and the needy job is placed on
       a global ``needyJobs'' list. Jobs on this list get priority allocation BEFORE
       any new allocations are made in all subsequent scheduling cycles, until they
       are no longer needy.
   \end{enumerate}

   Note the conditions which must be met by a process before it can be donated to a needy job
   (verified in the method {\em takeFromTheRich()}; a code sketch of this test follows the list):
   \begin{itemize}
     \item The machine containing the share must be of sufficient {\em share order}.
     \item The share must be preemptable.
     \item The machine must be in a compatible nodepool.
     \item If this share is evicted, the owning job must not become ``needy''.
     \item If this share is evicted, it must leave sufficient space on the machine for the new
       share; i.e., if it is impossible to clear enough space on the machine for the needy job,
       there is no point evicting this share. We iterate all shares on the machine at this point
       and try to evict sufficient shares (which of course must belong to ``wealthy'' users) to
       make space for the needy share.
   \end{itemize}
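   A condensed Java rendering of that eligibility test follows; all types and accessors here are
   invented for illustration, and the real {\em takeFromTheRich()} interleaves this test with the
   eviction bookkeeping.
\begin{verbatim}
   // Schematic of the takeFromTheRich() eligibility conditions.
   class EvictionSketch
   {
       boolean canDonate(ShareS s, MachineS m, JobS owner, JobS needy)
       {
           if ( m.getShareOrder() < needy.getShareOrder() ) return false; // machine too small
           if ( !s.isPreemptable() )                        return false; // e.g., a reservation
           if ( !m.inCompatibleNodepool(needy) )            return false; // wrong nodepool
           if ( owner.wouldBecomeNeedy(s) )                 return false; // don't create new victims
           return m.clearableSpace(needy) >= needy.getShareOrder();       // eviction must actually help
       }

       interface ShareS   { boolean isPreemptable(); }
       interface MachineS { int getShareOrder();
                            boolean inCompatibleNodepool(JobS j);
                            int clearableSpace(JobS j); }
       interface JobS     { int getShareOrder();
                            boolean wouldBecomeNeedy(ShareS s); }
   }
\end{verbatim}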
\subsection{NodePool}

   The {\em NodePool} object manages physical nodes, represented in the RM by an
   object called {\em Machine}. The collection of NodePools forms a tree structure,
   with each ``nested'' nodepool managed as a ``child'' of its ``parent'' nodepool.

   There are many more methods in NodePool than are documented here. In this section
   we review only the most important, or most complicated, methods.

   The RM supports multiple disjoint NodePools, known as ``top-level'' nodepools. The collection
   of ``top-level'' nodepools partitions the entire nodespace into independently scheduled
   resource collections. The motivation is to permit multiple, disparate collections of nodes to
   be managed under a single DUCC system (rather than running multiple independent DUCC systems).

   Most of the NodePool algorithms are recursive. Both the {\em Scheduler} and the
   {\em NodepoolScheduler} objects generally interact with the top NodePool of each
   tree, which coordinates, through recursion, the direction of requests to the
   correct, possibly nested, target NodePool.

   For example, to count the machines in a nodepool, one generally wants the count of
   machines in the pool PLUS the machines in its children:
\begin{verbatim}
    /**
     * How many do I have, including recursing down the children?
     */
    int countMachines()
    {
        int count = allMachines.size();
        for ( NodePool np : children.values() ) {
            count += np.countMachines();
        }
        return count;
    }
\end{verbatim}

   In the cases where recursion is to be inhibited, most of the methods are modified
   with the name ``Local'':
\begin{verbatim}
    /**
     * Non-recursive machine count.
     */
    int countLocalMachines()
    {
        return allMachines.size();
    }
\end{verbatim}

   Most of the methods in {\em NodePool} are short and easily understood, like the two above. There
   are a few subtleties in NodePool which will be expanded upon below.

\subsubsection{NodePool Reset}
   All scheduling phases must be aware of what physical resources exist, which are in use, and
   which are available for scheduling. As we proceed with scheduling we need to maintain scratch space
   that represents the current ``potential'' schedule, without perturbing the existing allocations.

   The NodePool provides exactly this scratch space. Before the two main scheduling phases, ``How
   Much'' and ``What Of'', the NodePool is instructed to {\em reset()}. The NodePool (and, recursively,
   the entire set of nested NodePools) drops all of its structures other than the most basic Machine
   structures and then rebuilds them from the Machine structures. The scheduling phases then create
   ``trial'' schedules, resetting the NodePool as often as necessary.

   This also has the side-effect that errors do not tend to accumulate in the system; we essentially
   reboot the schedule on every pass.

   Finalizing the schedule is done in the Machine objects, with some help from the RmJob and
   Share objects.

   (NOTE: as an optimization, RM does NOT generally rebuild Machine and RmJob from scratch on each
   Orchestrator publication. They ARE rebuilt whenever RM starts, and during dynamic RM reconfiguration.)

   The next section, {\em Virtual Machines}, provides a concrete example of the use of NodePool as
   scratch space during scheduling.

\subsubsection{Virtual Machines}
   Probably the most important part of the scheduler is encapsulated in the NodePool method
   {\em rearrangeVirtual()}. This method treats the collection of all ``real'' machines as
   a collection of ``virtual'' machines, which is the resource set that is scheduled against.

   In the RM's view, a ``virtual machine'' is any PROPER subset of a ``real machine''. (Recall that
   the mathematical definition of a PROPER subset is any subset of some set that is not equal
   to that set.)

   As soon as an allocation of a single {\em nshare} is made against a machine, that machine's
   capacity for further allocations is diminished until the allocation is released by the Orchestrator. For example, an
   order-3 allocation against a ``real'' order-5 machine results in the diminution of the order-5 machine
   to a ``virtual'' order-2 machine. To put it differently, making a 3-quantum allocation against a 5-quantum
   ``real machine'' results in a 2-quantum ``virtual machine''.

   To understand what {\em rearrangeVirtual()} does it is important to understand three tables.
   These three tables are indexed by {\em share order} and are the key structures for both ``How
   Much'' and ``What Of''. These tables are:
   \begin{description}
     \item[nMachinesByOrder] This table contains the number of full, free ``real machines'' with no allocations,
       indexed by {\em share order} 1, 2, \ldots {\em maxorder}.
     \item[vMachinesByOrder] This table contains the number of ``virtual machines'', indexed by {\em share order}.
     \item[nSharesByOrder] This table contains the number of {\em nshares} of every order which can
       currently be allocated.
   \end{description}

   There is no overlap between {\em nMachinesByOrder} and {\em vMachinesByOrder}. Therefore, the number
   of schedulable ``machines'' of any kind for some specific order {\em O} is
\begin{verbatim}
   nMachinesByOrder[O] + vMachinesByOrder[O]
\end{verbatim}

\paragraph{nSharesByOrder} is derived from the two machine tables and the meaning of its values
   is subtly different. The numbers in the machine tables are independent of each other. For example, if
   there is a single order-5 ``real machine'', this does NOT imply that there is also an order-3
   ``virtual machine'' and an order-2 ``virtual machine''. This breakdown can only happen after
   allocation.

   {\em nSharesByOrder}, however, gives the number of {\em nshares} of an order that might be
   allocated from any possible machine, real or virtual, allowing that a larger share may need to
   be split. Each value in the table is dependent on the values of higher order in the table. For
   example, if there is 1 order-5 ``real machine'', nSharesByOrder will indicate there is 1
   order-5 share available, or 1 order-4 share, or 1 order-3 share, or 2 order-2 shares, or 5
   order-1 shares. Here is an example of what these tables might look like at some point during scheduling:
\begin{verbatim}
   Order                 0   1   2   3   4
   ------------------  --------------------
   nMachinesByOrder[]  [ 0   2   0   1   4 ]  - physical machines
   vMachinesByOrder[]  [ 0   1   2   0   0 ]  - virtual machines
   nSharesByOrder[]    [ 0  26  11   5   4 ]  - collective nshares for each order
\end{verbatim}
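   To see where these numbers come from, consider the order-1 and order-2 rows. Every free qshare in
   the pool can serve as an order-1 share, so {\em nSharesByOrder[1]} is simply the total free qshares;
   order-2 shares come from order-2 virtual machines plus splits of the larger machines:
\begin{verbatim}
   nSharesByOrder[1] = 2*1 + 1*3 + 4*4      (real machines of orders 1, 3, 4)
                     + 1*1 + 2*2            (virtual machines of orders 1, 2)
                     = 26

   nSharesByOrder[2] = 2                    (virtual order-2 machines)
                     + 1*(3/2) + 4*(4/2)    (order-2 shares split from orders 3 and 4)
                     = 11
\end{verbatim}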
\subsubsection{rearrangeVirtual(Machine M, order O)}
   We can now explain this method. It is called when we wish to allocate a single
   {\em nshare} of order {\em O} from machine {\em M}. The accounting works as follows:
   if the machine has no allocations, decrement {\em nMachinesByOrder[O]} by one; else
   decrement {\em vMachinesByOrder[O]} by one. If the allocation would cause the
   free space to be split, calculate the order of the free space after allocation and
   increment the correct value in {\em vMachinesByOrder}, like this:

   \paragraph{First Step} Update the two machine tables.
\begin{verbatim}
    int v_order = M.getVirtualShareOrder();  // How much free space in the machine?
    int r_order = M.getShareOrder();         // How much total space in the machine?

    if ( v_order == r_order ) {              // Free == total?
        nMachinesByOrder[r_order]--;         // Yes, full machine allocation
    } else {
        vMachinesByOrder[v_order]--;         // No, virtual machine allocation
    }

    v_order -= O;                            // Does it cause a split?
    if ( v_order > 0 ) {                     // Yes
        vMachinesByOrder[v_order]++;         // Add a "new", smaller virtual machine
    }
\end{verbatim}
   There are, of course, additional details, which can be seen by inspecting the
   full source listing.

   \paragraph{Second Step} Update the share table. We initialize the table with the total of real
   and virtual machines by order. Then, in a double iteration, we look ``forward'' to count the number
   of shares that might be acquired from higher-order allocations by splitting the space. The full
   method is included here for the curious. Everyone else can simply trust that it is correct.
\begin{verbatim}
    protected void calcNSharesByOrder()
    {
        int len = nMachinesByOrder.length;

        // Init nSharesByOrder to the sum of the 'n' and 'v' MachinesByOrder tables
        System.arraycopy(nMachinesByOrder, 0, nSharesByOrder, 0, len);
        for ( int i = 0; i < getMaxOrder() + 1; i++ ) {
            nSharesByOrder[i] += vMachinesByOrder[i];
        }

        for ( int o = 1; o < len; o++ ) {          // counting by share order
            for ( int p = o+1; p < len; p++ ) {
                if ( nSharesByOrder[p] != 0 ) {
                    nSharesByOrder[o] += (p / o) * nSharesByOrder[p];
                }
            }
        }
    }
\end{verbatim}

\subsubsection{connectShare(Share s, Machine m, IRmJob j, int order)}
   This helper method is responsible for updating all the records in order to
   allocate a specific share on a specific machine for a specific job. Its
   action is irreversible: once this method is called, the share is irrevocably
   assigned to the given job on the given machine (except sometimes, during
   defragmentation, as described above).

   {\em rearrangeVirtual()} is called at the end to update the internal ``counts''.

\subsubsection{compatibleNodepool(Policy p, ResourceClass rc)}
   This method determines if the current nodepool is compatible with the indicated
   scheduling policy and resource class. If the policy is FAIR\_SHARE, recursion
   through the child nodepools is performed.

\subsubsection{nodeArrives}
   This straightforward method adds a node to the list of schedulable nodes. It updates the
   database, deals with unresponsive nodes becoming responsive again, and does
   simple bookkeeping.

\section{RmJob}

   The RmJob is mostly an accounting object. While its implementation has many details,
   there are two important methods: {\em calcJobCaps()} and {\em shrinkBy(int count)}, both
   of which were briefly mentioned above.

\subsection{calcJobCaps()}
   If the {\em rearrangeVirtual()} code described above has a rival for ``most important method'',
   it would be the RmJob's {\em calcJobCaps()}. This method is called many times throughout
   scheduling and is required to return {\em exactly} the number of shares the job could make
   use of at the current moment, if there were unbounded resources.

   Note that this is the method to modify if you wish to change the rate of expansion or
   contraction of a job.

   Because it is called so often, the scheduler iterates all jobs at the start of each
   scheduling cycle and calls {\em initJobCap()} to calculate the cap based on the current job
   state.
   This caches the actual cap, which is returned in subsequent calls to
   {\em calcJobCaps()}.

   The design point is this: estimate the cap as the largest value that is meaningful, then
   whittle it down to the minimum by applying the architected constraints, such as
   the ``initialization cap'' and the prediction of when we expect the job to complete. We want
   everything we can get, but no more than we can use.

   This code can be tricky to understand, so we'll present it here. The returned {\em actual\_cap} is
   the value used by the NodepoolScheduler's ``How Much'' phase for all {\em entities} to determine share allocations.

   The following steps are taken by {\em initJobCap()} (a consolidated sketch follows the list):
   \begin{enumerate}
     \item If the job is unschedulable (refused), set the cap to 0 and return. (No shares will be allocated.)
     \item If the job is completed but not yet deallocated, set the cap to the total shares
       it already has allocated and return. (No additional shares will be allocated.)
     \item Set the tentative cap to the number of remaining {\em work items} divided by the declared
       threads per process. This is the upper bound on the cap:
\begin{verbatim}
   c = (n_remaining_questions / nthreads)
\end{verbatim}

     \item Adjust the tentative cap to the maximum of ``c'' and the number of shares already
       allocated. This accounts for jobs ``winding down'': when work items start to vacate
       processes we have more processes than are needed for the remaining work, but we
       want to ensure that ``How Much'' does not cause premature shrinkage.
\begin{verbatim}
   int currentResources = countNShares();
   c = Math.max(c, currentResources);
\end{verbatim}

     \item Adjust the tentative cap to the minimum of ``c'' and the declared {\em process\_deployments\_max}.
       Call this the ``base cap''. It is the job cap before accounting for prediction and is
       used if we cannot find a better estimate.
\begin{verbatim}
   int base_cap = Math.min(getMaxShares(), c);
\end{verbatim}

     \item Predict the number of shares this job could use on an unbounded system,
       based on the average initialization time of its processes and the rate of completion
       of the work items so far. Call this the ``projected\_cap''.
\begin{verbatim}
   int projected_cap = getProjectedCap();
   if ( projected_cap == 0 ) {       // we know nothing, this is the best guess
       projected_cap = base_cap;
   }
\end{verbatim}

     \item All else being equal, the potential cap for the job is now the max of the actual
       resources we have allocated and the projected cap. It is the largest number of
       resources we believe the job can ever use.
\begin{verbatim}
   potential_cap = Math.max(projected_cap, currentResources);
\end{verbatim}

     \item If we're still initializing, and we have configured {\em ducc.rm.initialization.cap}
       in {\em ducc.properties}, revise the cap down and return the {\em actual\_cap}.
\begin{verbatim}
   actual_cap = Math.min(potential_cap, (resource_class.getInitializationCap()));
\end{verbatim}

     \item If we're still initializing and we do NOT have an initialization cap configured,
       set the {\em actual\_cap} to the {\em potential\_cap} and return.
\begin{verbatim}
   actual_cap = potential_cap
\end{verbatim}

     \item If we've completed at least one initialization, and we have configured
       {\em ducc.rm.expand.by.doubling}, return the smaller of the {\em potential\_cap}
       and TWICE the currently allocated resources:
\begin{verbatim}
   actual_cap = Math.min(potential_cap, currentResources * 2);
\end{verbatim}

     \item If we've completed at least one initialization, and we do NOT use
       expand-by-doubling, return the {\em potential\_cap}:
\begin{verbatim}
   actual_cap = potential_cap
\end{verbatim}

     \item There is one last corner case. It is possible the job has
       shrunk to 0 resources (pushed out by fair-share, for example). If
       this has happened we have to restart the doubling, and we need to
       ask for at least the initialization cap. But we don't want to go
       over the ``base\_cap'', which has accounted for the fact that the job might
       be running down, in which case we can't use the full initialization cap.
\begin{verbatim}
   if ( currentResources == 0 ) {
       actual_cap = Math.max(1, resource_class.getInitializationCap());
       actual_cap = Math.min(base_cap, actual_cap);
   }
\end{verbatim}
   \end{enumerate}
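   Putting the steps together, {\em initJobCap()} has roughly the following shape. This is a paraphrase,
   not the actual source: it is presented as a bare method in the style of the fragments above, the
   helper names are taken from those fragments, and the predicate methods ({\em isRefused()},
   {\em isInitializing()}, and so on) are invented for illustration.
\begin{verbatim}
   // Condensed paraphrase of the initJobCap() logic described above.
   int initJobCap()
   {
       if ( isRefused() )   return 0;                    // step 1
       if ( isCompleted() ) return countNShares();       // step 2

       int c = n_remaining_questions / nthreads;         // step 3
       int currentResources = countNShares();
       c = Math.max(c, currentResources);                // step 4: don't shrink prematurely
       int base_cap = Math.min(getMaxShares(), c);       // step 5

       int projected_cap = getProjectedCap();            // step 6
       if ( projected_cap == 0 ) projected_cap = base_cap;

       int potential_cap = Math.max(projected_cap, currentResources); // step 7

       int actual_cap;
       if ( isInitializing() ) {                         // steps 8 and 9
           int icap = resource_class.getInitializationCap();
           actual_cap = (icap > 0) ? Math.min(potential_cap, icap) : potential_cap;
       } else if ( expandByDoubling() ) {                // step 10
           actual_cap = Math.min(potential_cap, currentResources * 2);
       } else {                                          // step 11
           actual_cap = potential_cap;
       }

       if ( currentResources == 0 ) {                    // step 12: restart doubling
           actual_cap = Math.max(1, resource_class.getInitializationCap());
           actual_cap = Math.min(base_cap, actual_cap);
       }
       return actual_cap;
   }
\end{verbatim}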
\subsection{shrinkBy(int count)}
   This is a rather trivial method, used to implement ``shrink by investment''. Originally
   this was a much more involved process, which was gradually refined to its current
   incarnation.

   All this method does is sort the RmJob's shares as described in the interlude above, ``Preemption'',
   and delete the indicated number of shares from the front of the sorted list.

   The original {\em shrinkByInvestment()} code has been left in place for reference.

\section{Supporting Classes}
   There are a number of supporting classes, mostly used for bookkeeping, mentioned here for completeness.

\subsection{Machine}
This represents a node. A Machine object is created when a node's state first arrives. The Machine
is entered into the appropriate nodepool. Machine objects are NEVER destroyed (except during dynamic
reconfiguration), as it is usually expected that an unresponsive machine will become responsive
again. This significantly reduces the complexity of the bookkeeping.

\subsubsection{Blacklists and Whitelists}
The Machine maintains a list of the {\em Share}s allocated to it. It is possible, after changing the
{\em ducc.classes} configuration and restarting RM, that it is no longer legal for these shares to be
allocated on this machine, or perhaps to be allocated at all. For example, the machine may have been
moved to a different class than the class of the work allocated on it, or the class may have been
deleted entirely.

   If this happens the shares are essentially in ``limbo''. They cannot (in general) be associated
   with any resource class and therefore cannot participate in allocations (recall, allocations are
   done by resource class). The space must nonetheless be accounted for, to avoid double-booking the nodes.

   To cope with this the RM considers both the {\em Shares} and the {\em Machine} they reside on
   to be ``blacklisted''. When a machine is ``blacklisted'',
   \begin{itemize}
     \item All work that can be evicted from it is evicted. This includes any kind of UIMA-AS
       job (including jobs submitted to non-preemptable classes) and Services.
     \item No additional allocations can be made to the machine until ALL blacklisted work
       has been confirmed by the Orchestrator to have left the system.
   \end{itemize}

   Once all blacklisted work on a machine has left the system, the machine is ``whitelisted'' and
   allocations on it are resumed.

\subsubsection{Database and Persistence}
   When any machine arrives in the system, a new record containing its
   essential data is entered in the database.

   All subsequent state changes for the machine are entered into the database, including the number
   of consecutive missed Agent heartbeats.

   When a share is assigned to a machine, or leaves a machine, it is the responsibility of the Machine object to
   record the share and its details in the database.

\subsection{Share}
   The Share object represents one full allocation. Internally it is an {\em nshare} and thus
   has a share order, where the {\em share order} is the number of {\em qshares} it represents. A
   share is logically exposed outside of RM as a Process.

   The Share's main purpose is bookkeeping: a place to store investment and initialization time, and
   to represent the space occupied by a resource allocation.

\subsection{ResourceClass}

   The ResourceClass represents the {\em class} concept as configured in {\em ducc.classes}. It
   holds the configured class rules (expand\_by\_doubling, initialization\_cap, etc.).

   Its primary purpose is bookkeeping: a place to organize jobs by class and jobs by user by class,
   to maintain the set of users authorized for the class, etc. It also tracks the non-preemptable
   share {\em allotment}.

   The {\em ResourceClass} is a schedulable {\em IEntity}, as described above in the description
   of the FAIR\_SHARE algorithm.

\subsection{User}

   The User represents a single user. Its primary purpose is bookkeeping: a place to organize
   the jobs owned by the user.

   The User is a schedulable {\em IEntity}, as described above in the description of the
   FAIR\_SHARE algorithm.

\subsection{JobManagerUpdate}

   This is a ``transfer object'' used to transfer the current schedule to the publication
   mechanism and ultimately to the Orchestrator. It consists of maps of all shares, organized
   by shares ``expanded'' and ``shrunken'' (preempted). The RM's publication mechanism
   translates this into the appropriate format, which then gets published to the Orchestrator.
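   As a rough sketch, the transfer object carries little more than a few maps. The field names and
   the {\em ShareRef} key type here are illustrative, not the actual declarations:
\begin{verbatim}
   // Schematic of the JobManagerUpdate transfer object; names illustrative.
   import java.util.HashMap;
   import java.util.Map;

   class JobManagerUpdateSketch
   {
       // All shares currently allocated, keyed by job id, then share id.
       Map<Long, Map<Long, ShareRef>> allShares      = new HashMap<>();
       // Shares newly allocated this cycle ("expanded").
       Map<Long, Map<Long, ShareRef>> expandedShares = new HashMap<>();
       // Shares preempted this cycle ("shrunken").
       Map<Long, Map<Long, ShareRef>> shrunkenShares = new HashMap<>();

       static class ShareRef {}
   }
\end{verbatim}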