[ https://issues.apache.org/jira/browse/SLIDER-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012184#comment-15012184 ]
MENG DING commented on SLIDER-938:
----------------------------------

I apologize for the very late follow-up; I got sidetracked onto some other tasks. I have done a prototype of the container resize feature in Slider. The following is my proposal, up for discussion:

h3. Overall Strategy

Provide a Java API and a CLI to allow changing the resources of a running container.

To [~ste...@apache.org]: I don't quite understand option #2 you proposed. Do you mean a manual update of resources.json, followed by a slider update? If so, that cannot change the resources of a specific container that is still running, can it?

h3. User Interface

h5. CLI

Two options to consider for the CLI:
# Extend the existing *slider flex*: {{slider flex <application> --id <id> --memory <memory> --vcores <vcores>}}
# Add a new CLI command: {{slider resize-container <application> --id <id> --memory <memory> --vcores <vcores>}}

The {{id}} option is mandatory. At least one resource dimension must be specified (i.e., {{memory}} and/or {{vcores}}). Any dimension that is not specified remains unchanged. If {{vcores}} is specified, the underlying YARN cluster must have *DominantResourceCalculator* configured.

For example, {{slider resize-container memcached --id container_e17_1447786590303_0002_01_000002 --memory 3072}} instructs Slider to change the memory allocation of container container_e17_1447786590303_0002_01_000002 of the memcached application to 3072, while keeping the number of vcores of that container unchanged.

Option 2 is preferred IMO, as it is cleaner and less error-prone. If we agree on this option, the API and protocol changes below will all be based on it. If not, I will update the design once we reach agreement on another option.

h5. API

{code:title=org.apache.slider.client.SliderClientAPI}
/**
 * Change the resource allocation of a specific running container in the cluster.
 * @param name cluster name
 * @param args arguments
 * @return exit code
 * @throws YarnException
 * @throws IOException
 */
int actionResizeContainer(String name, ActionResizeContainerArgs args)
    throws YarnException, IOException;
{code}

h3. Protocol Changes between Slider client and Slider AppMaster

{code:title=SliderClusterProtocol.proto}
service SliderClusterProtocolPB {
  ...
  /**
   * change resource of a container
   */
  rpc resizeContainer(ResizeContainerRequestProto)
    returns(ResizeContainerResponseProto);
  ...
}
{code}

{code:title=SliderClusterMessages.proto}
...
message ResourceProto {
  optional int32 memory = 1;
  optional int32 virtual_cores = 2;
}

message ResizeContainerRequestProto {
  optional string id = 1;
  optional ResourceProto resource = 2;
}

message ResizeContainerResponseProto {
  optional bool success = 1;
}
...
{code}

Unlike YARN, Slider commits the code auto-generated from protobuf into git. Once the protocol change is approved, I will compile the code once with the {{-Pcompile-protobuf}} option and generate the following files to be committed:
{{slider-core/src/main/java/org/apache/slider/api/proto/Messages.java}}
{{slider-core/src/main/java/org/apache/slider/api/proto/SliderClusterAPI.java}}

h3. Slider AppMaster Changes

# Currently {{SliderAppMaster}} directly implements both the {{AMRMClientAsync.CallbackHandler}} and {{NMClientAsync.CallbackHandler}} interfaces, which have been deprecated in YARN-1509 and YARN-1510 and replaced by two abstract classes: {{AMRMClientAsync.AbstractCallbackHandler}} and {{NMClientAsync.AbstractCallbackHandler}}.
Since we cannot extend multiple classes, we need to create nested classes inside {{SliderAppMaster}} to handle callbacks, similar to the implementation in the {{DistributedShell}} example in YARN:
{code}
SliderAppMaster.RMCallbackHandler extends AMRMClientAsync.AbstractCallbackHandler
SliderAppMaster.NMCallbackHandler extends NMClientAsync.AbstractCallbackHandler
{code}
This means, of course, that the code must be compiled against Hadoop 2.8+.
# The new {{AMRMClientAsync.requestContainerResourceChange}} and {{NMClientAsync.increaseContainerResource}} APIs require that the application master keep track of all allocated containers and pass the containers as parameters when calling these functions. The good thing is that Slider already tracks these containers in the {{RoleInstance.container}} object.
# The high-level calling sequence:
* Sending request:
{code}
SliderIPCService.resizeContainer
  --> queue(ActionRequestContainerResize)
  --> ...
  --> SliderAppMaster.requestContainerResourceChange
  --> AsyncRMOperationHandler.requestContainerResourceChange
  --> AMRMClientAsync.requestContainerResourceChange
{code}
* Processing response:
{code}
SliderAppMaster.RMCallbackHandler.onContainersResourceChanged
  --> AppState.onContainersResourceChanged
  --> queue(ActionIncreaseContainerResource)
  --> SliderAppMaster.increaseContainerResource
  --> NMClientAsync.increaseContainerResource
{code}

h3. Other Considerations

# Unlike flex, container resize information cannot be persisted for application restart. The current Slider framework only persists instance location through RoleHistory. It would be rather complicated to persist memory allocation information for the restart of *each* role instance. This can be left for a future enhancement if needed.
# REST API support:
* Show real-time container resource allocation information: Currently, the {{slideram/stats}} endpoint already provides a link to each individual container, which shows the effective memory allocation of that container. Note: this is different from the {{Role Options:yarn.memory}} information displayed in the {{slideram/stats}} endpoint, which only shows the static value specified in resources.json.
* Container resize through REST API: This will be supported at a later time, when SLIDER-151 is completed.

> Add ability to resize containers (Hadoop 2.8+)
> ----------------------------------------------
>
>                 Key: SLIDER-938
>                 URL: https://issues.apache.org/jira/browse/SLIDER-938
>             Project: Slider
>          Issue Type: New Feature
>          Components: appmaster
>            Reporter: Steve Loughran
>            Assignee: MENG DING
>              Labels: hadoop-2.8
>
> Hadoop 2.8 will add container resize in YARN-1197: support that for dynamic
> container resize

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
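As an illustration of the argument rules proposed for {{resize-container}} in the comment above (mandatory {{id}}, at least one of {{memory}} and {{vcores}}, any unspecified dimension left unchanged), the checks could look roughly like the sketch below. The class {{ResizeRequest}} and its method {{targetResource}} are hypothetical names used only for this example; they are not part of Slider or YARN, and the prototype may handle this differently.

```java
import java.util.OptionalInt;

/**
 * Hypothetical helper (not a Slider or YARN class): models the argument
 * rules proposed for "slider resize-container".
 */
final class ResizeRequest {
    final String containerId;
    final OptionalInt memory;  // empty means --memory was not given
    final OptionalInt vcores;  // empty means --vcores was not given

    ResizeRequest(String containerId, OptionalInt memory, OptionalInt vcores) {
        // Rule 1: the id option is mandatory.
        if (containerId == null || containerId.isEmpty()) {
            throw new IllegalArgumentException("--id is mandatory");
        }
        // Rule 2: at least one resource dimension must be specified.
        if (!memory.isPresent() && !vcores.isPresent()) {
            throw new IllegalArgumentException("specify --memory and/or --vcores");
        }
        this.containerId = containerId;
        this.memory = memory;
        this.vcores = vcores;
    }

    /**
     * Rule 3: a dimension that was not specified keeps its current value.
     * Returns {memory, vcores} for the resized container.
     */
    int[] targetResource(int currentMemory, int currentVcores) {
        return new int[] {
            memory.orElse(currentMemory),
            vcores.orElse(currentVcores)
        };
    }
}
```

With the example from the proposal, a request carrying only {{--memory 3072}} for a container that currently has 2 vcores would yield a target of 3072 memory and 2 vcores.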