[GitHub] flink pull request #2540: [FLINK-4606] [cluster management] Integrate the ne...

beyond1920 Tue, 27 Sep 2016 20:02:49 -0700

Github user beyond1920 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2540#discussion_r80836325
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java ---
    @@ -324,6 +337,158 @@ public void handleError(final Exception exception) {
                shutDown();
        }
     
    +   /**
    +    * Registers an infoMessage listener
    +    *
    +    * @param infoMessageListenerAddress address of infoMessage listener to register to this resource manager
    +    */
    +   @RpcMethod
    +   public void registerInfoMessageListener(final String infoMessageListenerAddress) {
    +           if(infoMessageListeners.containsKey(infoMessageListenerAddress)) {
    +                   log.warn("Receive a duplicate registration from info message listener on ({})", infoMessageListenerAddress);
    +           } else {
    +                   Future<InfoMessageListenerRpcGateway> infoMessageListenerRpcGatewayFuture = getRpcService().connect(infoMessageListenerAddress, InfoMessageListenerRpcGateway.class);
    +
    +                   infoMessageListenerRpcGatewayFuture.thenAcceptAsync(new AcceptFunction<InfoMessageListenerRpcGateway>() {
    +                           @Override
    +                           public void accept(InfoMessageListenerRpcGateway gateway) {
    +                                   log.info("Receive a registration from info message listener on ({})", infoMessageListenerAddress);
    +                                   infoMessageListeners.put(infoMessageListenerAddress, gateway);
    +                           }
    +                   }, getMainThreadExecutor());
    +
    +                   infoMessageListenerRpcGatewayFuture.exceptionallyAsync(new ApplyFunction<Throwable, Void>() {
    +                           @Override
    +                           public Void apply(Throwable failure) {
    +                                   log.warn("Receive a registration from unreachable info message listener on ({})", infoMessageListenerAddress);
    +                                   return null;
    +                           }
    +                   }, getMainThreadExecutor());
    +           }
    +   }
    +
    +   /**
    +    * Unregisters an infoMessage listener
    +    *
    +    * @param infoMessageListenerAddress address of infoMessage listener to unregister from this resource manager
    +    *
    +    */
    +   @RpcMethod
    +   public void unRegisterInfoMessageListener(final String infoMessageListenerAddress) {
    +           infoMessageListeners.remove(infoMessageListenerAddress);
    +   }
    +
    +   /**
    +    * Shutdowns cluster
    +    *
    +    * @param finalStatus
    +    * @param optionalDiagnostics
    +    */
    +   @RpcMethod
    +   public void shutDownCluster(final ApplicationStatus finalStatus, final String optionalDiagnostics) {
    +           log.info("shut down cluster because application is in {}, diagnostics {}", finalStatus, optionalDiagnostics);
    +           shutDownApplication(finalStatus, optionalDiagnostics);
    +   }
    +
    +   /**
    +    * This method should be called by the framework once it detects that a currently registered task executor has failed.
    +    *
    +    * @param resourceID Id of the worker that has failed.
    +    * @param message An informational message that explains why the worker failed.
    +    */
    +   public void notifyWorkerFailed(final ResourceID resourceID, String message) {
    +           runAsync(new Runnable() {
    +                   @Override
    +                   public void run() {
    +                           WorkerType worker = taskExecutorGateways.remove(resourceID);
    +                           if (worker != null) {
    +                                   // TODO :: suggest failed task executor to stop itself
    +                                   slotManager.notifyTaskManagerFailure(resourceID);
    +                           }
    +                   }
    +           });
    +   }
    +
    +   /**
    +    * Gets the number of currently started TaskManagers.
    +    *
    +    * @return The number of currently started TaskManagers.
    +    */
    +   public int getNumberOfStartedTaskManagers() {
    +           return taskExecutorGateways.size();
    +   }
    +
    +   /**
    +    * Notifies the resource manager of a fatal error.
    +    *
    +    * <p><b>IMPORTANT:</b> This should not cleanly shut down this master, but exit it in
    +    * such a way that a high-availability setting would restart this or fail over
    +    * to another master.
    +    */
    +   public void onFatalError(final String message, final Throwable error) {
    +           runAsync(new Runnable() {
    +                   @Override
    +                   public void run() {
    +                           fatalError(message, error);
    +                   }
    +           });
    +   }
    +
    +   // ------------------------------------------------------------------------
    +   //  Framework specific behavior
    +   // ------------------------------------------------------------------------
    +
    +   /**
    +    * Initializes the framework specific components.
    +    *
    +    * @throws Exception Exceptions during initialization cause the resource manager to fail.
    +    */
    +   protected abstract void initialize() throws Exception;
    +
    +   /**
    +    * Callback when a task executor register.
    +    *
    +    * @param resourceID The worker resource id
    +    * @param taskExecutorGateway the task executor gateway
    +    */
    +   protected abstract WorkerType workerStarted(ResourceID resourceID, TaskExecutorGateway taskExecutorGateway);
    --- End diff --
    
    @mxm, I omitted these methods for the following reasons:
    
    1. We may need a `requestNewWorkers` method. However, a `numWorkers` parameter alone is not enough to allocate a given number of new workers; the expected ResourceProfile of each worker also needs to be passed in.
    
    2. We may need a `releaseStartedWorker` method. It is used to release started taskExecutors when the ResourceManager receives a RemoveResource request, but I could not find any place that sends this request, so I omitted it.
    
    3. We don't need the `reacceptRegisteredWorkers` method. It was used to consolidate the taskExecutor view between the ResourceManager and the JobManager when the ResourceManager reconnects to the JobManager after a ResourceManager restart. In the new cluster management mode, the JobManager doesn't keep a view of live taskExecutors; the ResourceManager is responsible for receiving taskExecutors' registrations and maintaining the taskExecutor view.
    
    4. We don't need the `releasePendingWorker` method. It was only used to release pending requests when the ResourceManager consolidates the taskExecutor view with the JobManager after a ResourceManager restart. As noted in point 3, this process is not needed in the new cluster management mode.
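
    To illustrate point 1, here is a minimal sketch of what a profile-aware `requestNewWorkers` hook might look like. It is not part of this pull request: the `Profile` class is a hypothetical stand-in for Flink's ResourceProfile (its fields are assumptions), and `SketchResourceManager` and `RecordingResourceManager` are invented purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RequestNewWorkersSketch {

    /** Hypothetical stand-in for ResourceProfile; the fields are assumptions. */
    public static final class Profile {
        public final double cpuCores;
        public final long memoryInMB;

        public Profile(double cpuCores, long memoryInMB) {
            this.cpuCores = cpuCores;
            this.memoryInMB = memoryInMB;
        }
    }

    /** Hypothetical framework-specific hook that takes a count and a per-worker profile. */
    public abstract static class SketchResourceManager {
        public abstract void requestNewWorkers(int numWorkers, Profile profile);
    }

    /** Toy implementation that only records one pending request per worker. */
    public static final class RecordingResourceManager extends SketchResourceManager {
        public final List<Profile> pending = new ArrayList<>();

        @Override
        public void requestNewWorkers(int numWorkers, Profile profile) {
            for (int i = 0; i < numWorkers; i++) {
                pending.add(profile);
            }
        }
    }

    public static void main(String[] args) {
        RecordingResourceManager rm = new RecordingResourceManager();
        // Two small workers and one large worker: a bare count could not express this.
        rm.requestNewWorkers(2, new Profile(1.0, 1024));
        rm.requestNewWorkers(1, new Profile(4.0, 8192));
        System.out.println(rm.pending.size() + " pending worker requests");
    }
}
```

    Passing the profile next to the count lets a framework-specific implementation size each container individually, which a bare `numWorkers` argument cannot express.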



