GlenGeng opened a new pull request #1518:
URL: https://github.com/apache/hadoop-ozone/pull/1518


   ## What changes were proposed in this pull request?
   
   In a Tencent production environment, some time after Recon was started, SCM 
began warning that all DNs had become stale/dead. After Recon was killed, all 
DNs became healthy again within a very short time.
   
   **The root cause is:**
   1. The EndpointStateMachine for SCM and the one for Recon share the thread 
pool created by DatanodeStateMachine, which is a fixed-size thread pool:
   ```
   executorService = Executors.newFixedThreadPool(
       getEndPointTaskThreadPoolSize(),
       new ThreadFactoryBuilder()
           .setNameFormat("Datanode State Machine Task Thread - %d").build());
   ```
   
   ```
   private int getEndPointTaskThreadPoolSize() {
     // TODO(runzhiwang): current only support one recon, if support multiple
     //  recon in future reconServerCount should be the real number of recon
     int reconServerCount = 1;
     int totalServerCount = reconServerCount;
   
     try {
       totalServerCount += HddsUtils.getSCMAddresses(conf).size();
     } catch (Exception e) {
       LOG.error("Fail to get scm addresses", e);
     }
   
     return totalServerCount;
   }
   ```
   Meanwhile, the current Recon has performance issues: after running for 
hours, it became slower and slower and eventually crashed due to OOM. 
   2. The communication between DN and Recon soon exhausts all the threads 
in DatanodeStateMachine.executorService, leaving no threads available for the 
DN to talk to SCM. 
   3. All DNs become stale/dead at the SCM side.
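
The starvation described in the steps above can be reproduced outside Ozone with a minimal sketch. It assumes a pool of two threads (one SCM plus one Recon, matching what `getEndPointTaskThreadPoolSize()` would return for a single SCM) and models a slow Recon as tasks that block on a latch. The class and method names are hypothetical and are not part of the Ozone code base.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it only models the shared-pool behaviour.
class SharedPoolStarvationDemo {

  /**
   * Returns true if a quick "SCM heartbeat" task finishes within timeoutMs
   * while slowReconTasks blocked "Recon" tasks sit in the same fixed pool.
   */
  static boolean scmHeartbeatCompletes(int poolSize, int slowReconTasks,
      long timeoutMs) {
    ExecutorService shared = Executors.newFixedThreadPool(poolSize);
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Each slow Recon task pins one pool thread until the latch opens.
      for (int i = 0; i < slowReconTasks; i++) {
        shared.submit(() -> {
          release.await();
          return null;
        });
      }
      // The heartbeat can only run if a pool thread is still free.
      Future<String> heartbeat = shared.submit(() -> "heartbeat sent");
      heartbeat.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false; // the heartbeat is stuck in the queue behind Recon tasks
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
      shared.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Two blocked Recon tasks occupy both threads: the heartbeat times out.
    System.out.println(scmHeartbeatCompletes(2, 2, 200));
    // Only one blocked Recon task: one thread is left for the heartbeat.
    System.out.println(scmHeartbeatCompletes(2, 1, 2000));
  }
}
```

In this model the SCM heartbeat is never rejected; it simply waits in the pool's queue, which is consistent with SCM seeing the DNs go stale rather than seeing errors.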
    
   **The fix is quite straightforward:**
   Each EndpointStateMachine uses its own thread pool to talk to SCM/Recon, so 
a slow Recon won't interfere with the communication between DN and SCM, or vice 
versa.
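
As an illustration of the isolation this gives, here is a minimal sketch that keeps one single-thread executor per endpoint. It is not the code in this patch: `PerEndpointExecutors`, the endpoint keys `"scm"` and `"recon"`, and the thread-name format are assumptions made for the example, and a plain `ThreadFactory` lambda stands in for Guava's `ThreadFactoryBuilder`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical holder class: one executor per endpoint instead of a shared pool.
class PerEndpointExecutors {
  private final Map<String, ExecutorService> executors = new LinkedHashMap<>();

  PerEndpointExecutors(List<String> endpoints) {
    for (String endpoint : endpoints) {
      // Assumed naming scheme; daemon threads so the demo JVM can exit.
      executors.put(endpoint, Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "EndpointStateMachine task thread for " + endpoint);
        t.setDaemon(true);
        return t;
      }));
    }
  }

  <T> Future<T> submit(String endpoint, Callable<T> task) {
    return executors.get(endpoint).submit(task);
  }

  void shutdownNow() {
    executors.values().forEach(ExecutorService::shutdownNow);
  }

  /** True if the SCM task completes while Recon tasks are all blocked. */
  static boolean scmHeartbeatCompletes(int slowReconTasks, long timeoutMs) {
    PerEndpointExecutors pools =
        new PerEndpointExecutors(Arrays.asList("scm", "recon"));
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Blocked Recon tasks pile up only on the Recon executor.
      for (int i = 0; i < slowReconTasks; i++) {
        pools.submit("recon", () -> {
          release.await();
          return null;
        });
      }
      Future<String> heartbeat = pools.submit("scm", () -> "heartbeat sent");
      heartbeat.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
      pools.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(scmHeartbeatCompletes(10, 2000));
  }
}
```

Because each endpoint's executor is still bounded, a slow endpoint in this sketch queues work on its own executor instead of spawning more threads or taking threads from the other endpoint.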
    
   **P.S.**
   The first version of `DatanodeStateMachine.executorService` was a cached 
thread pool. With a slow SCM/Recon, more and more threads would be created, 
and the DN would eventually OOM because tens of thousands of threads had been 
created.
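
The difference between the two pool types can be seen in a small sketch. It builds `ThreadPoolExecutor` instances directly, with the configurations that the JDK documents for `Executors.newCachedThreadPool()` and `Executors.newFixedThreadPool(n)`, so that `getPoolSize()` is available. The class and method names are hypothetical, and the task count of 50 is an arbitrary stand-in for the much larger numbers mentioned above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class comparing thread growth under blocked tasks.
class CachedVsFixedPoolDemo {

  /** Submits `tasks` blocked tasks and reports how many threads the pool holds. */
  private static int threadsAfterBlockedTasks(ThreadPoolExecutor pool, int tasks) {
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          try {
            release.await(); // models a call to a slow SCM/Recon
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      return pool.getPoolSize();
    } finally {
      release.countDown();
      pool.shutdownNow();
    }
  }

  /** Same configuration as Executors.newCachedThreadPool(). */
  static int threadsWithCachedPool(int tasks) {
    return threadsAfterBlockedTasks(new ThreadPoolExecutor(
        0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
        new SynchronousQueue<>()), tasks);
  }

  /** Same configuration as Executors.newFixedThreadPool(size). */
  static int threadsWithFixedPool(int size, int tasks) {
    return threadsAfterBlockedTasks(new ThreadPoolExecutor(
        size, size, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>()), tasks);
  }

  public static void main(String[] args) {
    System.out.println("cached pool threads: " + threadsWithCachedPool(50));
    System.out.println("fixed pool threads:  " + threadsWithFixedPool(2, 50));
  }
}
```

The cached configuration creates one thread per blocked task, while the fixed configuration stays at its configured size and queues the rest.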
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-4386
   
   
   ## How was this patch tested?
   
   CI
   
   

