GlenGeng opened a new pull request #1518: URL: https://github.com/apache/hadoop-ozone/pull/1518
## What changes were proposed in this pull request?

In the Tencent production environment, some time after Recon was started, we got warnings that all DNs had become stale/dead on the SCM side. After Recon was killed, all DNs became healthy again within a very short time.

**The root cause is:**

1. The EndpointStateMachine for SCM and the one for Recon share the thread pool created by DatanodeStateMachine, which is a fixed-size thread pool:

   ```
   executorService = Executors.newFixedThreadPool(
       getEndPointTaskThreadPoolSize(),
       new ThreadFactoryBuilder()
           .setNameFormat("Datanode State Machine Task Thread - %d").build());
   ```

   ```
   private int getEndPointTaskThreadPoolSize() {
     // TODO(runzhiwang): current only support one recon, if support multiple
     // recon in future reconServerCount should be the real number of recon
     int reconServerCount = 1;
     int totalServerCount = reconServerCount;

     try {
       totalServerCount += HddsUtils.getSCMAddresses(conf).size();
     } catch (Exception e) {
       LOG.error("Fail to get scm addresses", e);
     }

     return totalServerCount;
   }
   ```

   Meanwhile, the current Recon has some performance issues: after running for hours it became slower and slower, and eventually crashed due to OOM.

2. The communication between DN and Recon soon exhausts all the threads in `DatanodeStateMachine.executorService`, leaving no threads available for the DN to talk to SCM.
3. All DNs become stale/dead on the SCM side.

**The fix is quite straightforward:**

Each EndpointStateMachine uses its own thread pool to talk to SCM/Recon, so a slow Recon won't interfere with the communication between DN and SCM, or vice versa.

**P.S.**

The first version of `DatanodeStateMachine.executorService` was a cached thread pool. With a slow SCM/Recon, more and more threads would be created, and the DN would eventually OOM because tens of thousands of threads had been created.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4386

## How was this patch tested?

CI
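For illustration, the starvation described in the root cause, and the effect of giving each endpoint its own pool, can be reproduced in a small standalone simulation. This is a sketch only, not the code in this patch: the class `EndpointPoolSketch` and the method `scmHeartbeatCompletes` are hypothetical names, the pool size of 2 assumes one SCM plus one Recon, and a hung Recon call is modeled as a task blocked on a latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EndpointPoolSketch {

  /**
   * Returns true if a quick "SCM heartbeat" task finishes within one second
   * while two "Recon" tasks are hung.
   *
   * @param sharedPool true to model one fixed pool shared by both endpoints,
   *                   false to model one pool per endpoint.
   */
  public static boolean scmHeartbeatCompletes(boolean sharedPool)
      throws Exception {
    // Stands in for a Recon that never answers; released only during cleanup.
    CountDownLatch reconStuck = new CountDownLatch(1);

    ExecutorService scmPool;
    ExecutorService reconPool;
    if (sharedPool) {
      // Assumption: 1 SCM + 1 Recon, so the shared pool has 2 threads.
      scmPool = Executors.newFixedThreadPool(2);
      reconPool = scmPool;
    } else {
      // One dedicated thread per endpoint.
      scmPool = Executors.newSingleThreadExecutor();
      reconPool = Executors.newSingleThreadExecutor();
    }

    try {
      // Two hung Recon calls are enough to occupy every shared thread.
      for (int i = 0; i < 2; i++) {
        reconPool.submit(() -> {
          reconStuck.await();
          return null;
        });
      }

      // The SCM task itself is fast; it only needs a free thread.
      Future<String> heartbeat = scmPool.submit(() -> "heartbeat sent");
      try {
        heartbeat.get(1, TimeUnit.SECONDS);
        return true;
      } catch (TimeoutException e) {
        return false;
      }
    } finally {
      reconStuck.countDown();
      scmPool.shutdownNow();
      reconPool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("shared pool: " + scmHeartbeatCompletes(true));
    System.out.println("per-endpoint pools: " + scmHeartbeatCompletes(false));
  }
}
```

In the shared case the heartbeat task sits in the queue behind the hung Recon tasks and times out; with a pool per endpoint, the Recon backlog stays in Recon's own queue and the heartbeat runs immediately.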