[ https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDDS-4386:
---------------------------------
    Labels: pull-request-available  (was: )

> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4386
>                 URL: https://issues.apache.org/jira/browse/HDDS-4386
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Blocker
>              Labels: pull-request-available
>
> In the Tencent production environment, some time after Recon was started, we got
> warnings that all DNs had become stale/dead on the SCM side. After Recon was
> killed, all DNs became healthy again within a very short time.
>
> *The root cause is:*
> 1) The EndpointStateMachine for SCM and the one for Recon share the thread pool
> created by DatanodeStateMachine, which is a fixed-size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
>     getEndPointTaskThreadPoolSize(),
>     new ThreadFactoryBuilder()
>         .setNameFormat("Datanode State Machine Task Thread - %d").build());
>
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   // recon in future reconServerCount should be the real number of recon
>   int reconServerCount = 1;
>   int totalServerCount = reconServerCount;
>   try {
>     totalServerCount += HddsUtils.getSCMAddresses(conf).size();
>   } catch (Exception e) {
>     LOG.error("Fail to get scm addresses", e);
>   }
>   return totalServerCount;
> }
> {code}
> Meanwhile, the current Recon has a performance issue: after running for hours,
> it becomes slower and slower and eventually crashes due to OOM.
> 2) Communication between the DN and Recon soon exhausts all the threads
> in DatanodeStateMachine.executorService, leaving no threads available
> for the DN to talk to SCM.
> 3) All DNs become stale/dead on the SCM side.
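
The starvation described in steps 1) to 3) can be reproduced in isolation. The following is a minimal sketch, not code from Ozone: the class name `SharedPoolStarvationDemo`, the latch that stands in for a hung Recon, and the two-round loop are all assumptions made for illustration. It only models the pool sizing (one SCM plus one Recon gives two threads) and shows that a task meant for SCM stays queued once both threads are stuck on Recon.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it does not exist in the Ozone code base.
class SharedPoolStarvationDemo {

  /** Returns true if the simulated SCM heartbeat finishes within the timeout. */
  static boolean scmHeartbeatCompletes(long timeoutMillis) {
    // Two threads in total: one SCM plus one Recon, mirroring the sizing
    // logic of getEndPointTaskThreadPoolSize() quoted above.
    ExecutorService shared = Executors.newFixedThreadPool(2);
    // Stands in for a Recon that never answers (assumption for the demo).
    CountDownLatch reconHung = new CountDownLatch(1);
    try {
      // Two rounds of Recon tasks occupy both worker threads.
      for (int round = 0; round < 2; round++) {
        shared.submit(() -> {
          reconHung.await();
          return null;
        });
      }
      // The SCM task is queued behind them and cannot start.
      Future<String> scm = shared.submit(() -> "scm-heartbeat-ok");
      try {
        scm.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false;
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      }
    } finally {
      // Release the blocked workers and stop the pool.
      reconHung.countDown();
      shared.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("SCM heartbeat completed: " + scmHeartbeatCompletes(200));
  }
}
```

In this model the SCM task times out no matter how healthy SCM is, because the only two worker threads are held by the hung Recon tasks.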
>
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon, so a
> slow Recon won't interfere with the communication between the DN and SCM, or
> vice versa.
>
> *P.S.*
> The first version of DatanodeStateMachine.executorService was a cached thread
> pool. With a slow SCM/Recon, more and more threads would be created, and the
> DN would eventually OOM once tens of thousands of threads had been created.
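
The isolation idea behind the fix can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `PerEndpointPoolDemo`, the nested `Endpoint` class (a stand-in for EndpointStateMachine), the thread name, and the choice of a single-thread executor per endpoint are all hypothetical. The point is only that a hung Recon blocks its own executor and leaves the SCM executor free.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it does not exist in the Ozone code base.
class PerEndpointPoolDemo {

  /** Stand-in for EndpointStateMachine: each instance owns a private executor. */
  static final class Endpoint implements AutoCloseable {
    private final ExecutorService executor;

    Endpoint(String name) {
      // One bounded thread per endpoint (assumption), so a slow peer can
      // neither starve other endpoints nor grow the thread count unboundedly.
      executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "endpoint-task-" + name);
        t.setDaemon(true);
        return t;
      });
    }

    <T> Future<T> submit(Callable<T> task) {
      return executor.submit(task);
    }

    @Override
    public void close() {
      executor.shutdownNow();
    }
  }

  /** Returns true if the simulated SCM heartbeat finishes within the timeout. */
  static boolean scmHeartbeatCompletes(long timeoutMillis) {
    // Stands in for a Recon that never answers (assumption for the demo).
    CountDownLatch reconHung = new CountDownLatch(1);
    try (Endpoint scm = new Endpoint("scm");
         Endpoint recon = new Endpoint("recon")) {
      // Hung Recon tasks pile up on Recon's own executor only.
      for (int round = 0; round < 2; round++) {
        recon.submit(() -> {
          reconHung.await();
          return null;
        });
      }
      // SCM has its own thread, so this task runs immediately.
      Future<String> heartbeat = scm.submit(() -> "scm-heartbeat-ok");
      try {
        heartbeat.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false;
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      }
    } finally {
      reconHung.countDown();
    }
  }

  public static void main(String[] args) {
    System.out.println("SCM heartbeat completed: " + scmHeartbeatCompletes(2000));
  }
}
```

Keeping each per-endpoint executor bounded also avoids the failure mode from the P.S.: unlike a cached thread pool, a slow peer cannot cause the number of threads to grow without limit.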