[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

ASF GitHub Bot (Jira) Fri, 23 Oct 2020 00:24:48 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HDDS-4386:
---------------------------------
    Labels: pull-request-available  (was: )

> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4386
>                 URL: https://issues.apache.org/jira/browse/HDDS-4386
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Blocker
>              Labels: pull-request-available
>
> In the Tencent production environment, some time after starting Recon, we got 
> warnings that all DNs had become stale/dead on the SCM side. After killing 
> Recon, all DNs became healthy again within a very short time.
>  
> *The root cause is:*
> 1) The EndpointStateMachine for SCM and the one for Recon share the thread 
> pool created by DatanodeStateMachine, which is a fixed-size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
>     getEndPointTaskThreadPoolSize(),
>     new ThreadFactoryBuilder()
>         .setNameFormat("Datanode State Machine Task Thread - %d").build());
>
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   //  recon in future reconServerCount should be the real number of recon
>   int reconServerCount = 1;
>   int totalServerCount = reconServerCount;
>   try {
>     totalServerCount += HddsUtils.getSCMAddresses(conf).size();
>   } catch (Exception e) {
>     LOG.error("Fail to get scm addresses", e);
>   }
>   return totalServerCount;
> }
> {code}
> Meanwhile, the current Recon has performance issues: after running for hours, 
> it becomes slower and slower, and eventually crashes due to OOM. 
> 2) The communication between DN and Recon soon exhausts all the threads 
> in DatanodeStateMachine.executorService, leaving no threads available 
> for the DN to talk to SCM. 
> 3) All DNs become stale/dead on the SCM side.
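The starvation described in steps 1) to 3) can be reproduced outside Ozone with a small sketch. This is not Ozone code: the class name, the method name, and the latch standing in for a hung Recon call are all hypothetical. It only illustrates that tasks blocked on one endpoint can occupy every thread of a shared fixed-size pool, so a task for another endpoint stays in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class, not part of Ozone.
class SharedPoolStarvation {

  /**
   * Returns true if a simulated SCM heartbeat submitted to the shared pool
   * completes within waitMillis while simulated Recon calls are hanging.
   */
  static boolean scmHeartbeatCompletes(int poolSize, int hungReconTasks,
      long waitMillis) throws Exception {
    ExecutorService shared = Executors.newFixedThreadPool(poolSize);
    // Stands in for a Recon server that never answers during the test.
    CountDownLatch reconResponds = new CountDownLatch(1);
    try {
      for (int i = 0; i < hungReconTasks; i++) {
        shared.submit(() -> {
          try {
            reconResponds.await(); // simulated slow Recon call
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Simulated SCM heartbeat: trivial work, but it needs a free thread.
      Future<?> scmHeartbeat = shared.submit(() -> { });
      try {
        scmHeartbeat.get(waitMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false; // still queued behind the hung Recon tasks
      }
    } finally {
      reconResponds.countDown();
      shared.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Pool sized for one SCM plus one Recon, both threads stuck on Recon.
    System.out.println(scmHeartbeatCompletes(2, 2, 500));
    // One thread is still free, so the heartbeat gets through.
    System.out.println(scmHeartbeatCompletes(2, 1, 500));
  }
}
```

With a pool of two threads and two hung Recon tasks the heartbeat times out; with only one hung task it completes.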
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon, 
> so a slow Recon won't interfere with the communication between DN and SCM, 
> or vice versa.
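A minimal sketch of the isolation idea follows. It is not the actual patch: the class, the string endpoint keys, and the choice of a single-thread executor per endpoint are assumptions made for illustration. The point is only that a dedicated pool per endpoint keeps a hung Recon from consuming the thread that serves SCM.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class, not part of Ozone.
class PerEndpointPools {
  private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

  /** Lazily creates one dedicated executor per endpoint name. */
  ExecutorService poolFor(String endpoint) {
    return pools.computeIfAbsent(endpoint, name ->
        Executors.newSingleThreadExecutor(r -> {
          Thread t = new Thread(r, "Endpoint task thread - " + name);
          t.setDaemon(true);
          return t;
        }));
  }

  void shutdown() {
    pools.values().forEach(ExecutorService::shutdownNow);
  }

  /**
   * Returns true if a simulated SCM heartbeat completes within waitMillis
   * while the given number of simulated Recon calls are hanging.
   */
  static boolean scmHeartbeatCompletes(int hungReconTasks, long waitMillis)
      throws Exception {
    PerEndpointPools endpoints = new PerEndpointPools();
    CountDownLatch reconResponds = new CountDownLatch(1);
    try {
      for (int i = 0; i < hungReconTasks; i++) {
        endpoints.poolFor("recon").submit(() -> {
          try {
            reconResponds.await(); // simulated slow Recon call
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The SCM task runs on its own pool, untouched by the Recon backlog.
      Future<?> scmHeartbeat = endpoints.poolFor("scm").submit(() -> { });
      try {
        scmHeartbeat.get(waitMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false;
      }
    } finally {
      reconResponds.countDown();
      endpoints.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(scmHeartbeatCompletes(10, 500));
  }
}
```

Because the Recon backlog only queues up inside the Recon pool, the SCM heartbeat completes regardless of how many Recon tasks are stuck.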
>  
> *P.S.*
> The first version of DatanodeStateMachine.executorService was a cached thread 
> pool. With a slow SCM/Recon, more and more threads would be created, and the 
> DN would eventually OOM once tens of thousands of threads had been 
> created.
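The thread growth mentioned in the P.S. can be illustrated with plain java.util.concurrent. This sketch is hypothetical and not taken from Ozone. The two factory methods build pools with the same parameters that the JDK documents for Executors.newCachedThreadPool() and Executors.newFixedThreadPool(n), and a latch stands in for a slow server.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Ozone.
class CachedPoolGrowth {

  /** Same configuration as Executors.newCachedThreadPool(). */
  static ThreadPoolExecutor cachedPool() {
    return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
        60L, TimeUnit.SECONDS, new SynchronousQueue<>());
  }

  /** Same configuration as Executors.newFixedThreadPool(n). */
  static ThreadPoolExecutor fixedPool(int n) {
    return new ThreadPoolExecutor(n, n,
        0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
  }

  /**
   * Submits tasks that all block (a simulated slow SCM/Recon) and reports
   * how many threads the pool holds at that point. Shuts the pool down.
   */
  static int threadsAfterBlockedTasks(ThreadPoolExecutor pool,
      int blockedTasks) {
    CountDownLatch serverResponds = new CountDownLatch(1);
    try {
      for (int i = 0; i < blockedTasks; i++) {
        pool.execute(() -> {
          try {
            serverResponds.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      return pool.getPoolSize();
    } finally {
      serverResponds.countDown();
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Cached pool: one new thread per blocked task, with no upper bound.
    System.out.println(threadsAfterBlockedTasks(cachedPool(), 50));
    // Fixed pool: thread count stays at the configured size.
    System.out.println(threadsAfterBlockedTasks(fixedPool(2), 50));
  }
}
```

The cached pool never queues work, so every blocked task costs a thread; the fixed pool bounds the thread count but, when shared, leads to the starvation described in the root cause.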



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
