[ 
https://issues.apache.org/jira/browse/HDDS-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aravindan Vijayan updated HDDS-4539:
------------------------------------
    Description: 
On a cluster with millions of containers or hundreds of Datanodes, it takes 
some time for Recon to reach a steady state (all active DNs and Containers 
reported). If the container health task runs before this, it can incorrectly 
flag most of the containers as missing. This was seen in a cluster where Recon 
was slow to reach steady state due to HDDS-4403, and it also leads to the UI 
problem mentioned in HDDS-4402. 


We need to make sure the container health task does not run before the cluster 
has reached steady state. This could be done with a fixed wait time 
(~10 minutes) or by checking Recon's SCM state.

  was:On a cluster with millions of containers or hundreds of Datanodes, it 
will take some time for Recon to reach a steady state (all active DNs and 
Containers reported). If the container health task is run before this, it can 
incorrectly flag most of the containers as missing. This also leads to an other 
problem mentioned in HDDS-4402. We need to make sure the container health task 
is not run before cluster has reached steady state. This could be a fixed wait 
time (~10mins) or by checking Recon's SCM state.


> Container Health Task should not run until Recon has reached steady state.
> --------------------------------------------------------------------------
>
>                 Key: HDDS-4539
>                 URL: https://issues.apache.org/jira/browse/HDDS-4539
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Recon
>            Reporter: Aravindan Vijayan
>            Priority: Major
>
> On a cluster with millions of containers or hundreds of Datanodes, it takes 
> some time for Recon to reach a steady state (all active DNs and Containers 
> reported). If the container health task runs before this, it can incorrectly 
> flag most of the containers as missing. This was seen in a cluster where 
> Recon was slow to reach steady state due to HDDS-4403, and it also leads to 
> the UI problem mentioned in HDDS-4402. 
> We need to make sure the container health task does not run before the 
> cluster has reached steady state. This could be done with a fixed wait time 
> (~10 minutes) or by checking Recon's SCM state.
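
The two options in the description (a fixed wait, or a check of Recon's SCM 
state) could be combined into one gate that the task consults before each run. 
The following is only a sketch under assumptions: SteadyStateGate, canRun and 
the allReported check are hypothetical names and not part of the Recon code 
base, and letting either condition open the gate is one possible policy, not 
the agreed fix.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical gate for the container health task; not actual Recon code. */
final class SteadyStateGate {
  private final long startMillis;
  private final long minWaitMillis;
  private final LongSupplier clock;          // current time in milliseconds
  private final BooleanSupplier allReported; // assumed "SCM state" check

  SteadyStateGate(Duration minWait, LongSupplier clock,
      BooleanSupplier allReported) {
    this.clock = clock;
    this.startMillis = clock.getAsLong();
    this.minWaitMillis = minWait.toMillis();
    this.allReported = allReported;
  }

  /**
   * Returns true once the fixed wait has elapsed since construction, or as
   * soon as the state check reports that all DNs and containers are in.
   */
  boolean canRun() {
    boolean waitedLongEnough =
        clock.getAsLong() - startMillis >= minWaitMillis;
    return waitedLongEnough || allReported.getAsBoolean();
  }
}
```

The task would skip its run while canRun() returns false. In a real 
deployment the clock would be System::currentTimeMillis; it is injected here 
only so the behaviour can be exercised without waiting.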



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
