[ 
https://issues.apache.org/jira/browse/HDDS-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838920#comment-16838920
 ] 

Hrishikesh Gadre commented on HDDS-1201:
----------------------------------------

h2. *_Technical Description: Details on the technical approach planned_*

High-level workflow
 * The DataNode will compute the list of corrupted containers as a background 
activity.
 * This list of corrupted container ids will be shared with SCM as part of the 
next heartbeat message.
 * The SCM will process this list and mark the corresponding replica as 
corrupted.
 * Currently the state of the container replicas is stored in memory only (and 
not persisted to disk). This feature does not change that model. That means if 
the SCM crashes and restarts, it will lose its knowledge of corrupted 
containers, and that knowledge will need to be rebuilt over a period of time.
 * The SCM will provide metrics about the corrupted container replicas via the 
JMX API.
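
The workflow above can be sketched roughly as follows. This is a simplified illustration only, not the actual Ozone code: the class names (CorruptionReportSketch, Heartbeat, ScmReplicaState) and their methods are hypothetical and chosen just to show the heartbeat payload and the in-memory bookkeeping on the SCM side.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the workflow; not the actual Ozone classes.
class CorruptionReportSketch {

  /** Hypothetical heartbeat payload: the sender and the container ids it found corrupted. */
  static final class Heartbeat {
    final String datanodeId;
    final List<Long> corruptedContainerIds;

    Heartbeat(String datanodeId, List<Long> corruptedContainerIds) {
      this.datanodeId = datanodeId;
      this.corruptedContainerIds = corruptedContainerIds;
    }
  }

  /** Hypothetical SCM-side state, held in memory only (nothing is persisted). */
  static final class ScmReplicaState {
    // container id -> datanodes that reported a corrupted replica of it
    private final Map<Long, Set<String>> corruptedReplicas = new HashMap<>();

    /** Marks every replica listed in the heartbeat as corrupted. */
    void processHeartbeat(Heartbeat hb) {
      for (long containerId : hb.corruptedContainerIds) {
        corruptedReplicas
            .computeIfAbsent(containerId, k -> new HashSet<>())
            .add(hb.datanodeId);
      }
    }

    boolean isCorrupted(long containerId, String datanodeId) {
      Set<String> nodes = corruptedReplicas.get(containerId);
      return nodes != null && nodes.contains(datanodeId);
    }

    /** Aggregated count, the kind of value that could be exposed as a metric. */
    long corruptedReplicaCount() {
      long total = 0;
      for (Set<String> nodes : corruptedReplicas.values()) {
        total += nodes.size();
      }
      return total;
    }
  }
}
```

Because the state lives only in the map, a freshly constructed ScmReplicaState (the analogue of an SCM restart) starts empty and is repopulated as later heartbeats arrive.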

 

Out-of-scope work items
 * Ability to take corrective action (e.g. schedule container replication) when 
a corrupted replica is reported.

 

*Leverage Incremental Container Report functionality*

DataNode changes:
 * The data scrubbing framework in the DataNode should mark a corrupted 
container as unhealthy and send the ICR as part of that step.
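
A minimal sketch of the DataNode side, assuming a scrubber that walks the containers, runs some integrity check, and reports a state change immediately. The names here (ScrubberSketch, IntegrityCheck, IcrSender) are hypothetical placeholders, not the real DataNode interfaces.

```java
import java.util.List;

// Hypothetical, simplified model of the scrubber step; not the actual Ozone classes.
class ScrubberSketch {

  enum State { CLOSED, UNHEALTHY }

  static final class Container {
    final long id;
    State state = State.CLOSED;

    Container(long id) {
      this.id = id;
    }
  }

  /** Hypothetical integrity check, for example a checksum verification. */
  interface IntegrityCheck {
    boolean isIntact(Container container);
  }

  /** Hypothetical hook standing in for the incremental container report (ICR) path. */
  interface IcrSender {
    void sendIcr(Container container);
  }

  /**
   * Marks each container that fails the check as UNHEALTHY and sends an ICR
   * for it in the same step. Containers already marked UNHEALTHY are skipped,
   * so repeated scrubber runs do not resend the report.
   */
  static void scrub(List<Container> containers, IntegrityCheck check, IcrSender sender) {
    for (Container container : containers) {
      if (container.state != State.UNHEALTHY && !check.isIntact(container)) {
        container.state = State.UNHEALTHY;
        sender.sendIcr(container);
      }
    }
  }
}
```

Tying the report to the state transition is the design choice this sketch illustrates: the SCM learns about the corruption without waiting for the next full container report.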

SCM changes:
 * SCM should filter out the unhealthy replicas when a client requests the 
replicas of a given container.
 * Add an API in SCMMXBean to get an aggregated count of corrupted container 
replicas (along with the concrete implementation in StorageContainerManager).
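
The two SCM changes could look roughly like the sketch below. Everything in it is an assumption made for illustration: ScmReplicaSketch, ReplicaManager, CorruptedReplicaStats and getCorruptedReplicaCount are hypothetical names, and the small interface merely stands in for the method that would be added to SCMMXBean. JMX registration is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the SCM changes; not the actual Ozone classes.
class ScmReplicaSketch {

  enum ReplicaState { CLOSED, UNHEALTHY }

  static final class Replica {
    final long containerId;
    final String datanodeId;
    final ReplicaState state;

    Replica(long containerId, String datanodeId, ReplicaState state) {
      this.containerId = containerId;
      this.datanodeId = datanodeId;
      this.state = state;
    }
  }

  /** Hypothetical stand-in for the method that would be added to SCMMXBean. */
  interface CorruptedReplicaStats {
    long getCorruptedReplicaCount();
  }

  static final class ReplicaManager implements CorruptedReplicaStats {
    private final List<Replica> replicas = new ArrayList<>();

    void add(Replica replica) {
      replicas.add(replica);
    }

    /** Returns the replicas of a container, leaving out the unhealthy ones. */
    List<Replica> getHealthyReplicas(long containerId) {
      return replicas.stream()
          .filter(r -> r.containerId == containerId)
          .filter(r -> r.state != ReplicaState.UNHEALTHY)
          .collect(Collectors.toList());
    }

    /** Aggregated count of unhealthy replicas across all containers. */
    @Override
    public long getCorruptedReplicaCount() {
      return replicas.stream()
          .filter(r -> r.state == ReplicaState.UNHEALTHY)
          .count();
    }
  }
}
```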

> Reporting Corruptions in Containers to SCM
> ------------------------------------------
>
>                 Key: HDDS-1201
>                 URL: https://issues.apache.org/jira/browse/HDDS-1201
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: Ozone Datanode, SCM
>            Reporter: Supratim Deka
>            Assignee: Hrishikesh Gadre
>            Priority: Major
>
> Add protocol message and handling to report container corruptions to the SCM.
> Also add basic recovery handling in SCM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

