[ 
https://issues.apache.org/jira/browse/HDFS-11146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195296#comment-16195296
 ] 

Daryn Sharp commented on HDFS-11146:
------------------------------------

I think it looks ok, but I need to think through a few use cases. I was 
originally thinking about this from a rolling upgrade (RU) perspective, since 
we already force FBRs to accelerate clearing staleness after restarting the 
DN.  That's safe.

The problem is a non-RU failover might not be safe.  The stale check prevents 
data loss when DNs have queued invalidations, a failover occurs, and the new 
active NN issues its own invalidations to different DNs.  Best case, the block 
becomes highly under-replicated and is corrected.  Worst case, the NN deletes 
all replicas...
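
As a rough illustration of the stale check described above, here is a toy 
model in Java. It is not the actual NameNode code: the class, field, and 
method names are hypothetical, and it assumes a storage counts as stale from 
failover until its first FBR arrives.

```java
import java.util.*;

/** Toy model of the stale check; not Hadoop code, all names are hypothetical. */
public class StaleCheckSketch {

    /** Assumption: a storage is stale after failover until its first FBR. */
    static final class Storage {
        final String id;
        boolean stale = true;
        Storage(String id) { this.id = id; }
    }

    /**
     * Returns the ids of storages the new active NN may invalidate to get
     * back down to `replication` replicas. If any holder is still stale,
     * deletion is postponed and an empty list is returned, because that
     * holder may already have a queued invalidation from the old active.
     */
    static List<String> chooseExcess(List<Storage> holders, int replication) {
        for (Storage s : holders) {
            if (s.stale) {
                return Collections.emptyList(); // postpone until FBR arrives
            }
        }
        List<String> excess = new ArrayList<>();
        for (int i = replication; i < holders.size(); i++) {
            excess.add(holders.get(i).id);
        }
        return excess;
    }
}
```

With four holders and a replication factor of three, this sketch deletes 
nothing while any holder is stale, and picks one excess replica only after 
all four have reported. Clearing staleness earlier is only safe if the report 
that clears it already reflects the invalidations queued on the DN.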

Kihwal thinks the DN might remove the replica from its map when queueing the 
invalidation.  If so, that might solve the race with the FBR that clears the 
staleness lagging the pending invalidations.  Another option may be to flush 
the async invalidation queue when a new active is detected via heartbeat 
response.  At any rate, we need to ensure there's some mechanism to prevent 
aggressive de-stalination (I just created and own that term) from jeopardizing 
durability. 
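
The "flush on new active" option could be sketched roughly as follows. This 
is a toy model of a DN, not the real DataNode code: the class and method 
names are hypothetical, the NN is identified by a plain string, and the async 
deletion queue is modeled as a simple in-memory deque.

```java
import java.util.*;

/** Toy DN model; not Hadoop code, all names are hypothetical. */
public class InvalidationFlushSketch {
    private final Set<Long> replicaMap = new HashSet<>();
    private final Deque<Long> pendingDeletes = new ArrayDeque<>();
    private String activeNn;

    InvalidationFlushSketch(String activeNn, long... blockIds) {
        this.activeNn = activeNn;
        for (long id : blockIds) {
            replicaMap.add(id);
        }
    }

    /** The NN asked for a delete; it is only queued for async processing. */
    void queueInvalidation(long blockId) {
        pendingDeletes.add(blockId);
    }

    /** Drain the queue so the replica map reflects every requested delete. */
    void flush() {
        Long id;
        while ((id = pendingDeletes.poll()) != null) {
            replicaMap.remove(id);
        }
    }

    /** Assumption: the heartbeat response reveals which NN is active. */
    void onHeartbeatResponse(String reportedActive) {
        if (!reportedActive.equals(activeNn)) {
            flush(); // finish the old active's deletes before reporting
            activeNn = reportedActive;
        }
    }

    /** What a full block report would contain right now. */
    Set<Long> fullBlockReport() {
        return new TreeSet<>(replicaMap);
    }
}
```

In this model, a block with a queued invalidation still appears in the block 
report until the queue is drained, which is the race in question. Flushing 
when the active changes means the next FBR sent to the new active no longer 
lists replicas that are about to disappear.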

> Excess replicas will not be deleted until all storages's FBR received after 
> failover
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-11146
>                 URL: https://issues.apache.org/jira/browse/HDFS-11146
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: HDFS-11146-002.patch, HDFS-11146-003.patch, 
> HDFS-11146-004.patch, HDFS-11146-005.patch, HDFS-11146.patch
>
>
> Excess replicas will not be deleted until FBRs from all storages are 
> received after failover.
> I think the following solution can help.
>  *Solution:* 
> After failover, since the DNs are aware of the failover, they can send 
> another block report (FBR) irrespective of the interval. Maybe some shuffle 
> can be done, similar to the initial delay.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
