[ 
https://issues.apache.org/jira/browse/HDDS-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4599:
----------------------------
    Description: 
ReplicationManager keeps its in-flight replication and deletion state only in 
memory; this state is not replicated through Ratis. After a failover, the new 
leader therefore has no record of the commands the old leader already sent. So, 
in theory, we could run into data loss and over-replication issues if we start 
ReplicationManager immediately after a failover.

HDDS-4589 is a quick fix for the potential data loss issue, but we still need a 
thorough solution for both in-flight add and in-flight delete.

We have two proposals from [~sodonnell]:
 # Have the DNs include a list of pending_delete blocks in their container 
report / heartbeat, so that SCM can use it.
 # If a DN detects a new master SCM or a restarted SCM, it purges its pending 
delete list and waits for new instructions from the new/restarted SCM.

Filing this Jira to track the problem.
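
The two proposals can be sketched roughly as below. This is only an illustration under stated assumptions, not Ozone code: the class names, the numeric "SCM epoch" (assumed to change on every failover or SCM restart), and the use of plain long IDs for the items pending deletion are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of both proposals; none of these types exist in Ozone. */
public class InflightDeleteSketch {

  /** Datanode side. */
  public static class Datanode {
    public final String id;
    private final Set<Long> pendingDeletes = new HashSet<>();
    // Assumed: a value that changes whenever the leader SCM changes or restarts.
    private long knownScmEpoch = -1;

    public Datanode(String id) {
      this.id = id;
    }

    public void queueDelete(long itemId) {
      pendingDeletes.add(itemId);
    }

    /** Proposal 1: the pending delete list travels with the report / heartbeat. */
    public Set<Long> reportPendingDeletes() {
      return new HashSet<>(pendingDeletes);
    }

    /**
     * Proposal 2: purge queued deletes when a different SCM epoch is seen.
     * Returns true if the list was purged.
     */
    public boolean onScmEpoch(long epoch) {
      boolean changed = knownScmEpoch != -1 && epoch != knownScmEpoch;
      knownScmEpoch = epoch;
      if (changed) {
        pendingDeletes.clear();
      }
      return changed;
    }
  }

  /** SCM side: in-flight deletes rebuilt from datanode reports (proposal 1). */
  public static class Scm {
    // itemId -> datanodes that still report a pending delete for it
    private final Map<Long, Set<String>> inflightDeletes = new HashMap<>();

    public void processReport(String datanodeId, Set<Long> pending) {
      // Drop what this datanode reported earlier, then record the new list.
      for (Set<String> dns : inflightDeletes.values()) {
        dns.remove(datanodeId);
      }
      inflightDeletes.values().removeIf(Set::isEmpty);
      for (long itemId : pending) {
        inflightDeletes.computeIfAbsent(itemId, k -> new HashSet<>()).add(datanodeId);
      }
    }

    public int inflightDeleteCount(long itemId) {
      Set<String> dns = inflightDeletes.get(itemId);
      return dns == null ? 0 : dns.size();
    }
  }
}
```

With proposal 1, a new leader SCM could wait for one round of reports before acting, so that deletes already queued on the DNs are counted. With proposal 2, the new leader can start from an empty in-flight delete set, because the DNs drop anything the old leader asked for.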


> Handle inflight delete/add actions in ReplicationManager properly.
> ------------------------------------------------------------------
>
>                 Key: HDDS-4599
>                 URL: https://issues.apache.org/jira/browse/HDDS-4599
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM HA
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major


