[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533518#comment-16533518
 ] 

Elek, Marton commented on HDDS-199:
-----------------------------------


Thanks a lot, Ajay, for the careful and thorough review. I fixed all the 
problems, with some comments (see below).

{quote}Move ReplicateContainerCommand, ReplicateCommandWatcher, 
ReplicationManager from org.apache.hadoop.hdds.scm.container to 
org.apache.hadoop.ozone.container.replication or 
org.apache.hadoop.hdds.container.replication{quote}

Done. Moved to org.apache.hadoop.hdds.scm.container.replication together with 
ReplicationQueue. (The existing classes under scm contain scm in the package 
name, so I kept it.)

{quote} Rename suggestion: ReplicateCommandWatcher to 
ReplicationCommandWatcher{quote}

Done.
 
{quote}ReplicateContainerCommand: L65-67: Probably move this to some Pb-util 
class. We might have to do this conversion in other places as well.{quote}

Happy to move it, but I don't know what would be a good place for that. I just 
used the same approach as all the other commands. IMHO, in case of a change, 
all the other commands should be changed as well.

{quote}ReplicateContainerCommand: L75-L79: Use of stream might be less 
efficient than traditional approach specially since list size is pretty 
small.{quote}

Done.
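For illustration, here is the kind of rewrite the comment asks for, as a 
minimal sketch (hypothetical names, not the actual ReplicateContainerCommand 
code): the same membership check written with a stream and with a plain loop, 
the latter avoiding the stream-pipeline setup overhead on tiny lists.

```java
import java.util.List;

// Illustrative only (not the patch's actual code): for a list of a few
// datanode hosts, a plain loop does the same work as a stream pipeline
// without allocating the stream machinery.
public class LoopVsStreamSketch {

  // Stream-based variant, as the patch originally had it.
  public static boolean containsStream(List<String> hosts, String host) {
    return hosts.stream().anyMatch(h -> h.equals(host));
  }

  // Loop-based variant, preferable for very small lists.
  public static boolean containsLoop(List<String> hosts, String host) {
    for (String h : hosts) {
      if (h.equals(host)) {
        return true;
      }
    }
    return false;
  }
}
```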

{quote}SCMCommonPolicy: Since we are not doing anything with excluded nodes for 
time being we should add a TODO comment and may be add an Jira to handle it 
later.{quote}

Thanks, it was an error: I had fixed only SCMContainerPlacementRandom.java and 
not SCMCommonPolicy.java. Instead of a TODO, it should now be handled.

{quote} ReplicationQueue:  L65: Update documentation for take as it will not 
return null anymore.{quote}

Done, thx.

{quote} ReplicationQueue: L37,L45,L55,L65,L69: We should synchronize 
peek/remove and add operation. Currently our ReplicationManager seems to be 
single threaded but that may change.{quote}

According to the javadoc, BlockingQueues are thread-safe: "implementations are 
thread-safe. All queuing methods achieve their effects atomically using 
internal locks or other forms of concurrency control". I think that is enough 
of a guarantee, but let me know if not.
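To illustrate the guarantee, a minimal sketch (not the actual ReplicationQueue 
code): individual add()/poll() calls on a BlockingQueue are atomic across 
threads, so concurrent producers need no external lock. Note the caveat: a 
compound sequence such as peek() followed by remove() would still need 
explicit synchronization.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

// Sketch only: several producer threads add() into one BlockingQueue
// without any external locking; no updates are lost because each
// queuing method is atomic on its own.
public class BlockingQueueSketch {

  public static int produceAndDrain(int producers, int itemsEach) {
    BlockingQueue<Integer> queue = new PriorityBlockingQueue<>();
    Thread[] threads = new Thread[producers];
    for (int p = 0; p < producers; p++) {
      threads[p] = new Thread(() -> {
        for (int i = 0; i < itemsEach; i++) {
          queue.add(i); // atomic, no lost updates
        }
      });
      threads[p].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    int drained = 0;
    while (queue.poll() != null) { // poll() returns null when empty
      drained++;
    }
    return drained;
  }
}
```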

{quote} ReplicationManager: L165: With HDDS-175 we will not get pipeline from 
containerInfo. {quote}

That's a very hard question. IMHO there is no easy way to get the current 
datanodes after HDDS-175, as there is no container -> datanode[] mapping for 
the closed containers. Do you know where this information is available after 
HDDS-175? (I rebased the patch but can't return with

{quote}ReplicationManager: rename suggestion; containerStateMap to 
containerStateMgr. To avoid any confusion for ContainerStateManager and 
ContainerStateMap{quote}

Done, thanks for catching it.

{quote}ReplicationManager: getUUID returns null{quote}

Done, same.

{quote} ScmConfigKeys: add default value for HDDS_SCM_WATCHER_TIMEOUT (i.e 
HDDS_SCM_WATCHER_TIMEOUT_DEFAULT){quote}

Done. (BTW, the API call in SCM was wrong because the unit was converted to 
minutes instead of milliseconds.)
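To show the pitfall, a sketch of the unit mix-up (names are illustrative, not 
the actual SCM code): a watcher timeout configured in minutes must be 
converted to milliseconds before use, otherwise the raw minute count is 
treated as a millisecond value and the timeout fires almost immediately.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: a 10-minute timeout read out in the wrong unit is
// off by a factor of 60,000.
public class WatcherTimeoutSketch {

  static final long TIMEOUT_MINUTES = 10;

  // Correct: convert the configured value to milliseconds before use.
  public static long timeoutMillis() {
    return TimeUnit.MINUTES.toMillis(TIMEOUT_MINUTES);
  }

  // Buggy variant: the raw minute count treated as if it were millis.
  public static long timeoutMillisBuggy() {
    return TIMEOUT_MINUTES; // watcher would fire after 10 ms, not 10 min
  }
}
```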

> Implement ReplicationManager to replicate ClosedContainers
> ----------------------------------------------------------
>
>                 Key: HDDS-199
>                 URL: https://issues.apache.org/jira/browse/HDDS-199
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: SCM
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>             Fix For: 0.2.1
>
>         Attachments: HDDS-199.001.patch, HDDS-199.002.patch
>
>
> HDDS/Ozone supports Open and Closed containers. Under specific 
> conditions (the container is full, or a node has failed) the container will 
> be closed and will be replicated in a different way. The replication of Open 
> containers is handled with Ratis and the PipelineManager.
> The ReplicationManager should handle the replication of the ClosedContainers. 
> The replication information will be sent as an event 
> (UnderReplicated/OverReplicated). 
> The ReplicationManager will collect all of the events in a priority queue 
> (to replicate first the containers where more replicas are missing), 
> calculate the destination datanode (first with a very simple algorithm, 
> later by calculating scatter-width) and send the Copy/Delete container 
> command to the datanode (CommandQueue).
> A CopyCommandWatcher/DeleteCommandWatcher is also included to retry the 
> copy/delete in case of failure. This is an in-memory structure (based on 
> HDDS-195) which can requeue the underreplicated/overreplicated events to the 
> priority queue until the confirmation of the copy/delete command arrives.
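The priority ordering described in the issue (replicate first the containers 
missing the most replicas) could be sketched like this; class and field names 
are hypothetical, not the actual HDDS-199 code:

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of the ordering described above: requests missing
// more replicas come out of the queue first.
public class ReplicationPrioritySketch {

  static class ReplicationRequest {
    final long containerId;
    final int expectedReplicas;
    final int currentReplicas;

    ReplicationRequest(long id, int expected, int current) {
      this.containerId = id;
      this.expectedReplicas = expected;
      this.currentReplicas = current;
    }

    int missingReplicas() {
      return expectedReplicas - currentReplicas;
    }
  }

  // Highest missing-replica count first (hence the reversed comparator).
  static final PriorityBlockingQueue<ReplicationRequest> QUEUE =
      new PriorityBlockingQueue<>(11,
          Comparator.comparingInt(ReplicationRequest::missingReplicas)
              .reversed());

  public static long mostUrgent() {
    return QUEUE.poll().containerId;
  }
}
```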


