[ https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533518#comment-16533518 ]
Elek, Marton commented on HDDS-199:
-----------------------------------

Thanks a lot, Ajay, for the careful and thorough review. I fixed all the problems, with some comments (see below).

{quote}Move ReplicateContainerCommand, ReplicateCommandWatcher, ReplicationManager from org.apache.hadoop.hdds.scm.container to org.apache.hadoop.ozone.container.replication or org.apache.hadoop.hdds.container.replication{quote}

Done. Moved to org.apache.hadoop.hdds.scm.container.replication together with ReplicationQueue (the existing classes under scm contain scm in the package name, so I kept it).

{quote}Rename suggestion: ReplicateCommandWatcher to ReplicationCommandWatcher{quote}

Done.

{quote}ReplicateContainerCommand: L65-67: Probably move this to some Pb-util class. We might have to do this conversion in other places as well.{quote}

Happy to move it, but I don't know what would be a good place for that. I just used the same approach that is used by all the other commands. IMHO, if we change it, all the other commands should be changed as well.

{quote}ReplicateContainerCommand: L75-L79: Use of stream might be less efficient than traditional approach specially since list size is pretty small.{quote}

Done (see the loop sketch below).

{quote}SCMCommonPolicy: Since we are not doing anything with excluded nodes for time being we should add a TODO comment and may be add an Jira to handle it later.{quote}

Thanks, it was an error: I had fixed only SCMContainerPlacementRandom.java and not SCMCommonPolicy.java. Instead of a TODO, the excluded nodes are now handled.

{quote}ReplicationQueue: L65: Update documentation for take as it will not return null anymore.{quote}

Done, thanks.

{quote}ReplicationQueue: L37,L45,L55,L65,L69: We should synchronize peek/remove and add operation. Currently our ReplicationManager seems to be single threaded but that may change.{quote}

According to the javadoc, BlockingQueues are thread-safe: "implementations are thread-safe. All queuing methods achieve their effects atomically using internal locks or other forms of concurrency control". I think that's enough of a guarantee, but let me know if not (see the queue sketch below).

{quote}ReplicationManager: L165: With HDDS-175 we will not get pipeline from containerInfo.{quote}

That's a very hard question. IMHO there is no easy way to get the current datanodes after HDDS-175, as there is no container -> datanode[] mapping for the closed containers. Do you know where this information is available after HDDS-175? (I rebased the patch but can't return with

{quote}ReplicationManager: rename suggestion; containerStateMap to containerStateMgr. To avoid any confusion for ContainerStateManager and ContainerStateMap{quote}

Done, thanks for catching it.

{quote}ReplicationManager: getUUID returns null{quote}

Done, same.

{quote}ScmConfigKeys: add default value for HDDS_SCM_WATCHER_TIMEOUT (i.e HDDS_SCM_WATCHER_TIMEOUT_DEFAULT){quote}

Done. (BTW, the API call in SCM was wrong because the unit was converted to minutes instead of milliseconds; see the getTimeDuration sketch below.)
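To illustrate the stream-to-loop change: a minimal, self-contained sketch of the idea, with placeholder data instead of the actual datanode list from the patch:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LoopVsStream {

  public static void main(String[] args) {
    // Placeholder data; the real code maps a short list of datanodes.
    List<String> datanodeUuids = Arrays.asList("dn-1", "dn-2", "dn-3");

    // Stream version: builds a whole pipeline for a ~3 element list.
    // List<String> upper = datanodeUuids.stream()
    //     .map(String::toUpperCase)
    //     .collect(Collectors.toList());

    // Plain loop: same result without the pipeline/collector overhead.
    List<String> upper = new ArrayList<>(datanodeUuids.size());
    for (String uuid : datanodeUuids) {
      upper.add(uuid.toUpperCase());
    }
    System.out.println(upper); // [DN-1, DN-2, DN-3]
  }
}
{code}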
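To illustrate the thread-safety argument: a minimal sketch (names are illustrative, not the actual HDDS-199 classes) of a queue that delegates to a BlockingQueue, where each individual add/take is atomic:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Minimal sketch of the thread-safety argument; names are illustrative,
 * not the actual HDDS-199 classes.
 */
public class ReplicationQueueSketch {

  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

  /** Atomic per the BlockingQueue contract; no extra locking needed. */
  public void add(String containerName) {
    queue.add(containerName);
  }

  /** Blocks until an element is available, so it never returns null. */
  public String take() throws InterruptedException {
    return queue.take();
  }

  // Caveat: only the individual calls are atomic. A compound
  // check-then-act sequence (e.g. peek() followed by remove()) is not,
  // and would need its own synchronization.
}
{code}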
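On the unit bug: Configuration.getTimeDuration converts the configured duration to whatever TimeUnit the caller passes, so asking for MINUTES where the caller expects milliseconds silently shrinks the value. A sketch of the fix; the key name follows the ScmConfigKeys pattern, and the default value here is only illustrative:

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class WatcherTimeoutExample {

  static final String HDDS_SCM_WATCHER_TIMEOUT = "hdds.scm.watcher.timeout";
  // Illustrative default (10 minutes in ms), not the committed value.
  static final long HDDS_SCM_WATCHER_TIMEOUT_DEFAULT = 10 * 60 * 1000L;

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set(HDDS_SCM_WATCHER_TIMEOUT, "10m");

    // Wrong: the result comes back converted to minutes (10),
    // but the caller treats it as milliseconds.
    long wrong = conf.getTimeDuration(HDDS_SCM_WATCHER_TIMEOUT, 10,
        TimeUnit.MINUTES);

    // Right: request the duration in the unit the caller actually uses.
    long timeoutMs = conf.getTimeDuration(HDDS_SCM_WATCHER_TIMEOUT,
        HDDS_SCM_WATCHER_TIMEOUT_DEFAULT, TimeUnit.MILLISECONDS);

    System.out.println(wrong + " min vs " + timeoutMs + " ms"); // 10 vs 600000
  }
}
{code}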
> Implement ReplicationManager to replicate ClosedContainers
> ----------------------------------------------------------
>
>                 Key: HDDS-199
>                 URL: https://issues.apache.org/jira/browse/HDDS-199
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: SCM
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>             Fix For: 0.2.1
>
>         Attachments: HDDS-199.001.patch, HDDS-199.002.patch
>
>
> HDDS/Ozone supports Open and Closed containers. In case of specific conditions (the container is full, or a node has failed) the container will be closed and will be replicated in a different way. The replication of Open containers is handled with Ratis and the PipelineManager.
> The ReplicationManager should handle the replication of the ClosedContainers. The replication information will be sent as an event (UnderReplicated/OverReplicated).
> The ReplicationManager will collect all of the events in a priority queue (to replicate first the containers where more replicas are missing), calculate the destination datanode (first with a very simple algorithm, later by calculating scatter-width) and send the Copy/Delete container command to the datanode (CommandQueue).
> A CopyCommandWatcher/DeleteCommandWatcher is also included to retry the copy/delete in case of failure. This is an in-memory structure (based on HDDS-195) which can requeue the UnderReplicated/OverReplicated events to the priority queue until the confirmation of the copy/delete command arrives.
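For context on the priority-queue design described in the issue above, a minimal sketch (all names are illustrative, not the committed classes) of ordering replication requests so that the containers missing the most replicas are handled first:

{code:java}
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

/** Sketch of the priority ordering described above; names are illustrative. */
public class ReplicationRequestQueueSketch {

  static class ReplicationRequest {
    final long containerId;
    final int expectedReplicaCount;
    final int actualReplicaCount;

    ReplicationRequest(long containerId, int expected, int actual) {
      this.containerId = containerId;
      this.expectedReplicaCount = expected;
      this.actualReplicaCount = actual;
    }

    /** More missing replicas == more urgent. */
    int replicationDeficit() {
      return expectedReplicaCount - actualReplicaCount;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Largest deficit first, hence the reversed comparator.
    PriorityBlockingQueue<ReplicationRequest> queue =
        new PriorityBlockingQueue<>(16,
            Comparator.comparingInt(ReplicationRequest::replicationDeficit)
                .reversed());

    queue.add(new ReplicationRequest(1L, 3, 2)); // one replica missing
    queue.add(new ReplicationRequest(2L, 3, 1)); // two replicas missing

    // Container 2 is taken first: it is missing more replicas.
    System.out.println(queue.take().containerId); // 2
  }
}
{code}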