[jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers

2018-07-13 Thread Ajay Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543511#comment-16543511
 ] 

Ajay Kumar edited comment on HDDS-199 at 7/13/18 7:47 PM:
--

[~elek] I think the replication completion event will be published by the 
CommandStatsHandler ([HDDS-256]).


was (Author: ajayydv):
[~elek] I think this jira has a dependency on [HDDS-256]

> Implement ReplicationManager to replicate ClosedContainers
> --
>
> Key: HDDS-199
> URL: https://issues.apache.org/jira/browse/HDDS-199
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: SCM
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
> Fix For: 0.2.1
>
> Attachments: HDDS-199.001.patch, HDDS-199.002.patch, 
> HDDS-199.003.patch, HDDS-199.004.patch, HDDS-199.005.patch, 
> HDDS-199.006.patch, HDDS-199.007.patch, HDDS-199.008.patch, 
> HDDS-199.009.patch, HDDS-199.010.patch
>
>
> HDDS/Ozone supports Open and Closed containers. Under specific 
> conditions (the container is full, or a node has failed) a container will be 
> closed and replicated in a different way. The replication of Open containers 
> is handled with Ratis and the PipelineManager.
> The ReplicationManager should handle the replication of the ClosedContainers. 
> The replication information will be sent as an event 
> (UnderReplicated/OverReplicated). 
> The ReplicationManager will collect all of the events in a priority queue 
> (to replicate first the containers with the most missing replicas), calculate 
> the destination datanode (first with a very simple algorithm, later by 
> calculating scatter-width) and send the Copy/Delete container command to the 
> datanode (CommandQueue).
> A CopyCommandWatcher/DeleteCommandWatcher is also included to retry the 
> copy/delete in case of failure. This is an in-memory structure (based on 
> HDDS-195) which can requeue the under-replicated/over-replicated events to the 
> priority queue until the confirmation of the copy/delete command arrives.
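
The flow described above, as a minimal sketch; all class and method names below are 
simplified stand-ins for illustration, not the actual HDDS APIs:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.PriorityBlockingQueue;

/**
 * Minimal sketch of the replication loop described above. All types are
 * simplified stand-ins, not the real HDDS classes.
 */
public class ReplicationManagerSketch implements Runnable {

  /** Request to fix an under-replicated closed container. */
  static class ReplicationRequest implements Comparable<ReplicationRequest> {
    final long containerId;
    final int missingReplicas;

    ReplicationRequest(long containerId, int missingReplicas) {
      this.containerId = containerId;
      this.missingReplicas = missingReplicas;
    }

    @Override
    public int compareTo(ReplicationRequest other) {
      // Containers with more missing replicas are handled first.
      return Integer.compare(other.missingReplicas, missingReplicas);
    }
  }

  interface PlacementPolicy {
    List<String> chooseDatanodes(List<String> excludedNodes, int required);
  }

  interface CommandSender {
    void sendCopyCommand(String datanode, long containerId, UUID trackingId);
  }

  interface CommandWatcher {
    /** Re-queues the request if no completion event arrives in time. */
    void watch(UUID trackingId, ReplicationRequest request);
  }

  private final PriorityBlockingQueue<ReplicationRequest> queue =
      new PriorityBlockingQueue<>();
  private final PlacementPolicy placement;
  private final CommandSender commands;
  private final CommandWatcher watcher;

  public ReplicationManagerSketch(PlacementPolicy placement,
      CommandSender commands, CommandWatcher watcher) {
    this.placement = placement;
    this.commands = commands;
    this.watcher = watcher;
  }

  /** Called when an UnderReplicated event arrives. */
  public void onUnderReplicated(ReplicationRequest request) {
    queue.add(request);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        ReplicationRequest request = queue.take();
        List<String> targets = placement.chooseDatanodes(
            Collections.emptyList(), request.missingReplicas);
        for (String datanode : targets) {
          // One tracking id per datanode command (see the discussion below).
          UUID trackingId = UUID.randomUUID();
          commands.sendCopyCommand(datanode, request.containerId, trackingId);
          watcher.watch(trackingId, request);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } catch (RuntimeException e) {
        // Log and keep the loop alive for the next event.
        e.printStackTrace();
      }
    }
  }
}
{code}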






[jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers

2018-07-13 Thread Ajay Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543511#comment-16543511
 ] 

Ajay Kumar edited comment on HDDS-199 at 7/13/18 5:50 PM:
--

[~elek] I think this jira has a dependency on [HDDS-256]


was (Author: ajayydv):
[~elek] I think this jira has a dependency on [HDDS-234]




[jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers

2018-07-09 Thread Elek, Marton (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536780#comment-16536780
 ] 

Elek, Marton edited comment on HDDS-199 at 7/9/18 10:54 AM:


Thanks [~ajayydv] for the additional comments.

1. I started to refactor it to use an ExecutorService after your comment, but it 
became more complex. An ExecutorService is good for handling multiple 
smaller tasks (executorService.submit), but in our case we have one 
long-running thread with only one task. I think it's clearer to use just a 
thread.
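
For illustration only, the single-thread approach could look roughly like this (a 
sketch, not the actual patch):

{code:java}
/** Sketch: one long-running worker thread instead of an ExecutorService. */
public class ReplicationMonitorSketch {

  private final Runnable replicationLoop; // e.g. the ReplicationManager loop
  private Thread monitorThread;

  public ReplicationMonitorSketch(Runnable replicationLoop) {
    this.replicationLoop = replicationLoop;
  }

  public synchronized void start() {
    monitorThread = new Thread(replicationLoop, "ReplicationMonitor");
    monitorThread.setDaemon(true);
    monitorThread.start();
  }

  public synchronized void stop() {
    if (monitorThread != null) {
      monitorThread.interrupt(); // the loop is expected to exit on interrupt
    }
  }
}
{code}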

2. By default the ReplicationManager receives events only for closed containers. 
But you are right, it's better to check it. I added a precondition check for 
the state of the container (as there is a try/catch block inside 
the main loop, the error will be printed out and the loop will continue).
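
Such a check could look roughly like the following (a sketch using Guava's 
Preconditions; the types here are simplified stand-ins):

{code:java}
import com.google.common.base.Preconditions;

/** Sketch of the closed-state precondition; types are simplified stand-ins. */
class ClosedContainerCheckSketch {

  enum LifeCycleState { OPEN, CLOSING, CLOSED }

  static void checkClosed(long containerId, LifeCycleState state) {
    // Throws IllegalStateException, which the try/catch in the main loop
    // logs before continuing with the next event.
    Preconditions.checkState(state == LifeCycleState.CLOSED,
        "Container %s is not in CLOSED state", containerId);
  }
}
{code}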

3. SCMCommonPolicy unit tests: To be honest, I also considered modifying the 
unit test. The only problem is that there are no unit tests for the policies. 
There is a higher-level test (TestContainerPlacement) which checks the 
distribution of the containers. But you are right, and your comment convinced 
me. I created two brand new unit tests for the two placement implementations, 
which include the check of the exclude list.

4. Other nits are fixed. Except the UUID: we can't use the UUID of the original 
replication request, as there is a one-to-many relationship between the original 
replication event and the new tracking events: if multiple replicas are 
missing, we create multiple DatanodeCommands and we need to track them 
one by one. Therefore we need different UUIDs. But thanks for pointing it out: in 
that case we don't need the getUUID in the original ReplicationRequest event, 
as it could not be used.
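
The one-to-many tracking could be sketched like this (illustrative names only, not 
the actual classes):

{code:java}
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

/** Sketch: one tracking id per datanode command, not per replication request. */
class CommandTrackingSketch {

  static class TrackedCommand {
    final UUID trackingId;
    final String datanode;
    final long containerId;

    TrackedCommand(UUID trackingId, String datanode, long containerId) {
      this.trackingId = trackingId;
      this.datanode = datanode;
      this.containerId = containerId;
    }
  }

  /** If three replicas are missing, three commands with three distinct ids. */
  static List<TrackedCommand> track(long containerId, List<String> targets) {
    return targets.stream()
        .map(dn -> new TrackedCommand(UUID.randomUUID(), dn, containerId))
        .collect(Collectors.toList());
  }
}
{code}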

The latest patch has been uploaded with all these fixes + new unit tests.


was (Author: elek):
Thanks [~ajayydv] for the additional comments.

1. I started to refactor it to use an ExecutorService after your comment, but it 
became more complex. An ExecutorService is good for handling multiple 
smaller tasks (executorService.submit), but in our case we have one 
long-running thread with only one task. I think it's clearer to use just a 
thread.

2. By default the ReplicationManager receives events only for closed containers. 
But you are right, it's better to check it. I added a precondition check for 
the state of the container (as there is a try/catch block inside 
the main loop, the error will be printed out and the loop will continue).

3. SCMCommonPolicy unit tests: To be honest, I also considered modifying the 
unit test. The only problem is that there are no unit tests for the policies. 
There is a higher-level test (TestContainerPlacement) which checks the 
distribution of the containers. But you are right, and your comment convinced 
me. I created two brand new unit tests for the two placement implementations, 
which include the check of the exclude list.

4. Other nits are fixed. Except the UUID: we can't use the UUID of the original 
replication request, as there is a one-to-many relationship between the original 
replication event and the new tracking events: if multiple replicas are 
missing, we create multiple DatanodeCommands and we need to track them 
one by one. Therefore we need different UUIDs. But thanks for pointing it out: in 
that case we don't need the getUUID in the original ReplicationRequest event, 
as it could not be used.

The latest patch has been uploaded with all these fixes + new unit tests.


[jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers

2018-07-06 Thread Ajay Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535207#comment-16535207
 ] 

Ajay Kumar edited comment on HDDS-199 at 7/6/18 6:34 PM:
-

[~elek] thanks for updating the patch. On a second look at ReplicationManager, I 
thought of having an ExecutorPool inside it whose size is configuration driven 
(instead of it being a runnable thread). Its default size may be 1, but it would 
give us the flexibility to dial it up if required. Not sure if this is overkill, 
as a single thread might be sufficient to handle all replica-related work even in 
a busy, big cluster. Any thoughts on this?
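
Roughly along these lines (a sketch; the config key name here is made up for 
illustration):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

/** Sketch of a configuration-driven replication pool; the key name is hypothetical. */
class ReplicationPoolSketch {

  static ExecutorService createPool(Configuration conf) {
    // Default of 1 keeps today's single-threaded behaviour, but the size
    // can be dialed up on busy clusters.
    int poolSize = conf.getInt("ozone.scm.replication.thread.count", 1);
    return Executors.newFixedThreadPool(poolSize);
  }
}
{code}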
{quote}That's a very hard question. IMHO there is no easy way to get the 
current datanodes after HDDS-175, as there is no container -> datanode[] 
mapping for the closed containers. Do you know where this information available 
after HDDS-175? (I rebased the patch but can't return with{quote}
[HDDS-228] should give us the means to find out the replicas of a given 
container.  [~anu], [~nandakumar131] We might have to check that we are not 
adding any replication request for Ratis/open containers. This can be done 
either by the ContainerReportHandler or the ReplicationManager.
{quote} fixed only the SCMContainerPlacementRandom.java and not the 
SCMCommonPolicy.java. Instead of todo, now it should be handled.{quote}
Shall we add a test case to validate that excluded nodes are not returned?
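
Such a test could look roughly like this (a JUnit sketch; the policy below is a 
stand-in, a real test would target the actual SCM placement implementations):

{code:java}
import static org.junit.Assert.assertTrue;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import org.junit.Test;

/** Sketch of an exclude-list test; the placement policy is a stand-in. */
public class TestExcludeListSketch {

  interface PlacementPolicy {
    List<String> chooseDatanodes(List<String> excludedNodes, int nodesRequired);
  }

  // Stand-in policy over a fixed "cluster"; a real test would use the SCM policies.
  private final PlacementPolicy policy = (excluded, required) ->
      Arrays.asList("dn-1", "dn-2", "dn-3", "dn-4").stream()
          .filter(dn -> !excluded.contains(dn))
          .limit(required)
          .collect(Collectors.toList());

  @Test
  public void excludedNodesAreNeverReturned() {
    List<String> excluded = Collections.singletonList("dn-1");
    List<String> chosen = policy.chooseDatanodes(excluded, 3);
    assertTrue("Excluded node must not be selected",
        Collections.disjoint(chosen, excluded));
  }
}
{code}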

Few more nits:
 * ReplicationManager
 ** L81: pipelineSelector can be removed.
 ** L200: The ReplicationRequestToRepeat constructor takes a UUID as a parameter; 
can't we use the ReplicationRequest UUID? (i.e. we can remove the extra parameter 
and field and have an API to return ReplicationRequest#getUUID)
 ** Add javadoc for class ReplicationRequestToRepeat.



was (Author: ajayydv):
[~elek] thanks for updating the patch. On a second look at ReplicationManager, I 
thought of having an ExecutorPool inside it whose size is configuration driven 
(instead of it being a runnable thread). Its default size may be 1, but it would 
give us the flexibility to dial it up if required. Not sure if this is overkill, 
as a single thread might be sufficient to handle all replica-related work even in 
a busy, big cluster. Any thoughts on this?

Few more nits:
 * ReplicationManager
 ** L81: pipelineSelector can be removed.
 ** L200: The ReplicationRequestToRepeat constructor takes a UUID as a parameter; 
can't we use the ReplicationRequest UUID? (i.e. we can remove the extra parameter 
and field and have an API to return ReplicationRequest#getUUID)
 ** Add javadoc for class ReplicationRequestToRepeat.
{quote}That's a very hard question. IMHO there is no easy way to get the 
current datanodes after HDDS-175, as there is no container -> datanode[] 
mapping for the closed containers. Do you know where this information available 
after HDDS-175? (I rebased the patch but can't return with{quote}
[HDDS-228] should give us the means to find out the replicas of a given 
container. We might have to check that we are not adding any replication 
request for Ratis/open containers.
{quote} fixed only the SCMContainerPlacementRandom.java and not the 
SCMCommonPolicy.java. Instead of todo, now it should be handled.{quote}
Shall we add a test case to validate that excluded nodes are not returned?



[jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers

2018-07-02 Thread Ajay Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530587#comment-16530587
 ] 

Ajay Kumar edited comment on HDDS-199 at 7/3/18 1:05 AM:
-

[~elek] thanks for working on this. A few suggestions:
 * Move ReplicateContainerCommand, ReplicateCommandWatcher, ReplicationManager 
from {{org.apache.hadoop.hdds.scm.container}} to 
{{org.apache.hadoop.ozone.container.replication}} or 
{{org.apache.hadoop.hdds.container.replication}}
 * Rename suggestion: {{ReplicateCommandWatcher}} to 
{{ReplicationCommandWatcher}} 
 * ReplicateContainerCommand:
 ** L65-67: Probably move this to some Pb-util class. We might have to do this 
conversion in other places as well.
 ** L75-L79: Use of streams might be less efficient than the traditional 
approach, especially since the list size is pretty small. 
 * SCMCommonPolicy
 ** Since we are not doing anything with excluded nodes for the time being, we 
should add a TODO comment and maybe file a jira to handle it later.

 * ReplicationQueue: 
 ** L65: Update the documentation for take, as it will not return null anymore.
 ** L37,L45,L55,L65,L69: We should synchronize the peek/remove and add 
operations (see the sketch after this list). Currently our ReplicationManager 
seems to be single-threaded, but that may change.
 * ReplicationRequest: 
 ** L65: 
 * ReplicationManager:
 ** L165: With HDDS-175 we will not get pipeline from containerInfo. 
 ** L75: Rename suggestion: containerStateMap to containerStateMgr, to avoid 
any confusion between ContainerStateManager and ContainerStateMap.
 ** L220: getUUID returns null
* ScmConfigKeys: add a default value for {{HDDS_SCM_WATCHER_TIMEOUT}} (i.e. 
HDDS_SCM_WATCHER_TIMEOUT_DEFAULT)
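
For the ReplicationQueue synchronization point above, a minimal sketch of 
synchronized add/poll/remove (a simplified stand-in, not the actual class):

{code:java}
import java.util.TreeSet;

/** Sketch of synchronized queue operations; a stand-in for ReplicationQueue. */
class SynchronizedReplicationQueueSketch<T extends Comparable<T>> {

  private final TreeSet<T> requests = new TreeSet<>();

  public synchronized void add(T request) {
    requests.add(request);
  }

  /** Returns and removes the highest-priority request, or null if empty. */
  public synchronized T poll() {
    return requests.pollFirst();
  }

  public synchronized void remove(T request) {
    requests.remove(request);
  }
}
{code}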
  


was (Author: ajayydv):
[~elek] thanks for working on this. A few suggestions:
 * Move ReplicateContainerCommand, ReplicateCommandWatcher, ReplicationManager 
from {{org.apache.hadoop.hdds.scm.container}} to 
{{org.apache.hadoop.ozone.container.replication}}
 * Rename suggestion: {{ReplicateCommandWatcher}} to 
{{ReplicationCommandWatcher}} 
 * ReplicateContainerCommand:
 ** L65-67: Probably move this to some Pb-util class. We might have to do this 
conversion in other places as well.
 ** L75-L79: Use of streams might be less efficient than the traditional 
approach, especially since the list size is pretty small. 
 * SCMCommonPolicy
 ** Since we are not doing anything with excluded nodes for the time being, we 
should add a TODO comment and maybe file a jira to handle it later.

 * ReplicationQueue: 
 ** L65: Update the documentation for take, as it will not return null anymore.
 ** L37,L45,L55,L65,L69: We should synchronize the peek/remove and add 
operations. Currently our ReplicationManager seems to be single-threaded, but 
that may change.
 * ReplicationRequest: 
 ** L65: 
 * ReplicationManager:
 ** L165: With HDDS-175 we will not get pipeline from containerInfo. 
 ** L75: Rename suggestion: containerStateMap to containerStateMgr, to avoid 
any confusion between ContainerStateManager and ContainerStateMap.
 ** L220: getUUID returns null
* ScmConfigKeys: add a default value for {{HDDS_SCM_WATCHER_TIMEOUT}} (i.e. 
HDDS_SCM_WATCHER_TIMEOUT_DEFAULT)
  


[jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers

2018-07-02 Thread Ajay Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530587#comment-16530587
 ] 

Ajay Kumar edited comment on HDDS-199 at 7/2/18 11:47 PM:
--

[~elek] thanks for working on this. A few suggestions:
 * Move ReplicateContainerCommand, ReplicateCommandWatcher, ReplicationManager 
from {{org.apache.hadoop.hdds.scm.container}} to 
{{org.apache.hadoop.ozone.container.replication}}
 * Rename suggestion: {{ReplicateCommandWatcher}} to 
{{ReplicationCommandWatcher}} 
 * ReplicateContainerCommand:
 ** L65-67: Probably move this to some Pb-util class. We might have to do this 
conversion in other places as well.
 ** L75-L79: Use of streams might be less efficient than the traditional 
approach, especially since the list size is pretty small. 
 * SCMCommonPolicy
 ** Since we are not doing anything with excluded nodes for the time being, we 
should add a TODO comment and maybe file a jira to handle it later.

 * ReplicationQueue: 
 ** L65: Update the documentation for take, as it will not return null anymore.
 ** L37,L45,L55,L65,L69: We should synchronize the peek/remove and add 
operations. Currently our ReplicationManager seems to be single-threaded, but 
that may change.
 * ReplicationRequest: 
 ** L65: 
 * ReplicationManager:
 ** L165: With HDDS-175 we will not get pipeline from containerInfo. 
 ** L75: Rename suggestion: containerStateMap to containerStateMgr, to avoid 
any confusion between ContainerStateManager and ContainerStateMap.
 ** L220: getUUID returns null
* ScmConfigKeys: add a default value for {{HDDS_SCM_WATCHER_TIMEOUT}} (i.e. 
HDDS_SCM_WATCHER_TIMEOUT_DEFAULT)
  


was (Author: ajayydv):
[~elek] thanks for working on this.
 * Move ReplicateContainerCommand, ReplicateCommandWatcher, ReplicationManager 
from {{org.apache.hadoop.hdds.scm.container}} to 
{{org.apache.hadoop.ozone.container.replication}}
 * Rename suggestion: {{ReplicateCommandWatcher}} to 
{{ReplicationCommandWatcher}} 
 * ReplicateContainerCommand:
 ** L65-67: Probably move this to some Pb-util class. We might have to do this 
conversion in other places as well.
 ** L75-L79: Use of streams might be less efficient than the traditional 
approach, especially since the list size is pretty small. 
 * SCMCommonPolicy
 ** Since we are not doing anything with excluded nodes for the time being, we 
should add a TODO comment and maybe file a jira to handle it later.

 * ReplicationQueue: 
 ** L65: Update the documentation for take, as it will not return null anymore.
 ** L37,L45,L55,L65,L69: We should synchronize the peek/remove and add 
operations. Currently our ReplicationManager seems to be single-threaded, but 
that may change.
 * ReplicationRequest: 
 ** L65: 
 * ReplicationManager:
 ** L165: With HDDS-175 we will not get pipeline from containerInfo. 
 ** L75: Rename suggestion: containerStateMap to containerStateMgr, to avoid 
any confusion between ContainerStateManager and ContainerStateMap.
 ** L220: getUUID returns null
* ScmConfigKeys: add a default value for {{HDDS_SCM_WATCHER_TIMEOUT}} (i.e. 
HDDS_SCM_WATCHER_TIMEOUT_DEFAULT)
  
