[ 
https://issues.apache.org/jira/browse/HDFS-11493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980130#comment-15980130
 ] 

Anu Engineer edited comment on HDFS-11493 at 4/22/17 8:58 PM:
--------------------------------------------------------------

[~cheersyang] Thanks for taking time out to look at the doc and ask all 
pertinent questions. Please see replies inline. Looking forward to more 
comments and review comments.

bq. Right now node state only has HEALTHY, STALE, DEAD, UNKNOWN. Is it useful 
to add following states as well?

That is the plan. As we add ability in SCM to handle maintenance mode and 
Decommissioning support, we will add those states into node state.

bq.  nodes container report will arrive scm in a wave? Would that cause network 
problem?

Very good question. This is something that I have struggled a bit about . Here 
is how I have modeled  it. Assume that we have a typical machine with 12 disks 
with 5 TB each. That would give us a node capacity of *60 TB*. Assuming that 
average size of container is 5 GB -- we have *12,000* closed containers in a 
node. Now assume that each container takes around 128 bytes to report its 
state, then we have little less than 2 MB per full container report from a node.

Today, a block report is many times larger, so I am hoping that we can keep 10 
pools running at a time, that is 240 nodes can send in its reports. It is not 
that they are all going to send the report  at the same time, but in a 5 minute 
window we need to process 480 MB of network traffic. That is around 1.6 MBps, I 
am assuming that SCM machine will be able to process this In terms of CPU, 
network and memory bandwidth. We will also support the maximum concurrency in 
pool processing, so SCM can be run on machine that is less powerful if needed.

bq. Can we move configuration properties
Sure, will do when I update the patch.

bq. More specifically, if a thread wants to add a command for datanode A, and 
another thread wants to add a command for datanode B, we probably don't want 
them to wait for the othe

It is a very good suggestion. Can I leave a TODO now, I would like to add that 
kind of locking once we get a chance to profile this code path and if we find 
that threads are waiting on this lock, we certainly should do it. It is easy to 
do also.

bq.This is a in-memory queue, how to make sure not to run into inconsistent 
state? Imagine if replication manager has just processed container reports from 
a pool and ask a datanode to replicate a container, assume the replication is 
happening in progress. And then scm crashed and restarted, how scm gets to know 
what current state is ?

You are  right, SCM crashing is a problem. Current thinking is to create an HA 
solution using Apache Ratis. Once we have a single instance SCM we will get SCM 
to support Ratis. I am told that Ratis is planning to support a distributed map 
class, which will look like a normal map for all purposes, but internally will 
replicate using Raft protocol.

bq. The queue seems to be time ordered, I think it will be better to support 
priority as well. Commands may have different priority, for example, replicate 
a container priority is usually higher than delete a container replica; 
replicate a container also may have different priorities according to the 
number of replicas in desire.

Good catch, I had not thought about it. Right now, the architecture is to send 
all commands in the next heartbeat to the node. Since the SCM typically sends 
all queued commands ( assumption here is that we will not have thousands of 
commands for a node) to a node when it sees a HB.

We might want to create a priority of processing these commands --  specified 
by SCM so datanodes can get to important tasks first. I would like to leave a 
TODO and JIRA for this for time being.


was (Author: anu):
[~cheersyang] Thanks for taking time out to look at the doc and ask all 
pertinent questions. Please see replies inline. Looking forward to more 
comments and review comments.

bq. Right now node state only has HEALTHY, STALE, DEAD, UNKNOWN. Is it useful 
to add following states as well?

That is the plan. As we add ability in SCM to handle maintenance mode and 
Decommissioning support, we will add those states into node state.

bq.  nodes container report will arrive scm in a wave? Would that cause network 
problem?

Very good question. This is something that I have struggled a bit about . Here 
is how I have modeled  it. Assume that we have a typical machine with 12 disks 
with 5 TB each. That would give us a node capacity of *60 TB*. Assuming that 
our average size of container is 5 GB -- we have *12,000* closed containers in 
a node. Now assume that each container takes around 128 bytes to report its 
state, then we have little less than 2 MB per full container report from a node.

Today, a block report is many times larger, so I am hoping that we can keep 10 
pools running at a time, that is 240 nodes can send in its reports. It is not 
that they are all going to send the report  at the same time, but in a 5 minute 
window we need to process 480 MB of network traffic. That is around 1.6 MBps, I 
am assuming that SCM machine will be able to process this In terms of CPU, 
network and memory bandwidth. We will also support the maximum concurrency in 
pool processing, so SCM can be run on machine that is less powerful if needed.

bq. Can we move configuration properties
Sure, will do when I update the patch.

bq. More specifically, if a thread wants to add a command for datanode A, and 
another thread wants to add a command for datanode B, we probably don't want 
them to wait for the othe

It is a very good suggestion. Can I leave a TODO now, I would like to add that 
kind of locking once we get a chance to profile this code path and if we find 
that threads are waiting on this lock, we certainly should do it. It is easy to 
do also.

bq.This is a in-memory queue, how to make sure not to run into inconsistent 
state? Imagine if replication manager has just processed container reports from 
a pool and ask a datanode to replicate a container, assume the replication is 
happening in progress. And then scm crashed and restarted, how scm gets to know 
what current state is ?

You are  right, SCM crashing is a problem. Current thinking is to create an HA 
solution using Apache Ratis. Once we have a single instance SCM we will get SCM 
to support Ratis. I am told that Ratis is planning to support a distributed map 
class, which will look like a normal map for all purposes, but internally will 
replicate using Raft protocol.

bq. The queue seems to be time ordered, I think it will be better to support 
priority as well. Commands may have different priority, for example, replicate 
a container priority is usually higher than delete a container replica; 
replicate a container also may have different priorities according to the 
number of replicas in desire.

Good catch, I had not thought about it. Right now, the architecture is to send 
all commands in the next heartbeat to the node. Since the SCM typically sends 
all queued commands ( assumption here is that we will not have thousands of 
commands for a node) to a node when it sees a HB.

We might want to create a priority of processing these commands --  specified 
by SCM so datanodes can get to important tasks first. I would like to leave a 
TODO and JIRA for this for time being.

> Ozone: SCM:  Add the ability to handle container reports 
> ---------------------------------------------------------
>
>                 Key: HDFS-11493
>                 URL: https://issues.apache.org/jira/browse/HDFS-11493
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>    Affects Versions: HDFS-7240
>            Reporter: Anu Engineer
>            Assignee: Anu Engineer
>         Attachments: container-replication-storage.pdf, 
> HDFS-11493-HDFS-7240.001.patch
>
>
> Once a datanode sends the container report it is SCM's responsibility to 
> determine if the replication levels are acceptable. If it is not, SCM should 
> initiate a replication request to another datanode. This JIRA tracks how SCM  
> handles a container report.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to