[
https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013529#comment-17013529
]
Jinglun commented on HDFS-15087:
--------------------------------
Hi [~brahmareddy], thanks for your nice review and comments!
{quote}1. Scheduler will have the ability to identify which NS is full and
automatically schedule the job..? Maybe each NS can configure the threshold..?
2. Scheduler can be enhanced based on RPC load/usage (balance the RPC load
also)..?
{quote}
These are very nice and advanced features to have, but we don't support
automatic balance yet. The Scheduler only handles the life cycle of HFR jobs;
it isn't smart enough to do automatic balance.
In our practice we only support using a DFSAdmin command to start a Scheduler
locally and submit an HFR job to it. If the Scheduler process breaks down, we
can use the 'continue' command to restart the Scheduler and recover the
unfinished jobs (a rough sketch of that recovery idea follows below). I think
we can do this step by step: first support balance through the DFSAdmin
command, then integrate it into the Router, and finally support smart balance.
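To make the life-cycle idea concrete, here is a minimal sketch of a Scheduler
that journals each job stage and reloads the unfinished jobs on 'continue'. All
the names here (HFRJob, JobJournal, executeNextStage, ...) are hypothetical
illustrations, not the API in the attached patch.
{code:java}
import java.io.IOException;
import java.util.List;

// Hypothetical sketch: HFRJob and JobJournal are illustrative, not from the patch.
interface HFRJob {
  boolean isDone();
  void executeNextStage() throws IOException;   // e.g. saveTree, hardlink, graftTree
}

interface JobJournal {
  void record(HFRJob job) throws IOException;         // persist the job state
  List<HFRJob> loadUnfinished() throws IOException;   // jobs to resume on 'continue'
}

public class Scheduler {
  private final JobJournal journal;

  public Scheduler(JobJournal journal) {
    this.journal = journal;
  }

  /** Submit a new HFR job: journal it first, then drive it to completion. */
  public void submit(HFRJob job) throws IOException {
    journal.record(job);
    run(job);
  }

  /** The 'continue' path: recover the unfinished jobs after a crash. */
  public void recover() throws IOException {
    for (HFRJob job : journal.loadUnfinished()) {
      run(job);                      // resume from the last journaled stage
    }
  }

  private void run(HFRJob job) throws IOException {
    while (!job.isDone()) {
      job.executeNextStage();
      journal.record(job);           // checkpoint after every stage
    }
  }
}
{code}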
{quote}3. How is consistency ensured if you won't use snapshots like Yiqun Lin
and Inigo mentioned..? After the "saveTree" step, before creating the mount
table (or block writes will be blocked till the job succeeds, but this might
delay other applications as you make the tree immutable?), how will the delta
be processed..? (I didn't see the mount table being made readonly.) By the way,
the editlog idea looks good here.
{quote}
The HFR(hardlink) doesn't support delta now; the whole process is
write-blocked, so for a big path HFR(hardlink) might block writes for a long
time. If we want a short write-block window we can use HFR(distcp). The
HFR(distcp) works like:
{code:java}
hfrDistCp() {
  long time = Long.MAX_VALUE;
  while (time > threshold) {   // iterate until one round is fast enough
    long start = now();
    distcp();                  // copy the delta since the last round
    time = now() - start;
  }
  blockWrites(src);
  distcp();                    // final round under blocked writes.
  updateMountTable();
}
{code}
Each round of distcp handles the delta accumulated during the previous round.
When a round is fast enough we block writes and do the final round of distcp.
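Just as a reference for what one distcp() round could look like: the sketch
below uses the Hadoop 3.x DistCp Java API in -update (sync-folder) mode, so
each call copies only the files changed since the previous round; the threshold
loop, blockWrites() and updateMountTable() stay as in the pseudocode above.
{code:java}
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DeltaRound {
  /** One distcp round in -update mode; returns the elapsed time in ms. */
  static long copyDelta(Configuration conf, Path src, Path dst) throws Exception {
    DistCpOptions options = new DistCpOptions.Builder(
        Collections.singletonList(src), dst)
        .withSyncFolder(true)      // like -update: copy only new/changed files
        .build();
    long start = System.currentTimeMillis();
    new DistCp(conf, options).execute();   // blocks until the MapReduce job finishes
    return System.currentTimeMillis() - start;
  }
}
{code}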
Another solution could be to balance the big path piece by piece using
HFR(hardlink), where each piece is small enough (e.g. less than 10TB) to be
finished within 1 minute; see the sketch below.
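A rough sketch of the piece-by-piece idea, assuming a hypothetical
submitHfrJob() helper that runs one HFR(hardlink) job, and 10 TB as the
per-piece limit:
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PieceBalancer {
  private static final long PIECE_LIMIT = 10L * 1024 * 1024 * 1024 * 1024;  // 10 TB

  /** Balance bigPath one child at a time so every write-block window stays short. */
  static void balanceByPieces(FileSystem srcFs, Path bigPath, Path dstRoot)
      throws IOException {
    for (FileStatus child : srcFs.listStatus(bigPath)) {
      Path dst = new Path(dstRoot, child.getPath().getName());
      long size = srcFs.getContentSummary(child.getPath()).getLength();
      if (child.isDirectory() && size > PIECE_LIMIT) {
        balanceByPieces(srcFs, child.getPath(), dst);   // split the piece further
      } else {
        submitHfrJob(child.getPath(), dst);   // one quick HFR(hardlink) job per piece
      }
    }
  }

  /** Hypothetical helper: one HFR(hardlink) job (saveTree -> hardlink -> graftTree). */
  static void submitHfrJob(Path src, Path dst) {
    // Placeholder only; the real submission belongs to the Scheduler.
  }
}
{code}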
When I tried to use snapshots I met a tricky problem. I envisioned 2 ways.
In fun1 we don't do the hardlink in the loop, so the final hardlink costs a
lot. From the performance chapter we can see the HardLink part costs
474,701ms (82.22%) for a large path, and the larger the path, the higher the
hardlink proportion. So in this way the benefit of the snapshot delta is not
much (17.78%).
{code:java}
fun1() {
  saveSnapshot(src);
  graftSnapshot(dst);
  long time = Long.MAX_VALUE;
  while (time > threshold) {   // keep applying the snapshot delta to dst
    long start = now();
    saveDiff(src);
    graftDiff(dst);
    time = now() - start;
  }
  blockWrites(src);
  saveDiff(src);               // final delta under blocked writes.
  graftDiff(dst);
  hardlink();                  // final hardlink of all the replicas.
  updateMountTable();
}
{code}
In fun2 we do the hardlink in the loop, so the final hardlink can be very
quick. But there is also a problem. Suppose a block is appended: its length
and generation stamp both change. Since the block is hardlinked, the change
affects the block in the dst pool too, so there would be an inconsistency
between the DN replica map and the block on disk. It might either be fixed by
the DN, deleted by the dst-NN, or reported as a missing block by the dst-NN.
Also, each time we do the hardlink we need to find out which blocks have
changed, then clear and re-hardlink them (a sketch of just enumerating the
changed paths follows after the code). The whole process is very complicated
and may cause a chain reaction.
{code:java}
fun2() {
  saveSnapshot(src);
  graftSnapshot(dst);
  long time = Long.MAX_VALUE;
  while (time > threshold) {
    long start = now();
    saveDiff(src);
    graftDiff(dst);
    hardlink();                // hardlink the delta in every round.
    time = now() - start;
  }
  blockWrites(src);
  saveDiff(src);               // final delta under blocked writes.
  graftDiff(dst);
  hardlink();                  // final hardlink.
  updateMountTable();
}
{code}
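On the "find out which blocks are changed" part: the paths modified between
two snapshots can at least be enumerated with the standard snapshot diff
report, as sketched below. Mapping those paths back to the replicas and
re-hardlinking them safely is the complicated part described above, and the
sketch doesn't attempt that.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport.DiffReportEntry;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport.DiffType;

public class DeltaFiles {
  /** Paths created or modified between two snapshots of src; their blocks
   *  would need to be (re-)hardlinked in the next round. */
  static List<Path> changedPaths(DistributedFileSystem fs, Path src,
      String fromSnapshot, String toSnapshot) throws IOException {
    SnapshotDiffReport report =
        fs.getSnapshotDiffReport(src, fromSnapshot, toSnapshot);
    List<Path> changed = new ArrayList<>();
    for (DiffReportEntry entry : report.getDiffList()) {
      if (entry.getType() == DiffType.CREATE || entry.getType() == DiffType.MODIFY) {
        changed.add(new Path(src,
            new String(entry.getSourcePath(), StandardCharsets.UTF_8)));
      }
    }
    return changed;
  }
}
{code}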
{quote}4. Mount table properties (or attributes) are also preserved..?
{quote}
Yes, the mount table properties are preserved too. (This isn't included in the
initial patch.)
{quote}5. saveTree() and graftTree() are idempotent..? On Namenode failover,
will these be re-executed if it's connected to the newly active namenode?
{quote}
The saveTree() is idempotent and the graftTree() is not. The graftTree() is
like "create" with overwrite=false: if the first graftTree() succeeds and the
client retries, the retry will fail. So in the HFR job we can't conclude that
graftTree() failed just because the RPC throws an exception. We need to check
the existence of the dst-path: if the dst-path exists then the RPC succeeded,
otherwise it failed.
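Concretely, the job's check could look like the sketch below; graftTree() here
is only a hypothetical stand-in for the actual client call. The point is that
an exception is followed by an existence check on the dst-path rather than
being treated as a failure.
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GraftTreeStep {
  /** Returns true if the tree is grafted on dst, whether by this call or an
   *  earlier attempt whose response was lost. */
  static boolean graftTreeWithCheck(FileSystem dstFs, Path dstPath)
      throws IOException {
    try {
      graftTree(dstPath);        // non-idempotent, like create(overwrite=false)
      return true;
    } catch (IOException e) {
      // The RPC may have succeeded before the connection broke (e.g. NN failover),
      // so an exception alone doesn't mean failure: check the dst-path instead.
      return dstFs.exists(dstPath);
    }
  }

  static void graftTree(Path dstPath) throws IOException {
    // Hypothetical stand-in for the graftTree client call in the HFR patch.
  }
}
{code}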
> RBF: Balance/Rename across federation namespaces
> ------------------------------------------------
>
> Key: HDFS-15087
> URL: https://issues.apache.org/jira/browse/HDFS-15087
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jinglun
> Priority: Major
> Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation
> Namespaces.pdf
>
>
> The Xiaomi storage team has developed a new feature called HFR(HDFS
> Federation Rename) that enables us to do balance/rename across federation
> namespaces. The idea is to first move the meta to the dst NameNode and then
> link all the replicas. It has been working in our largest production cluster
> for 2 months. We use it to balance the namespaces. It turns out HFR is fast
> and flexible. The detail could be found in the design doc.
> Looking forward to a lively discussion.