[
https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013529#comment-17013529
]
Jinglun commented on HDFS-15087:
--------------------------------
Hi [~brahmareddy], thanks for your nice review and comments!
{quote}1. Scheduler will have the ability to identify which NS is full and
automatically schedule the job..? Maybe each NS can configure the threshold..?
2. Scheduler can be enhanced based on RPC load/usage (balance the RPC load
also)..?
{quote}
These are very nice and advanced features to have, but we don't support
automatic balance yet. The Scheduler only handles the life cycle of HFR jobs;
it isn't smart enough to do automatic balance.
In our practice we only support using a DFSAdmin command to start a Scheduler
locally and submit an HFR job to it. If the Scheduler process breaks down, we
can use the 'continue' command to restart the Scheduler and recover the
unfinished jobs (a rough sketch of that recovery idea follows below). I think
we can do this step by step: first support balance through the DFSAdmin
command, then integrate it into the Router, and finally support smart balance.
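To make the life-cycle idea concrete, here is a minimal sketch of a Scheduler
that journals each job stage and reloads the unfinished jobs on 'continue'. All
the names here (HFRJob, JobJournal, executeNextStage, ...) are hypothetical
illustrations, not the API in the attached patch.
{code:java}
import java.io.IOException;
import java.util.List;

// Hypothetical sketch: HFRJob and JobJournal are illustrative, not from the patch.
interface HFRJob {
  boolean isDone();
  void executeNextStage() throws IOException;   // e.g. saveTree, hardlink, graftTree
}

interface JobJournal {
  void record(HFRJob job) throws IOException;         // persist the job state
  List<HFRJob> loadUnfinished() throws IOException;   // jobs to resume on 'continue'
}

public class Scheduler {
  private final JobJournal journal;

  public Scheduler(JobJournal journal) {
    this.journal = journal;
  }

  /** Submit a new HFR job: journal it first, then drive it to completion. */
  public void submit(HFRJob job) throws IOException {
    journal.record(job);
    run(job);
  }

  /** The 'continue' path: recover the unfinished jobs after a crash. */
  public void recover() throws IOException {
    for (HFRJob job : journal.loadUnfinished()) {
      run(job);                      // resume from the last journaled stage
    }
  }

  private void run(HFRJob job) throws IOException {
    while (!job.isDone()) {
      job.executeNextStage();
      journal.record(job);           // checkpoint after every stage
    }
  }
}
{code}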
{quote}3. How is consistency ensured if you won't use snapshots like Yiqun Lin
and Inigo mentioned..? After the "saveTree" step, before creating the mount
table (or block writes will be blocked till the job succeeds, but this might
delay other applications as you make the tree immutable?), how will the delta
be processed..? (I didn't see the mount table being made readonly.) By the way,
the editlog idea looks good here.
{quote}
The HFR(hardlink) doesn't support delta now; the whole process is
write-blocked, so for a big path HFR(hardlink) might block writes for a long
time. If we want a short write-block window we can use HFR(distcp). The
HFR(distcp) works like:
{code:java}
hfrDistCp() {
  long time = Long.MAX_VALUE;
  while (time > threshold) {   // iterate until one round is fast enough
    long start = now();
    distcp();                  // copy the delta since the last round
    time = now() - start;
  }
  blockWrites(src);
  distcp();                    // final round under blocked writes.
  updateMountTable();
}
{code}
Each round of distcp handles the delta accumulated during the previous round.
When a round is fast enough we block writes and do the final round of distcp.
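Just as a reference for what one distcp() round could look like: the sketch
below uses the Hadoop 3.x DistCp Java API in -update (sync-folder) mode, so
each call copies only the files changed since the previous round; the threshold
loop, blockWrites() and updateMountTable() stay as in the pseudocode above.
{code:java}
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DeltaRound {
  /** One distcp round in -update mode; returns the elapsed time in ms. */
  static long copyDelta(Configuration conf, Path src, Path dst) throws Exception {
    DistCpOptions options = new DistCpOptions.Builder(
        Collections.singletonList(src), dst)
        .withSyncFolder(true)      // like -update: copy only new/changed files
        .build();
    long start = System.currentTimeMillis();
    new DistCp(conf, options).execute();   // blocks until the MapReduce job finishes
    return System.currentTimeMillis() - start;
  }
}
{code}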
Another solution could be to balance the big path piece by piece using
HFR(hardlink), where each piece is small enough (e.g. less than 10TB) to be
finished within 1 minute; see the sketch below.
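A rough sketch of the piece-by-piece idea, assuming a hypothetical
submitHfrJob() helper that runs one HFR(hardlink) job, and 10 TB as the
per-piece limit:
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PieceBalancer {
  private static final long PIECE_LIMIT = 10L * 1024 * 1024 * 1024 * 1024;  // 10 TB

  /** Balance bigPath one child at a time so every write-block window stays short. */
  static void balanceByPieces(FileSystem srcFs, Path bigPath, Path dstRoot)
      throws IOException {
    for (FileStatus child : srcFs.listStatus(bigPath)) {
      Path dst = new Path(dstRoot, child.getPath().getName());
      long size = srcFs.getContentSummary(child.getPath()).getLength();
      if (child.isDirectory() && size > PIECE_LIMIT) {
        balanceByPieces(srcFs, child.getPath(), dst);   // split the piece further
      } else {
        submitHfrJob(child.getPath(), dst);   // one quick HFR(hardlink) job per piece
      }
    }
  }

  /** Hypothetical helper: one HFR(hardlink) job (saveTree -> hardlink -> graftTree). */
  static void submitHfrJob(Path src, Path dst) {
    // Placeholder only; the real submission belongs to the Scheduler.
  }
}
{code}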
When I tried to use snapshots I met a tricky problem. I envisioned 2 ways.
In fun1 we don't do the hardlink in the loop, so the final hardlink costs a
lot. From the performance chapter we can see the HardLink part costs
474,701ms (82.22%) for a large path, and the larger the path, the higher the
hardlink proportion. So in this way the benefit of the snapshot delta is not
much (17.78%).
{code:java}
fun1() {
  saveSnapshot(src);
  graftSnapshot(dst);
  long time = Long.MAX_VALUE;
  while (time > threshold) {   // keep applying the snapshot delta to dst
    long start = now();
    saveDiff(src);
    graftDiff(dst);
    time = now() - start;
  }
  blockWrites(src);
  saveDiff(src);               // final delta under blocked writes.
  graftDiff(dst);
  hardlink();                  // final hardlink of all the replicas.
  updateMountTable();
}
{code}
In fun2 we do the hardlink in the loop, so the final hardlink can be very
quick. But there is also a problem. Suppose a block is appended: its length
and generation stamp both change. Since the block is hardlinked, the change
affects the block in the dst pool too, so there would be an inconsistency
between the DN replica map and the block on disk. It might either be fixed by
the DN, deleted by the dst-NN, or reported as a missing block by the dst-NN.
Also, each time we do the hardlink we need to find out which blocks have
changed, then clear and re-hardlink them (a sketch of just enumerating the
changed paths follows after the code). The whole process is very complicated
and may cause a chain reaction.
{code:java}
fun2() {
  saveSnapshot(src);
  graftSnapshot(dst);
  long time = Long.MAX_VALUE;
  while (time > threshold) {
    long start = now();
    saveDiff(src);
    graftDiff(dst);
    hardlink();                // hardlink the delta in every round.
    time = now() - start;
  }
  blockWrites(src);
  saveDiff(src);               // final delta under blocked writes.
  graftDiff(dst);
  hardlink();                  // final hardlink.
  updateMountTable();
}
{code}
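On the "find out which blocks are changed" part: the paths modified between
two snapshots can at least be enumerated with the standard snapshot diff
report, as sketched below. Mapping those paths back to the replicas and
re-hardlinking them safely is the complicated part described above, and the
sketch doesn't attempt that.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport.DiffReportEntry;
import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport.DiffType;

public class DeltaFiles {
  /** Paths created or modified between two snapshots of src; their blocks
   *  would need to be (re-)hardlinked in the next round. */
  static List<Path> changedPaths(DistributedFileSystem fs, Path src,
      String fromSnapshot, String toSnapshot) throws IOException {
    SnapshotDiffReport report =
        fs.getSnapshotDiffReport(src, fromSnapshot, toSnapshot);
    List<Path> changed = new ArrayList<>();
    for (DiffReportEntry entry : report.getDiffList()) {
      if (entry.getType() == DiffType.CREATE || entry.getType() == DiffType.MODIFY) {
        changed.add(new Path(src,
            new String(entry.getSourcePath(), StandardCharsets.UTF_8)));
      }
    }
    return changed;
  }
}
{code}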
{quote}4. Mount table properties (or attributes) are also preserved..?
{quote}
Yes, the mount table properties are preserved too. (This isn't included in the
initial patch.)
{quote}5. saveTree() and graftTree() are idempotent..? On Namenode failover,
will these be re-executed if it's connected to the newly active namenode?
{quote}
The saveTree() is idempotent and the graftTree() is not. The graftTree() is
like "create" with overwrite=false: if the first graftTree() succeeds and the
client retries, the retry will fail. So in the HFR job we can't conclude that
graftTree() failed just because the RPC throws an exception. We need to check
the existence of the dst-path: if the dst-path exists then the RPC succeeded,
otherwise it failed.
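Concretely, the job's check could look like the sketch below; graftTree() here
is only a hypothetical stand-in for the actual client call. The point is that
an exception is followed by an existence check on the dst-path rather than
being treated as a failure.
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GraftTreeStep {
  /** Returns true if the tree is grafted on dst, whether by this call or an
   *  earlier attempt whose response was lost. */
  static boolean graftTreeWithCheck(FileSystem dstFs, Path dstPath)
      throws IOException {
    try {
      graftTree(dstPath);        // non-idempotent, like create(overwrite=false)
      return true;
    } catch (IOException e) {
      // The RPC may have succeeded before the connection broke (e.g. NN failover),
      // so an exception alone doesn't mean failure: check the dst-path instead.
      return dstFs.exists(dstPath);
    }
  }

  static void graftTree(Path dstPath) throws IOException {
    // Hypothetical stand-in for the graftTree client call in the HFR patch.
  }
}
{code}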
> RBF: Balance/Rename across federation namespaces
> ------------------------------------------------
>
> Key: HDFS-15087
> URL: https://issues.apache.org/jira/browse/HDFS-15087
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jinglun
> Priority: Major
> Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation
> Namespaces.pdf
>
>
> The Xiaomi storage team has developed a new feature called HFR(HDFS
> Federation Rename) that enables us to do balance/rename across federation
> namespaces. The idea is to first move the meta to the dst NameNode and then
> link all the replicas. It has been working in our largest production cluster
> for 2 months. We use it to balance the namespaces. It turns out HFR is fast
> and flexible. The detail could be found in the design doc.
> Looking forward to a lively discussion.