[ 
https://issues.apache.org/jira/browse/HDFS-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332129#comment-15332129
 ] 

Zhe Zhang commented on HDFS-10467:
----------------------------------

I read more of the design and the patch. It looks really interesting; great work 
here [~elgoiri]. Below are some questions and comments:

# Have you considered and compared against the option where the client first 
checks with the Router to get the NN address before doing the actual RPCs? Or 
where the client directly checks the mount table in the StateStore to get the NN 
address? (A minimal resolver sketch for this idea follows the list.)
# Have you considered using hard-linking on DNs for rebalancing?
# Just to clarify, in the posted design and patch, is the Subcluster Rebalancer a 
tool that is always started manually, or is some form of automatic rebalancing in 
scope? In the patch, which component / class contains the Rebalancer logic? The 
{{Rebalancer}} interface doesn't look like it.
# bq. We may also find (4) scenarios where too much load or high space 
requirements in a subcluster start to interfere with the primary tenants of the 
subcluster.
What are "primary tenants" in this context? Non-Hadoop workloads running on the 
same physical nodes?
# Locking the mount entry during rebalancing sounds too disruptive to 
applications. Alternatively, could we abort the rebalancing when there is an 
incoming write (see the abort-on-write sketch after this list)? Coupled with the 
5.2.1 Precondition, the chances of an aborted rebalancing shouldn't be too high.
# Locking a mount point is a little tricky. Technically, an HDFS client has full 
control over its local config and can be configured to talk directly to the NN of 
a subtree. In a production environment, this could come from a legacy config 
file or a temporary workaround that bypasses the Router, and it could lead to 
data corruption. I'm not sure if we should consider adding subtree locking in 
HDFS itself (any previous discussions / JIRAs)?
# About the rebalancing protocol in 5.6:
#* In step 1, can we simplify it by limiting to at most one rebalancing effort at 
any given time (a small guard sketch after this list illustrates the idea)? A new 
rebalancing effort would then have to wait for the current one to either finish 
or be aborted.
#* Steps 7 and 9 involve waiting for _all_ Routers to acknowledge some state 
change. Is this too heavyweight? Who maintains Router membership? Are we doing 
this because the Routers cache mount table data?

> Router-based HDFS federation
> ----------------------------
>
>                 Key: HDFS-10467
>                 URL: https://issues.apache.org/jira/browse/HDFS-10467
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.7.2
>            Reporter: Inigo Goiri
>         Attachments: HDFS Router Federation.pdf, HDFS-10467.PoC.patch, 
> HDFS-Router-Federation-Prototype.patch
>
>
> Add a Router to provide a federated view of multiple HDFS clusters.



