[
https://issues.apache.org/jira/browse/HDFS-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332129#comment-15332129
]
Zhe Zhang commented on HDFS-10467:
----------------------------------
I read through the design doc and the patch in more detail. Looks really
interesting, great work here [~elgoiri]. Below are some questions and comments:
# Have you considered, and compared with, the option where the client first
checks with the Router to get the NN address before doing the actual RPCs? Or
where the client directly checks the mount table in the State Store to get the
NN address? (See the first sketch after this list.)
# Have you considered using hard-linking on DNs for rebalancing?
# Just to clarify, in the posted design and patch, is the Subcluster Rebalancer
a tool that always has to be started manually? Or is some form of automatic
rebalancing in scope? In the patch, which component / class contains the
Rebalancer logic? The {{Rebalancer}} interface doesn't seem to contain it.
# bq. We may also find (4) scenarios where too much load or high space
requirements in a subcluster start to interfere with the primary tenants of the
subcluster.
What are "primary tenants" in this context? Non-Hadoop workloads running on the
same physical nodes?
# Locking the mount entry during rebalancing sounds too disruptive to
applications. Alternatively, could we abort the rebalancing when there is an
incoming write? Coupled with the precondition in 5.2.1, the chances of an
aborted rebalancing shouldn't be too high.
# Locking a mount point is a little tricky. Technically, an HDFS client has
full control over its local config, and can be configured to talk directly to
the NN of a subtree. In a production environment, this could come from a legacy
config file or a temporary workaround that bypasses the Router, and it could
lead to data corruption (see the second sketch after this list). Not sure if we
should consider adding subtree locking to HDFS itself (any previous
discussions / JIRAs)?
# About the rebalancing protocol in 5.6:
#* In step 1, can we simplify it by limiting to at most one rebalancing effort
at any given time? A new rebalancing effort would then have to wait for the
current one to either finish or be aborted (see the third sketch after this
list).
#* Steps 7 and 9 involve waiting for _all_ Routers to acknowledge some state
change. Is this too heavyweight? Who maintains Router membership? Are we doing
this because the Routers cache the mount table data?
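To make question 1 more concrete, here is a minimal sketch of the client-side
resolution flow I have in mind: the client resolves the mount entry first and
then talks to the owning NN directly. The {{RouterResolver}} interface and the
NN address are made up for illustration; they are not part of the posted patch.
{code:java}
// Hypothetical client-side resolution flow: the client first asks the Router
// (or the State Store) which subcluster NN owns a path, then talks to that NN
// directly. The RouterResolver interface below is invented for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientSideResolutionSketch {

  /** Hypothetical lookup: mount path -> NN URI of the owning subcluster. */
  interface RouterResolver {
    URI resolveNamenode(String path) throws java.io.IOException;
  }

  public static void main(String[] args) throws Exception {
    // Stubbed resolver; a real one would call the Router or the State Store.
    RouterResolver resolver = path -> URI.create("hdfs://nn-subcluster1:8020");
    Configuration conf = new Configuration();

    // Step 1: resolve the mount entry to a subcluster NN (one extra round trip).
    URI nnUri = resolver.resolveNamenode("/tenant1/data");

    // Step 2: issue the actual RPCs directly against that NN, so the Router is
    // no longer on the data path.
    FileSystem fs = FileSystem.get(nnUri, conf);
    fs.mkdirs(new Path("/tenant1/data/new-dir"));
  }
}
{code}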
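For question 6, this is the kind of bypass I am worried about: a client whose
local config points straight at a subcluster NN never goes through the Router,
so a mount-entry lock held by the Router cannot stop its writes. The NN address
below is a placeholder.
{code:java}
// Sketch of the bypass scenario: a legacy or hand-edited config that points
// fs.defaultFS straight at one subcluster's NN. Writes issued this way never
// pass through the Router, so a Router-side mount-entry lock cannot stop them.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectNamenodeWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn-subcluster1:8020"); // bypasses the Router

    try (FileSystem fs = FileSystem.get(conf)) {
      // This write lands in the subtree even if the Router has "locked" the
      // corresponding mount entry for rebalancing.
      fs.create(new Path("/tenant1/data/part-00000")).close();
    }
  }
}
{code}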
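For the first point under question 7, a rough sketch of the "at most one
rebalancing at a time" idea: the Rebalancer claims a single global marker
before starting and backs off if another effort is in flight. The
{{GlobalLock}} interface is hypothetical, not something in the patch.
{code:java}
// Sketch of a single-rebalancer guard: claim one global marker (e.g. a record
// in the State Store) before starting, and bail out if it is already claimed.
public class SingleRebalancerGuard {

  /** Hypothetical State-Store-backed marker, claimed atomically. */
  interface GlobalLock {
    boolean tryClaim(String owner);   // true only for the first claimer
    void release(String owner);
  }

  static void rebalance(GlobalLock lock, String rebalancerId, Runnable work) {
    if (!lock.tryClaim(rebalancerId)) {
      // Another rebalancing effort is active; wait for it to finish or abort.
      System.out.println("Rebalancing already in progress, not starting a new one");
      return;
    }
    try {
      work.run();                     // the actual move/copy of the subtree
    } finally {
      lock.release(rebalancerId);
    }
  }
}
{code}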
> Router-based HDFS federation
> ----------------------------
>
> Key: HDFS-10467
> URL: https://issues.apache.org/jira/browse/HDFS-10467
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: fs
> Affects Versions: 2.7.2
> Reporter: Inigo Goiri
> Attachments: HDFS Router Federation.pdf, HDFS-10467.PoC.patch,
> HDFS-Router-Federation-Prototype.patch
>
>
> Add a Router to provide a federated view of multiple HDFS clusters.