[
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855159#comment-16855159
]
CR Hota commented on HDFS-14090:
--------------------------------
[~xkrogen] Thanks for taking a look.
Yes, the overall idea I am looking at is a self-healing system: under heavy
load, requests will automatically balance out across all routers in a cluster.
There is no way to segregate incoming requests by nameservice up front, as
mentioned in the design. The only way to handle this is to make sure handlers
flush requests quickly, which is what they will do when permits are not available.
With respect to the exception, StandbyException is the only one that will allow
clients to connect to a different router, thus distributing load and
auto-balancing the whole system. This needs no client change. Backoff
exceptions won't help clients try another router that might have handlers
available.
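To make the permit idea concrete, here is a minimal sketch, not the code from the attached patch. It assumes a fixed number of permits per nameservice and a handler that fails fast when none is free. The names NameservicePermits and RouterBusyException are hypothetical. RouterBusyException only stands in for org.apache.hadoop.ipc.StandbyException so that the example compiles without Hadoop on the classpath.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical stand-in for org.apache.hadoop.ipc.StandbyException. */
class RouterBusyException extends IOException {
  RouterBusyException(String msg) {
    super(msg);
  }
}

/** Hypothetical per-nameservice permit gate (illustration only). */
class NameservicePermits {
  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
  private final int permitsPerNameservice;

  NameservicePermits(int permitsPerNameservice) {
    this.permitsPerNameservice = permitsPerNameservice;
  }

  private Semaphore semaphoreFor(String nsId) {
    // Lazily create one semaphore per nameservice id.
    return permits.computeIfAbsent(nsId, k -> new Semaphore(permitsPerNameservice));
  }

  /** Non-blocking: returns false at once if the nameservice has no free permit. */
  boolean tryAcquire(String nsId) {
    return semaphoreFor(nsId).tryAcquire();
  }

  /** Fail fast so the handler is freed and the client can try another router. */
  void acquireOrReject(String nsId) throws RouterBusyException {
    if (!tryAcquire(nsId)) {
      throw new RouterBusyException("No permit available for nameservice " + nsId);
    }
  }

  /** Must be called exactly once for every successful acquire. */
  void release(String nsId) {
    semaphoreFor(nsId).release();
  }

  int available(String nsId) {
    return semaphoreFor(nsId).availablePermits();
  }
}
```

A handler would call acquireOrReject(nsId) before invoking the downstream name node and release(nsId) in a finally block. In this sketch, a saturated nameservice exhausts only its own permits, so calls to other nameservices still get through.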
We can always add finer control once v1 is done, such as throwing different
kinds of exceptions under different scenarios, or preemption. I have briefly
mentioned them in the design.
> RBF: Improved isolation for downstream name nodes.
> --------------------------------------------------
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: CR Hota
> Assignee: CR Hota
> Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, RBF_ Isolation
> design.pdf
>
>
> The router is a gateway to the underlying name nodes. Gateway architectures
> should help minimize the impact on clients connecting to healthy clusters from
> unhealthy clusters.
> For example, if there are 2 name nodes downstream and one of them is
> heavily loaded, with calls spiking rpc queue times, then due to back pressure
> the same will start to be reflected on the router. As a result, clients
> connecting to healthy/faster name nodes will also slow down, as the same rpc
> queue is maintained for all calls at the router layer. Essentially, the same
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses a single rpc queue for all calls. Let's discuss how
> we can change the architecture and add some throttling logic for
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify the
> downstream name node, and maintain a separate queue for each underlying name
> node. Another, simpler way is to maintain some sort of rate limiter configured
> for each name node and let routers drop or reject requests, or return errors,
> after a certain threshold.
> This won’t be a simple change, as the router’s ‘Server’ layer would need
> redesign and implementation. Currently this layer is the same as the name
> node’s.
> Opening this ticket to discuss, design and implement this feature.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)