[ https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910708#comment-16910708 ]
CR Hota commented on HDFS-14090:
--------------------------------
[~hexiaoqiao] Thanks for the review. The points you raised are valid.
The design doc also mentions that at some point we need to look into and
introduce preemption/dynamic allocation. For Phase 1, however, the current
patch will help installations move forward with the concept of isolation.
Dynamic allocation/preemption will be a separate implementation of
{{FairnessPolicyController}}. I will open a ticket to track this next phase,
which will also need a thorough design analysis and review.
Let's wait for [~elgoiri] [~brahmareddy] [~aajisaka] [~xkrogen] to review the
010 patch.
> RBF: Improved isolation for downstream name nodes.
> --------------------------------------------------
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: CR Hota
> Assignee: CR Hota
> Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch,
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch,
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch,
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch,
> HDFS-14090.009.patch, HDFS-14090.010.patch, RBF_ Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should
> help minimize the impact of unhealthy clusters on clients connecting to
> healthy clusters.
> For example, if there are 2 name nodes downstream and one of them is
> heavily loaded with calls spiking RPC queue times, back pressure will cause
> the same to start reflecting on the router. As a result, clients
> connecting to healthy/faster name nodes will also slow down, because the same
> RPC queue is maintained for all calls at the router layer. Essentially, the
> same IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses a single RPC queue for all calls. Let's discuss how
> we can change the architecture and add some throttling logic for
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify
> the downstream name node, and maintain a separate queue for each underlying
> name node. Another, simpler way is to maintain some sort of rate limiter
> configured for each name node and let routers drop or reject requests, or
> return errors, after a certain threshold.
> This won’t be a simple change, as the router’s ‘Server’ layer would need to
> be redesigned and reimplemented. Currently this layer is the same as the
> name node's.
> Opening this ticket to discuss, design and implement this feature.
>
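
The simpler option in the description above (a limiter configured per name
node, with the router rejecting calls past a threshold) could be sketched
roughly as follows. This is only an illustration under assumptions: the class
name {{PerNameNodeLimiter}}, its methods, and the ids "ns0"/"ns1" are
hypothetical and are not taken from the attached patches or from
{{FairnessPolicyController}}. It caps concurrent calls per name node with a
semaphore instead of measuring a rate over time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical sketch: a fixed number of permits per downstream name node.
 * A slow name node can only exhaust its own permits, so calls to other
 * name nodes are not starved by it.
 */
final class PerNameNodeLimiter {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    /** @param maxConcurrentCalls assumed config: name node id -> permit count */
    PerNameNodeLimiter(Map<String, Integer> maxConcurrentCalls) {
        maxConcurrentCalls.forEach((id, n) -> permits.put(id, new Semaphore(n)));
    }

    /** Non-blocking: returns false when the call should be rejected. */
    boolean tryAcquire(String nameNodeId) {
        Semaphore s = permits.get(nameNodeId);
        return s != null && s.tryAcquire();
    }

    /** Must be called once for every successful tryAcquire. */
    void release(String nameNodeId) {
        Semaphore s = permits.get(nameNodeId);
        if (s != null) {
            s.release();
        }
    }

    public static void main(String[] args) {
        PerNameNodeLimiter limiter =
            new PerNameNodeLimiter(Map.of("ns0", 1, "ns1", 1));
        System.out.println(limiter.tryAcquire("ns0")); // true: permit taken
        System.out.println(limiter.tryAcquire("ns0")); // false: ns0 saturated
        System.out.println(limiter.tryAcquire("ns1")); // true: ns1 unaffected
        limiter.release("ns0");
        System.out.println(limiter.tryAcquire("ns0")); // true: permit returned
    }
}
```

A handler would call tryAcquire before forwarding a request and release in a
finally block once the downstream call returns. A rejected call would be
turned into an error for the client rather than queued.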