[
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944029#comment-16944029
]
Erik Krogen commented on HDFS-14090:
------------------------------------
Hey [~crh], strong apologies for my 1-2 days to get a review turning into 2
weeks. But here I am finally :) This is great work overall!
# {{RouterRpcClient#concurrentNS}} should be a {{static}} {{CONSTANT}}. Also it
seems that this is redefined in {{StaticFairnessPolicyController}}, we should
only define it in one place. Also duplicated in {{TestRouterFairnessManager}}.
# {{FairnessManager}} L55, you don't need {{throws IOException}} here
# Can you explain why we have {{FairnessManager}} instead of having
{{RouterRpcClient}} directly access the {{FairnessPolicyController}}? It seems
like it doesn't really add much benefit?
# I don't see the point in having {{FairnessPolicyController#acquirePermit()}}
throw an exception to indicate failure, rather than returning a boolean, if
{{FairnessManager}} is just going to catch it and convert it to a boolean
anyway. This is on a hot path (if there are permit rejections being thrown then
the Router must be busy) and creating exceptions is somewhat costly IIRC.
# In {{FairnessPolicyController}}, why not just call the initialization method
{{init()}} or {{initialize()}} instead of {{assignHandlersToNameservices()}} ?
But looking more closely, this method is only ever called in the constructor
for {{StaticFairnessPolicyController}}. I think you should either refactor this
so that an initialization method is called by
{{FederationUtil#newFairnessPolicyResolver}} (and have a no-op constructor), or
remove this method from the interface since it isn't used externally.
# In the Javadoc for {{NoFairnessPolicyController}}, you have an at-symbol like
you intended to link something, is this a typo?
# {{FederationUtil#newFairnessPolicyResolver}} -> should this be called
{{newFairnessPolicyController}} ?
# {{StaticFairnessPolicyController}} L78, I don't think we need the null-check
here, as {{getAllConfiguredNS}} won't ever put null values into the set. I
don't don't think you need the containment check on L83, since you're iterating
over a set (no duplicates) to add items to the map.
# {{StaticFairnessPolicyController}} L113, you are calling {{toString()}} on a
{{Set}} which doesn't have a good {{toString()}} implementation, you should use
something else.
# {{StaticFairnessPolicyController}} L183, you are re-joining
{{namenodeSplit}}, you can just use the original variable {{namenode}}
# Within {{TestRouterFairnessManager}}, {{createConf}} and {{generateBaseConf}}
seem to be the same thing?
# Within {{TestRouterFairnessManager#verifyInstantiationError}}, would it be
more idiomatic/simpler to use the {{GenericTestUtils.LogCapturer}}?
# {{TestRouterFairnessManager}} really seems like it's testing the logic of
{{StaticFairnessPolicyController}}, not {{FairnessManager}}. Should we rename
this test?
# For {{TestRouterHandlersFairness}}, I like the intent, but I'm worried this
is going to become another flaky test in the build. Can you convince me that it
won't be flaky?
> RBF: Improved isolation for downstream name nodes. {Static}
> -----------------------------------------------------------
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: CR Hota
> Assignee: CR Hota
> Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch,
> HDFS-14090-HDFS-13891.002.patch, HDFS-14090-HDFS-13891.003.patch,
> HDFS-14090-HDFS-13891.004.patch, HDFS-14090-HDFS-13891.005.patch,
> HDFS-14090.006.patch, HDFS-14090.007.patch, HDFS-14090.008.patch,
> HDFS-14090.009.patch, HDFS-14090.010.patch, HDFS-14090.011.patch,
> HDFS-14090.012.patch, HDFS-14090.013.patch, HDFS-14090.014.patch, RBF_
> Isolation design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures, should
> help minimize impact of clients connecting to healthy clusters vs unhealthy
> clusters.
> For example - If there are 2 name nodes downstream, and one of them is
> heavily loaded with calls spiking rpc queue times, due to back pressure the
> same with start reflecting on the router. As a result of this, clients
> connecting to healthy/faster name nodes will also slow down as same rpc queue
> is maintained for all calls at the router layer. Essentially the same IPC
> thread pool is used by router to connect to all name nodes.
> Currently router uses one single rpc queue for all calls. Lets discuss how we
> can change the architecture and add some throttling logic for
> unhealthy/slow/overloaded name nodes.
> One way could be to read from current call queue, immediately identify
> downstream name node and maintain a separate queue for each underlying name
> node. Another simpler way is to maintain some sort of rate limiter configured
> for each name node and let routers drop/reject/send error requests after
> certain threshold.
> This won’t be a simple change as router’s ‘Server’ layer would need redesign
> and implementation. Currently this layer is the same as name node.
> Opening this ticket to discuss, design and implement this feature.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]