[
https://issues.apache.org/jira/browse/KNOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632073#comment-16632073
]
Phil Zampino commented on KNOX-1093:
------------------------------------
I've updated the broken "retry" tests to "failover" tests.
> KNOX Not Handling safemode state of one of the NameNode In HA state
> --------------------------------------------------------------------
>
> Key: KNOX-1093
> URL: https://issues.apache.org/jira/browse/KNOX-1093
> Project: Apache Knox
> Issue Type: Bug
> Components: Server
> Affects Versions: 0.10.0
> Reporter: Rajesh Chandramohan
> Assignee: Matthew Sharp
> Priority: Major
> Fix For: 1.2.0
>
> Attachments: KNOX-1093.patch
>
>
> In WebHdfsHaDispatch.java, when a SafeModeException occurs, the dispatch
> calls retryRequest(). Like failoverRequest(), this calls executeRequest()
> again, but the NameNode used by the thread does not change across any of
> its iterations, up to maxRetryAttempts=300 with retrySleep=1000 (1 second).
> After at most 5 minutes, client retries should pick the right NameNode, at
> least on the next attempt.
> But if we need to copy a set of files within a stipulated time, X% of
> connections land on this NameNode and fail. Can we handle that
> better?
> {code:java}
>   try {
>     inboundResponse = executeOutboundRequest(outboundRequest);
>     writeOutboundResponse(outboundRequest, inboundRequest,
>         outboundResponse, inboundResponse);
>   } catch (StandbyException e) {
>     LOG.errorReceivedFromStandbyNode(e);
>     failoverRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
>   } catch (SafeModeException e) {
>     LOG.errorReceivedFromSafeModeNode(e);
>     retryRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
>   } catch (IOException e) {
>     LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
>     failoverRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
>   }
> }
> {code}
> The SafeModeException handling in the Knox HA dispatch code needs to change:
> flag the NameNode that is stuck in safe mode, maintain a "don't try" queue,
> and redirect all further connections only to the healthy active NameNode.
> This way we can handle the X% of failures. What do you think?
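
For illustration only, the "don't try" idea from the description could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code in KNOX-1093.patch: SafeModeAwareSelector, markSafeMode(), select() and the quarantine period are all hypothetical names that do not exist in Knox, and time is passed in explicitly so the behavior is easy to check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Knox): remembers which NameNode URLs
 * recently reported safe mode and skips them when choosing a target.
 */
class SafeModeAwareSelector {
    private final List<String> nameNodes;
    private final long quarantineMillis;
    // NameNode URL -> time (ms) at which it may be tried again
    private final Map<String, Long> quarantinedUntil = new ConcurrentHashMap<>();

    SafeModeAwareSelector(List<String> nameNodes, long quarantineMillis) {
        if (nameNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one NameNode is required");
        }
        this.nameNodes = new ArrayList<>(nameNodes);
        this.quarantineMillis = quarantineMillis;
    }

    /** Record that a NameNode answered with a safe-mode error at nowMillis. */
    void markSafeMode(String nameNode, long nowMillis) {
        quarantinedUntil.put(nameNode, nowMillis + quarantineMillis);
    }

    /**
     * Return the first NameNode that is not quarantined at nowMillis.
     * If every NameNode is quarantined, fall back to the one whose
     * quarantine ends soonest rather than failing outright.
     */
    String select(long nowMillis) {
        String fallback = nameNodes.get(0);
        long earliest = Long.MAX_VALUE;
        for (String nn : nameNodes) {
            Long until = quarantinedUntil.get(nn);
            if (until == null || until <= nowMillis) {
                quarantinedUntil.remove(nn); // quarantine expired or never set
                return nn;
            }
            if (until < earliest) {
                earliest = until;
                fallback = nn;
            }
        }
        return fallback;
    }
}
```

In a dispatch like the one quoted above, the SafeModeException branch would presumably call something like markSafeMode() for the current NameNode and then fail over to whatever select() returns, instead of retrying the same node for the full retry window. Whether the real fix tracks state this way is an assumption on my part.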