[
https://issues.apache.org/jira/browse/KNOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603333#comment-16603333
]
Matthew Sharp commented on KNOX-1093:
-------------------------------------
Replacing retryRequest() with failoverRequest() does result in the expected
behavior mentioned above. I have attached a patch that makes the replacement
and removes the now-unused retryRequest() method and its variables.
The test cluster shows the failover occurring properly:
2018-09-04 12:01:21,495 INFO knox.gateway
(AbstractHdfsHaDispatch.java:executeRequest(85)) - Received an error from a
node in SafeMode: org.apache.knox.gateway.hdfs.dispatch.SafeModeException
2018-09-04 12:01:21,496 INFO knox.gateway
(AbstractHdfsHaDispatch.java:failoverRequest(115)) - Failing over request to a
different server:
http://host1.test.com:50070/webhdfs/v1/user/matt/test.txt?op=CREATE&doAs=matt
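To illustrate why the swap matters, here is a minimal, self-contained model of
the two handling strategies. It is only a sketch, not the Knox code: the class
SafeModeFailoverSketch, the methods retryOnSafeMode() and failoverOnSafeMode(),
and the host name host0.test.com are hypothetical, and the sleep between
retries is left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of the behavior discussed in KNOX-1093; not Knox code.
class SafeModeFailoverSketch {

    /** Stand-in for the error returned by a NameNode in safemode. */
    static class SafeModeException extends Exception {}

    private final List<String> nameNodes;
    private final String safeModeNode;
    private int active = 0;   // index of the node currently targeted
    int attempts = 0;         // number of executeRequest() calls made

    SafeModeFailoverSketch(List<String> nameNodes, String safeModeNode) {
        this.nameNodes = nameNodes;
        this.safeModeNode = safeModeNode;
    }

    /** Sends the request to the currently targeted node. */
    private String executeRequest() throws SafeModeException {
        attempts++;
        String node = nameNodes.get(active);
        if (node.equals(safeModeNode)) {
            throw new SafeModeException();
        }
        return node;
    }

    /** Retry handling: the target never changes, so every attempt hits the same node. */
    String retryOnSafeMode(int maxRetryAttempts) {
        for (int i = 0; i <= maxRetryAttempts; i++) {
            try {
                return executeRequest();
            } catch (SafeModeException e) {
                // The real dispatch would sleep for retrySleep ms here.
            }
        }
        return null; // gave up without ever reaching a healthy node
    }

    /** Failover handling: switch to the next node before trying again. */
    String failoverOnSafeMode(int maxFailoverAttempts) {
        for (int i = 0; i <= maxFailoverAttempts; i++) {
            try {
                return executeRequest();
            } catch (SafeModeException e) {
                active = (active + 1) % nameNodes.size();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // host0.test.com is an assumed name for the node stuck in safemode.
        List<String> nodes = Arrays.asList("host0.test.com", "host1.test.com");

        SafeModeFailoverSketch retry = new SafeModeFailoverSketch(nodes, "host0.test.com");
        System.out.println("retry    -> " + retry.retryOnSafeMode(3)
                + " after " + retry.attempts + " attempts");

        SafeModeFailoverSketch failover = new SafeModeFailoverSketch(nodes, "host0.test.com");
        System.out.println("failover -> " + failover.failoverOnSafeMode(3)
                + " after " + failover.attempts + " attempts");
    }
}
```

In this model the retry path uses up all of its attempts against the node in
safemode, while the failover path reaches the other node on its second attempt.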
> KNOX Not Handling safemode state of one of the NameNode In HA state
> --------------------------------------------------------------------
>
> Key: KNOX-1093
> URL: https://issues.apache.org/jira/browse/KNOX-1093
> Project: Apache Knox
> Issue Type: Bug
> Components: Server
> Affects Versions: 0.10.0
> Reporter: Rajesh Chandramohan
> Priority: Major
> Fix For: 1.2.0
>
> Attachments: KNOX-1093.patch
>
>
> In WebHdfsHaDispatch.java, when a SafeModeException occurs the code calls
> retryRequest(), which calls executeRequest() again just as failoverRequest()
> does. However, the NameNode used by the thread does not change on any
> iteration until maxRetryAttempts=300 is reached, with retrySleep=1000 (1 sec)
> between attempts.
> After a maximum of 5 minutes, client retries should at least pick the right
> NameNode on the next attempt.
> But if we need to copy a set of files within a stipulated time, X% of the
> connections land on this NameNode and fail. Can we handle that better?
> {code:java}
>   try {
>     inboundResponse = executeOutboundRequest(outboundRequest);
>     writeOutboundResponse(outboundRequest, inboundRequest,
>         outboundResponse, inboundResponse);
>   } catch (StandbyException e) {
>     LOG.errorReceivedFromStandbyNode(e);
>     failoverRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
>   } catch (SafeModeException e) {
>     LOG.errorReceivedFromSafeModeNode(e);
>     retryRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
>   } catch (IOException e) {
>     LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
>     failoverRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
>   }
> }
> {code}
> The SafeModeException handling in the Knox HA dispatch code needs to change
> so that it flags the NameNode that is stuck in safemode, maintains a "don't
> try" queue, and redirects all further connections only to the healthy active
> NameNode. This way we can handle the X% of failures. What do you think?
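The "don't try" queue proposed in the quoted description could look roughly
like the sketch below. It is only an illustration under assumptions, not part
of the attached patch or of Knox: the class SafeModeSkipListSketch, its methods
markSafeMode() and pickNode(), and the idea of letting a flag expire after a
fixed time are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a "don't try" list for NameNodes in safemode; not Knox code.
class SafeModeSkipListSketch {

    private final List<String> nameNodes;
    private final long skipMillis; // how long a flagged node is avoided (assumed policy)
    private final Map<String, Long> skipUntil = new ConcurrentHashMap<>();

    SafeModeSkipListSketch(List<String> nameNodes, long skipMillis) {
        this.nameNodes = nameNodes;
        this.skipMillis = skipMillis;
    }

    /** Flags a node that answered with a safemode error. */
    void markSafeMode(String node, long nowMillis) {
        skipUntil.put(node, nowMillis + skipMillis);
    }

    /** Returns the first node that is not currently flagged. */
    String pickNode(long nowMillis) {
        for (String node : nameNodes) {
            Long until = skipUntil.get(node);
            if (until == null || nowMillis >= until) {
                return node;
            }
        }
        // Every node is flagged: fall back to the first one rather than fail outright.
        return nameNodes.get(0);
    }
}
```

Because the flags are shared, a thread that sees the safemode error would steer
later requests from other threads away from that node until the flag expires.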