[jira] [Commented] (KNOX-1093) KNOX Not Handling safemode state of one of the NameNode In HA state

Matthew Sharp (JIRA) Tue, 04 Sep 2018 10:08:09 -0700


    [ 
https://issues.apache.org/jira/browse/KNOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603333#comment-16603333
 ]


Matthew Sharp commented on KNOX-1093:
-------------------------------------

Replacing retryRequest() with failoverRequest() does result in the expected 
behavior mentioned above.  I attached a patch that makes this replacement and 
removes the now-unused retryRequest() method and its variables. 
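
For readers without the patch at hand, here is a rough, self-contained sketch 
of the behavior change. It is not the patch itself and does not use the Knox 
classes: SafeModeFailoverSketch, Backend, and the nested SafeModeException are 
hypothetical stand-ins, and the round-robin choice of the next NameNode is an 
assumption made only for illustration. The point is that a safe-mode error now 
moves the request to a different NameNode instead of retrying the same one.

```java
import java.io.IOException;
import java.util.List;

// Illustrative only: hypothetical names, not the Knox dispatch classes.
class SafeModeFailoverSketch {
    // Stand-in for the safe-mode error reported by a NameNode.
    static class SafeModeException extends IOException {}

    // Stand-in for "send the outbound request to this NameNode".
    interface Backend {
        String call(String nameNodeUrl) throws IOException;
    }

    private final List<String> nameNodes;
    private final int maxFailoverAttempts;
    private int activeIndex = 0;

    SafeModeFailoverSketch(List<String> nameNodes, int maxFailoverAttempts) {
        if (nameNodes.isEmpty() || maxFailoverAttempts < 0) {
            throw new IllegalArgumentException("need a NameNode and a non-negative attempt count");
        }
        this.nameNodes = nameNodes;
        this.maxFailoverAttempts = maxFailoverAttempts;
    }

    // Assumed policy: move to the next NameNode in the list, wrapping around.
    private void failover() {
        activeIndex = (activeIndex + 1) % nameNodes.size();
    }

    String execute(Backend backend) throws IOException {
        IOException lastError = null;
        for (int attempt = 0; attempt <= maxFailoverAttempts; attempt++) {
            try {
                return backend.call(nameNodes.get(activeIndex));
            } catch (SafeModeException e) {
                // Previously this path retried the same NameNode; now it fails over.
                lastError = e;
                failover();
            } catch (IOException e) {
                // Connection errors already failed over before the change.
                lastError = e;
                failover();
            }
        }
        throw lastError;
    }
}
```

With two NameNodes where the first is in safe mode, a single call to execute() 
returns the response from the second one after one failover.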

 

The test cluster shows the failover occurring properly:

2018-09-04 12:01:21,495 INFO knox.gateway (AbstractHdfsHaDispatch.java:executeRequest(85)) - Received an error from a node in SafeMode: org.apache.knox.gateway.hdfs.dispatch.SafeModeException
2018-09-04 12:01:21,496 INFO knox.gateway (AbstractHdfsHaDispatch.java:failoverRequest(115)) - Failing over request to a different server: http://host1.test.com:50070/webhdfs/v1/user/matt/test.txt?op=CREATE&doAs=matt

> KNOX Not Handling safemode state of one of the NameNode In HA state 
> --------------------------------------------------------------------
>
>                 Key: KNOX-1093
>                 URL: https://issues.apache.org/jira/browse/KNOX-1093
>             Project: Apache Knox
>          Issue Type: Bug
>          Components: Server
>    Affects Versions: 0.10.0
>            Reporter: Rajesh Chandramohan
>            Priority: Major
>             Fix For: 1.2.0
>
>         Attachments: KNOX-1093.patch
>
>
> Per the code in WebHdfsHaDispatch.java, when a SafeModeException occurs it 
> calls the retryRequest() method, which calls executeRequest() just as the 
> failover request does, but the NameNode used by the thread does not change 
> across any of its iterations, up to maxRetryAttempts=300 
> with retrySleep=1000 (1 sec). 
> After a maximum of 5 minutes, client retries should pick the right NameNode, 
> at least on the next attempt.
> But if we need to copy a set of files within a stipulated time, X% of the 
> connections land on this NameNode and fail. Can we handle that 
> better?
> {code:java}
>       try {
>          inboundResponse = executeOutboundRequest(outboundRequest);
>          writeOutboundResponse(outboundRequest, inboundRequest, outboundResponse, inboundResponse);
>       } catch (StandbyException e) {
>          LOG.errorReceivedFromStandbyNode(e);
>          failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       } catch (SafeModeException e) {
>          LOG.errorReceivedFromSafeModeNode(e);
>          retryRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       } catch (IOException e) {
>          LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
>          failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       }
>    }
> {code}
> The SafeModeException handling in the Knox HA dispatch code needs to change 
> so that it flags the NameNode that is stuck in safe mode, maintains a 
> "don't try" queue, and redirects all further connections only to the healthy 
> active NameNode. This way we can handle that X% of failures. What do we think?
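
One possible shape for the "don't try" bookkeeping proposed in the description 
is sketched below. This is not existing Knox code: SafeModeSkipList, its 
methods, and the fixed quarantine period are all assumptions made for 
illustration. A NameNode reported in safe mode is skipped until its quarantine 
expires, and new requests go to the first NameNode that is not skipped.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a hypothetical helper, not part of Knox.
class SafeModeSkipList {
    // NameNode URL -> time (ms) until which it should not be tried.
    private final Map<String, Long> skipUntilMillis = new ConcurrentHashMap<>();
    private final long quarantineMillis;

    SafeModeSkipList(long quarantineMillis) {
        this.quarantineMillis = quarantineMillis;
    }

    // Record that this NameNode answered with a safe-mode error at nowMillis.
    void markSafeMode(String nameNodeUrl, long nowMillis) {
        skipUntilMillis.put(nameNodeUrl, nowMillis + quarantineMillis);
    }

    // True while the NameNode is still inside its quarantine window.
    boolean isSkipped(String nameNodeUrl, long nowMillis) {
        Long until = skipUntilMillis.get(nameNodeUrl);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            // Quarantine expired: allow the NameNode to be tried again.
            skipUntilMillis.remove(nameNodeUrl, until);
            return false;
        }
        return true;
    }

    // First NameNode in the list that is not currently skipped, if any.
    Optional<String> pick(List<String> nameNodes, long nowMillis) {
        for (String nameNode : nameNodes) {
            if (!isSkipped(nameNode, nowMillis)) {
                return Optional.of(nameNode);
            }
        }
        return Optional.empty();
    }
}
```

Passing the clock value in as a parameter keeps the sketch easy to test. A real 
implementation would also need to decide what to do when every NameNode is 
quarantined.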



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
