Re: Review Request 36231: Revert, The Default hdfs-site.xml Should Have Client Retry Logic Enabled For Rolling Upgrade

Alejandro Fernandez Tue, 07 Jul 2015 10:57:52 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36231/
-----------------------------------------------------------


(Updated July 7, 2015, 5:57 p.m.)


Review request for Ambari, Jonathan Hurley and Nate Cole.


Bugs: AMBARI-11192
    https://issues.apache.org/jira/browse/AMBARI-11192


Repository: ambari


Description
-------

In the case of an HA cluster where the former primary NN was killed "dirty", by 
catastrophic power-down or equivalent, and the cluster has successfully failed 
over to the other NN, a client that first attempts to contact the dead NN takes 
10 minutes to switch to the other NN.

In Ambari 2.0 and HDP 2.2, dfs.client.retry.policy.enabled was not set at all.
Recently, in Ambari 2.1 for HDP 2.3, it was defaulted to true as part of 
AMBARI-11192.
However, this causes problems during Rolling Upgrade (RU).

In an HA setup, retries should actually be handled by RetryInvocationHandler 
using the retry policy FailoverOnNetworkExceptionRetry. The client first 
translates the nameservice ID into two host names and creates an individual 
RPC proxy for each NameNode. Each individual NameNode proxy still uses 
MultipleLinearRandomRetry as its local retry policy, but because we usually set 
dfs.client.retry.policy.enabled to false, this internal retry is disabled. 
If we then hit a connection issue or remote exception (including 
StandbyException), the exception is caught by RetryInvocationHandler and 
handled according to FailoverOnNetworkExceptionRetry. In this way the client 
can fail over to the other NameNode immediately instead of repeatedly retrying 
the same NameNode.
However, because dfs.client.retry.policy.enabled is set to true here, 
MultipleLinearRandomRetry is triggered inside the internal NameNode proxy, so 
we have to wait 10+ minutes. The exception is thrown to RetryInvocationHandler 
only after all the MultipleLinearRandomRetry retries have failed.
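
The layering described above can be illustrated with a small toy model. This is 
only a sketch, not Hadoop code: the class RetryLayeringSketch, the NameNode 
interface, the method names and the retry count are all hypothetical stand-ins, 
and the real policies also sleep between attempts, which is omitted here. It 
counts how many times the client hits the dead NameNode before failing over.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the two retry layers; hypothetical names, not Hadoop classes. */
public class RetryLayeringSketch {

    /** Hypothetical stand-in for the RPC proxy of a single NameNode. */
    public interface NameNode {
        String call() throws IOException;
    }

    /** Inner layer: plays the role of the per-proxy MultipleLinearRandomRetry. */
    public static String callWithLocalRetry(NameNode nn, boolean retryEnabled,
                                            int maxRetries) throws IOException {
        // With the local policy disabled, the proxy makes exactly one attempt.
        int attempts = retryEnabled ? maxRetries + 1 : 1;
        IOException last = new IOException("no attempts made");
        for (int i = 0; i < attempts; i++) {
            try {
                return nn.call();
            } catch (IOException e) {
                last = e; // the real policy would sleep here before retrying
            }
        }
        throw last;
    }

    /** Outer layer: plays the role of RetryInvocationHandler with failover. */
    public static String callWithFailover(List<NameNode> nameNodes,
                                          boolean retryEnabled,
                                          int maxRetries) throws IOException {
        IOException last = new IOException("no NameNodes configured");
        for (NameNode nn : nameNodes) {
            try {
                return callWithLocalRetry(nn, retryEnabled, maxRetries);
            } catch (IOException e) {
                last = e; // only now does the outer layer see the failure
            }
        }
        throw last;
    }

    /** Returns how many calls reach the dead NameNode before failover succeeds. */
    public static int attemptsOnDeadNameNode(boolean retryEnabled, int maxRetries) {
        AtomicInteger deadAttempts = new AtomicInteger();
        NameNode dead = () -> {
            deadAttempts.incrementAndGet();
            throw new IOException("connection refused");
        };
        NameNode active = () -> "ok";
        try {
            callWithFailover(Arrays.asList(dead, active), retryEnabled, maxRetries);
        } catch (IOException e) {
            throw new IllegalStateException("failover unexpectedly failed", e);
        }
        return deadAttempts.get();
    }

    public static void main(String[] args) {
        // maxRetries = 5 is an arbitrary illustrative value.
        System.out.println("retry disabled: " + attemptsOnDeadNameNode(false, 5));
        System.out.println("retry enabled:  " + attemptsOnDeadNameNode(true, 5));
    }
}
```

In this model, disabling the inner retry means the dead NameNode is tried once 
before failover, while enabling it means every inner retry must be exhausted 
first. That is the behavior this revert is meant to avoid.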


Diffs
-----

  
  ambari-server/src/main/java/org/apache/ambari/server/checks/CheckDescription.java 5e029f4 
  ambari-server/src/main/java/org/apache/ambari/server/checks/ClientRetryPropertyCheck.java 4beba33 
  ambari-server/src/main/resources/stacks/HDP/2.2/services/HDFS/configuration/hdfs-site.xml e42b3f8 
  ambari-server/src/test/java/org/apache/ambari/server/checks/ClientRetryPropertyCheckTest.java d3fd187 

Diff: https://reviews.apache.org/r/36231/diff/


Testing
-------

Unit tests passed:

----------------------------------------------------------------------
Total run:761
Total errors:0
Total failures:0
OK

I deployed my changes to a brand new cluster and it correctly set the hdfs-site 
property dfs.client.retry.policy.enabled to false.


Thanks,

Alejandro Fernandez
