> On July 6, 2015, 8:22 p.m., Jonathan Hurley wrote:
> > Won't reverting this patch cause problems with the long running tests?
> 
> Alejandro Fernandez wrote:
>     Basically, HDFS has two layers of retries, and only one of them should
>     really be used during HA. When one NameNode is up, setting
>     dfs.client.retry.policy.enabled to false allows the client to quickly
>     find the active NameNode (in the case of HA, when the client first tried
>     to connect to the dead NameNode), or to retry for up to 3 minutes in
>     non-HA.
>     
>     Comment from Jing:
>     "dfs.client.retry.policy.enabled" is only used by clients connecting to
>     the NameNode, so the change will not affect other components. Meanwhile,
>     after setting the property back to false, retries are controlled only by
>     FailoverOnNetworkExceptionRetry, which by default retries up to 10 times
>     and fails over up to 15 times. In case both NameNodes are down, the
>     total retry time will be much less than with
>     "dfs.client.retry.policy.enabled" set to true, because only one retry
>     policy takes effect. If both NameNodes are in the standby state, the
>     total retry time is not affected.
> 
> Jonathan Hurley wrote:
>     From the 2.2 to 2.3 runbook, section entitled "Cluster Prerequisites":
>     
>     • Enable client retry properties for HDFS, Hive, and Oozie. These
>       properties are not included by default, so you might need to add them
>       to the site files.
>     • For HDFS, set dfs.client.retry.policy.enabled to true in hdfs-site.xml
>       on all nodes with HDFS services.
>     
>     So, is the runbook wrong?
> 
> Alejandro Fernandez wrote:
>     The HDFS team realized that setting is wrong. I'll ask them to update
>     the runbook.
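For reference, the revert being discussed amounts to the following hdfs-site.xml fragment. This is an illustrative sketch of just the one property, not the full stack configuration file in the diff:

```xml
<!-- Illustrative hdfs-site.xml fragment: the client-side retry toggle
     being reverted. With false, the per-NameNode internal retry
     (MultipleLinearRandomRetry) is disabled and HA failover behavior is
     governed by FailoverOnNetworkExceptionRetry instead. -->
<property>
  <name>dfs.client.retry.policy.enabled</name>
  <value>false</value>
</property>
```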
That's what I needed to hear! :)

- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36231/#review90613
-----------------------------------------------------------


On July 6, 2015, 7:58 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36231/
> -----------------------------------------------------------
> 
> (Updated July 6, 2015, 7:58 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> In the case of an HA cluster where the former primary NameNode was killed
> "dirty" (by a catastrophic power-down or equivalent) and the cluster has
> successfully failed over to the other NameNode, a client that first attempts
> to contact the dead NameNode takes 10 minutes to switch to the other one.
> 
> In Ambari 2.0 and HDP 2.2, dfs.client.retry.policy.enabled was not set at
> all. Recently, in Ambari 2.1 for HDP 2.3, it was defaulted to true as part
> of AMBARI-11192. However, this causes problems during Rolling Upgrade.
> 
> In an HA setup, retries should actually be handled by RetryInvocationHandler
> using the retry policy FailoverOnNetworkExceptionRetry. The client first
> translates the nameservice ID into two host names and creates an individual
> RPC proxy for each NameNode. Each individual NameNode proxy still uses
> MultipleLinearRandomRetry as its local retry policy, but because we usually
> set dfs.client.retry.policy.enabled to false, this internal retry is
> disabled. Then, if we hit any connection issue or remote exception
> (including StandbyException), the exception is caught by
> RetryInvocationHandler and handled according to
> FailoverOnNetworkExceptionRetry.
> In this way the client can fail over to the other NameNode immediately
> instead of repeatedly retrying the same NameNode. However, with
> dfs.client.retry.policy.enabled set to true, MultipleLinearRandomRetry is
> triggered inside the internal NameNode proxy, so we have to wait 10+
> minutes: the exception is only propagated to RetryInvocationHandler after
> all the retries of MultipleLinearRandomRetry have failed.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/CheckDescription.java 5e029f4 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/ClientRetryPropertyCheck.java 4beba33 
>   ambari-server/src/main/resources/stacks/HDP/2.2/services/HDFS/configuration/hdfs-site.xml e42b3f8 
>   ambari-server/src/test/java/org/apache/ambari/server/checks/ClientRetryPropertyCheckTest.java d3fd187 
> 
> Diff: https://reviews.apache.org/r/36231/diff/
> 
> 
> Testing
> -------
> 
> Unit tests passed:
> 
> ----------------------------------------------------------------------
> Total run: 761
> Total errors: 0
> Total failures: 0
> OK
> 
> I deployed my changes to a brand new cluster, and it correctly set the
> hdfs-site property dfs.client.retry.policy.enabled to false.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>
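To make the "10+ min" figure above concrete, here is a small illustrative sketch (not Ambari or HDFS source code) that sums the worst-case wait implied by a MultipleLinearRandomRetry spec string of "sleepMillis,numRetries" pairs. The spec value "10000,6,60000,10" used below is assumed to be the HDFS default (dfs.client.retry.policy.spec); with it, a client stuck on a dead NameNode sleeps 6 × 10 s plus 10 × 60 s before the exception finally escapes to RetryInvocationHandler.

```python
def total_retry_wait_seconds(spec: str) -> float:
    """Worst-case total sleep time for a MultipleLinearRandomRetry-style
    spec string of comma-separated (sleepMillis, numRetries) pairs."""
    values = [int(v) for v in spec.split(",")]
    pairs = zip(values[0::2], values[1::2])  # (sleepMillis, numRetries)
    return sum(sleep_ms * retries for sleep_ms, retries in pairs) / 1000.0

# Assumed default spec: sleep 10s up to 6 times, then 60s up to 10 times.
print(total_retry_wait_seconds("10000,6,60000,10"))  # 660.0 s = 11 minutes
```

This is why the patch flips dfs.client.retry.policy.enabled back to false: with the internal policy disabled, a connection failure reaches RetryInvocationHandler immediately and FailoverOnNetworkExceptionRetry can switch to the other NameNode right away.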
