> On July 7, 2015, 12:22 a.m., Jonathan Hurley wrote:
> > Won't reverting this patch cause problems with the long-running tests?
Basically, HDFS has two layers of retries, and only one of them should
really be used during HA. When one NameNode is up, setting
dfs.client.retry.policy.enabled to false allows the client to quickly find
the active NameNode (in the HA case, when the client first tries to connect
to the dead NameNode), or to retry for up to 3 minutes in non-HA.

Comment from Jing: "dfs.client.retry.policy.enabled" is only used for
clients connecting to the NameNode, so the change will not affect other
components. Meanwhile, after setting the property back to false, the retry
is controlled only by FailoverOnNetworkExceptionRetry, which can retry up
to 10 times and fail over 15 times by default. If both NameNodes are down,
the total retry time will be much shorter than with
"dfs.client.retry.policy.enabled" set to true, because only one retry
policy takes effect. If both NameNodes are in the standby state, the total
retry time is not affected.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36231/#review90613
-----------------------------------------------------------


On July 6, 2015, 11:58 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36231/
> -----------------------------------------------------------
> 
> (Updated July 6, 2015, 11:58 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> In the case of an HA cluster where the former primary NN was killed
> "dirty", by catastrophic power-down or equivalent, and the cluster has
> successfully failed over to the other NN, a client that first attempts to
> contact the dead NN takes 10 minutes to switch to the other NN.
> 
> In Ambari 2.0 and HDP 2.2, dfs.client.retry.policy.enabled was not set at
> all. Recently, in Ambari 2.1 for HDP 2.3, it was defaulted to true as
> part of AMBARI-11192. However, this causes problems during a Rolling
> Upgrade (RU).
> 
> In an HA setup, retries should actually be handled by
> RetryInvocationHandler using the FailoverOnNetworkExceptionRetry policy.
> The client first translates the nameservice ID into two host names and
> creates an individual RPC proxy for each NameNode. Each individual
> NameNode proxy still uses MultipleLinearRandomRetry as its local retry
> policy, but because we usually set dfs.client.retry.policy.enabled to
> false, this internal retry is normally disabled. Then, if we hit a
> connection issue or a remote exception (including StandbyException), it
> is caught by RetryInvocationHandler and handled according to
> FailoverOnNetworkExceptionRetry. This way the client can fail over to the
> other NameNode immediately instead of repeatedly retrying the same one.
> However, with dfs.client.retry.policy.enabled set to true,
> MultipleLinearRandomRetry is triggered inside the internal NameNode
> proxy, so we have to wait 10+ minutes: the exception only reaches
> RetryInvocationHandler after all of the MultipleLinearRandomRetry
> attempts fail.
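
To make the two layers concrete, here is a minimal sketch against the real
org.apache.hadoop.io.retry API (assuming Hadoop 2.x on the classpath). The
wiring is illustrative, not DFSClient's actual proxy construction, and the
numeric values mirror the defaults quoted in this thread:

    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class RetryLayersSketch {
        public static void main(String[] args) {
            // Outer, HA-level policy applied by RetryInvocationHandler: on a
            // network error or StandbyException, fail over to the other NameNode.
            RetryPolicy haPolicy = RetryPolicies.failoverOnNetworkException(
                    RetryPolicies.TRY_ONCE_THEN_FAIL, // fallback for other errors
                    15,      // max failovers (dfs.client.failover.max.attempts)
                    10,      // max retries   (dfs.client.retry.max.attempts)
                    500L,    // failover sleep base, ms
                    15000L); // failover sleep max, ms

            // Inner, per-NameNode-proxy policy. With
            // dfs.client.retry.policy.enabled=true this becomes
            // MultipleLinearRandomRetry (default dfs.client.retry.policy.spec
            // "10000,6,60000,10": 6 retries at 10 s plus 10 at 60 s, roughly
            // 11 minutes), and the exception only reaches haPolicy after it is
            // exhausted. With the property set to false, the inner layer gives
            // up immediately and failover is effectively instant.
            RetryPolicy innerPolicy = RetryPolicies.TRY_ONCE_THEN_FAIL;

            System.out.println(haPolicy + " / " + innerPolicy);
        }
    }
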
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/CheckDescription.java 5e029f4 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/ClientRetryPropertyCheck.java 4beba33 
>   ambari-server/src/main/resources/stacks/HDP/2.2/services/HDFS/configuration/hdfs-site.xml e42b3f8 
>   ambari-server/src/test/java/org/apache/ambari/server/checks/ClientRetryPropertyCheckTest.java d3fd187 
> 
> Diff: https://reviews.apache.org/r/36231/diff/
> 
> 
> Testing
> -------
> 
> Unit tests passed:
> 
> ----------------------------------------------------------------------
> Total run: 761
> Total errors: 0
> Total failures: 0
> OK
> 
> I deployed my changes to a brand new cluster and it correctly set the
> hdfs-site property dfs.client.retry.policy.enabled to false.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
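
For reference, the reverted default that the Testing section mentions would
look roughly like this in the stack's hdfs-site.xml (a minimal sketch: only
this property is shown, and the description text is illustrative, not taken
from the patch):

    <property>
      <name>dfs.client.retry.policy.enabled</name>
      <value>false</value>
      <description>
        Keep the per-proxy MultipleLinearRandomRetry policy disabled so
        that, under HA, retries are driven only by
        FailoverOnNetworkExceptionRetry.
      </description>
    </property>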
