> On July 6, 2015, 8:22 p.m., Jonathan Hurley wrote:
> > Won't reverting this patch cause problems with the long running tests?
> 
> Alejandro Fernandez wrote:
>     Basically, HDFS has two layers of retries, and only one of them should
>     really be used during HA. When one NameNode is up, setting
>     dfs.client.retry.policy.enabled to false allows the client to quickly
>     find the active NameNode (in the case of HA, when the client first tried
>     to connect to the dead NameNode), or to retry for up to 3 minutes in
>     non-HA.
>     
>     Comment from Jing:
>     "dfs.client.retry.policy.enabled" is only used by clients connecting to
>     the NameNode, so the change will not affect other components. Meanwhile,
>     after setting the property back to false, retries are controlled only by
>     FailoverOnNetworkExceptionRetry, which by default retries up to 10 times
>     and fails over up to 15 times. In case both NameNodes are down, the
>     total retry time will be much less than with
>     "dfs.client.retry.policy.enabled" set to true, because only one retry
>     policy takes effect. If both NameNodes are in the standby state, the
>     total retry time is not affected.
> 
> Jonathan Hurley wrote:
>     From the 2.2 to 2.3 runbook, section entitled "Cluster Prerequisites":
>     
>     • Enable client retry properties for HDFS, Hive, and Oozie. These
>       properties are not included by default, so you might need to add them
>       to the site files.
>     • For HDFS, set dfs.client.retry.policy.enabled to true in hdfs-site.xml
>       on all nodes with HDFS services.
>     
>     So, is the runbook wrong?
> 
> Alejandro Fernandez wrote:
>     The HDFS team realized that setting is wrong. I'll ask them to update
>     the runbook.
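For reference, the revert being discussed amounts to the following hdfs-site.xml fragment. This is an illustrative sketch of just the one property, not the full stack configuration file in the diff:

```xml
<!-- Illustrative hdfs-site.xml fragment: the client-side retry toggle
     being reverted. With false, the per-NameNode internal retry
     (MultipleLinearRandomRetry) is disabled and HA failover behavior is
     governed by FailoverOnNetworkExceptionRetry instead. -->
<property>
  <name>dfs.client.retry.policy.enabled</name>
  <value>false</value>
</property>
```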
That's what I needed to hear! :)

- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36231/#review90613
-----------------------------------------------------------


On July 6, 2015, 7:58 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36231/
> -----------------------------------------------------------
> 
> (Updated July 6, 2015, 7:58 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> In the case of an HA cluster where the former primary NameNode was killed
> "dirty" (by a catastrophic power-down or equivalent) and the cluster has
> successfully failed over to the other NameNode, a client that first attempts
> to contact the dead NameNode takes 10 minutes to switch to the other one.
> 
> In Ambari 2.0 and HDP 2.2, dfs.client.retry.policy.enabled was not set at
> all. Recently, in Ambari 2.1 for HDP 2.3, it was defaulted to true as part
> of AMBARI-11192. However, this causes problems during Rolling Upgrade.
> 
> In an HA setup, retries should actually be handled by RetryInvocationHandler
> using the retry policy FailoverOnNetworkExceptionRetry. The client first
> translates the nameservice ID into two host names and creates an individual
> RPC proxy for each NameNode. Each individual NameNode proxy still uses
> MultipleLinearRandomRetry as its local retry policy, but because we usually
> set dfs.client.retry.policy.enabled to false, this internal retry is
> disabled. Then, if we hit any connection issue or remote exception
> (including StandbyException), the exception is caught by
> RetryInvocationHandler and handled according to
> FailoverOnNetworkExceptionRetry.
> In this way the client can fail over to the other NameNode immediately
> instead of repeatedly retrying the same NameNode. However, with
> dfs.client.retry.policy.enabled set to true, MultipleLinearRandomRetry is
> triggered inside the internal NameNode proxy, so we have to wait 10+
> minutes: the exception is only propagated to RetryInvocationHandler after
> all the retries of MultipleLinearRandomRetry have failed.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/CheckDescription.java 5e029f4 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/ClientRetryPropertyCheck.java 4beba33 
>   ambari-server/src/main/resources/stacks/HDP/2.2/services/HDFS/configuration/hdfs-site.xml e42b3f8 
>   ambari-server/src/test/java/org/apache/ambari/server/checks/ClientRetryPropertyCheckTest.java d3fd187 
> 
> Diff: https://reviews.apache.org/r/36231/diff/
> 
> 
> Testing
> -------
> 
> Unit tests passed:
> 
> ----------------------------------------------------------------------
> Total run: 761
> Total errors: 0
> Total failures: 0
> OK
> 
> I deployed my changes to a brand new cluster, and it correctly set the
> hdfs-site property dfs.client.retry.policy.enabled to false.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>
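To make the "10+ min" figure above concrete, here is a small illustrative sketch (not Ambari or HDFS source code) that sums the worst-case wait implied by a MultipleLinearRandomRetry spec string of "sleepMillis,numRetries" pairs. The spec value "10000,6,60000,10" used below is assumed to be the HDFS default (dfs.client.retry.policy.spec); with it, a client stuck on a dead NameNode sleeps 6 × 10 s plus 10 × 60 s before the exception finally escapes to RetryInvocationHandler.

```python
def total_retry_wait_seconds(spec: str) -> float:
    """Worst-case total sleep time for a MultipleLinearRandomRetry-style
    spec string of comma-separated (sleepMillis, numRetries) pairs."""
    values = [int(v) for v in spec.split(",")]
    pairs = zip(values[0::2], values[1::2])  # (sleepMillis, numRetries)
    return sum(sleep_ms * retries for sleep_ms, retries in pairs) / 1000.0

# Assumed default spec: sleep 10s up to 6 times, then 60s up to 10 times.
print(total_retry_wait_seconds("10000,6,60000,10"))  # 660.0 s = 11 minutes
```

This is why the patch flips dfs.client.retry.policy.enabled back to false: with the internal policy disabled, a connection failure reaches RetryInvocationHandler immediately and FailoverOnNetworkExceptionRetry can switch to the other NameNode right away.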
