-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36231/#review90739
-----------------------------------------------------------
Ship it!

Ship It!

- Jonathan Hurley


On July 6, 2015, 7:58 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36231/
> -----------------------------------------------------------
> 
> (Updated July 6, 2015, 7:58 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> In the case of an HA cluster where the former primary NameNode was killed
> "dirty" (by a catastrophic power-down or equivalent) and the cluster has
> successfully failed over to the other NameNode, a client that first attempts
> to contact the dead NameNode takes 10 minutes to switch to the other one.
> 
> In Ambari 2.0 and HDP 2.2, dfs.client.retry.policy.enabled was not set at all.
> Recently, in Ambari 2.1 for HDP 2.3, it was defaulted to true as part of
> AMBARI-11192. However, this causes problems during rolling upgrade (RU).
> 
> In an HA setup, client retries should actually be handled by
> RetryInvocationHandler using the retry policy FailoverOnNetworkExceptionRetry.
> The client first translates the nameservice ID into two host names and
> creates an individual RPC proxy for each NameNode. Each individual NameNode
> proxy still uses MultipleLinearRandomRetry as its local retry policy, but
> because dfs.client.retry.policy.enabled is usually set to false, this
> internal retry is effectively disabled. Then, when the client hits a
> connection issue or a remote exception (including StandbyException), the
> exception is caught by RetryInvocationHandler and handled according to
> FailoverOnNetworkExceptionRetry, so the client fails over to the other
> NameNode immediately instead of repeatedly retrying the same one.
> However, with dfs.client.retry.policy.enabled set to true,
> MultipleLinearRandomRetry is triggered inside the internal NameNode proxy,
> so we have to wait 10+ minutes: the exception is only propagated to
> RetryInvocationHandler after all of the MultipleLinearRandomRetry retries
> have failed.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/CheckDescription.java 5e029f4 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/ClientRetryPropertyCheck.java 4beba33 
>   ambari-server/src/main/resources/stacks/HDP/2.2/services/HDFS/configuration/hdfs-site.xml e42b3f8 
>   ambari-server/src/test/java/org/apache/ambari/server/checks/ClientRetryPropertyCheckTest.java d3fd187 
> 
> Diff: https://reviews.apache.org/r/36231/diff/
> 
> 
> Testing
> -------
> 
> Unit tests passed.
> 
> ----------------------------------------------------------------------
> Total run:761
> Total errors:0
> Total failures:0
> OK
> 
> I deployed my changes to a brand new cluster and it correctly set the
> hdfs-site property dfs.client.retry.policy.enabled to false.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
> 
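
For reference, the fix described above amounts to defaulting the hdfs-site
property dfs.client.retry.policy.enabled to false in the HDP 2.2 stack
definition. A minimal sketch of what that hdfs-site.xml entry would look like
follows; the property name and value come from the review, while the
description text is illustrative, not the exact wording in the stack file:

    <property>
      <name>dfs.client.retry.policy.enabled</name>
      <value>false</value>
      <!-- Illustrative description; not the exact text from the stack definition. -->
      <description>
        Disable the per-NameNode MultipleLinearRandomRetry policy so that, in an
        HA setup, connection failures and StandbyExceptions surface immediately
        to RetryInvocationHandler, letting the client fail over to the other
        NameNode instead of retrying the dead one for 10+ minutes.
      </description>
    </property>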
