> On July 28, 2017, 5:02 a.m., Sumit Mohanty wrote:
> > ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java
> > Line 126 (original), 127 (patched)
> > <https://reviews.apache.org/r/61203/diff/1/?file=1785078#file1785078line127>
> >
> >     What was the reason, both instances failed?
>
> Aravindan Vijayan wrote:
>     It is because of the code which deleted and recreated the znode whenever
>     a sub path is not found (i.e ZkNoNodeException is thrown from ZkHelixAdmin)
>
>     Let's say collectors A & B start this at the same time.
>
>     A : Check parent /ambari-metrics-cluster. Not found. Create parent /ambari-metrics-cluster
>     B : Check parent /ambari-metrics-cluster. Found. So return true.
>     B : Try to check child C2. Not yet created by A. ZkNoNodeException thrown.
>     B : Catch exception. Delete the entire znode.
>     A : Try to create a child node. Someone deleted the top level znode itself. ZkNoNodeException thrown.
>     A : Catch exception. Try to Delete the entire znode.
>     A : Deleted children C1, C3
>     B : Created /ambari-metrics-cluster and children nodes C1, C2, C3.
>     A : Deleted child C2.
>     B : Trying to delete root node. Failed since directory not empty --------> FAILED START
>     B : Finished creating /ambari-metrics-cluster.
>     A : Access C2. Not found. ------> FAILED START.

Small mistake; here is the corrected flow.

A : Check parent /ambari-metrics-cluster. Not found. Create parent /ambari-metrics-cluster
B : Check parent /ambari-metrics-cluster. Found. So return true.
B : Try to check child C2. Not yet created by A. ZkNoNodeException thrown.
B : Catch exception. Delete the entire znode.
A : Try to create a child node. Someone deleted the top level znode itself. ZkNoNodeException thrown.
A : Catch exception. Try to delete the entire znode.
A : Deleted children C1, C3
B : Created /ambari-metrics-cluster and children nodes C1, C2, C3.
A : Deleted child C2.
A : Trying to delete root node. Failed since directory not empty (C1 and C3 are there) --------> FAILED START
B : Finished creating /ambari-metrics-cluster.
B : Access C2. Not found. ------> FAILED START.


- Aravindan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61203/#review181646
-----------------------------------------------------------


On July 28, 2017, 4:50 a.m., Aravindan Vijayan wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61203/
> -----------------------------------------------------------
>
> (Updated July 28, 2017, 4:50 a.m.)
>
>
> Review request for Ambari, Dmytro Sen, Sumit Mohanty, and Sid Wagle.
>
>
> Bugs: AMBARI-21593
>     https://issues.apache.org/jira/browse/AMBARI-21593
>
>
> Repository: ambari
>
>
> Description
> -------
>
> PROBLEM
> When 2 metric collectors are started up simultaneously, both of them fail to
> start.
>
> BUG
> There exists a race condition in the Metric Collector HA controller
> initialization which was introduced through AMBARI-20179. When a helix
> controller instance finds that the /ambari-metrics-collector znode exists but
> a child node does not exist, it deletes the entire znode and recreates it. If
> another controller instance also initializes simultaneously, a race condition
> can occur wherein each instance will end up cancelling the effort of the
> other.
>
> FIX
> Do not delete and recreate the znode. Wait and retry for a few seconds to
> check if /ambari-metrics-collector was fully initialized.
>
>
> Diffs
> -----
>
>   ambari-metrics/ambari-metrics-timelineservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/metrics/timeline/availability/MetricCollectorHAController.java 53e6304
>
>
> Diff: https://reviews.apache.org/r/61203/diff/1/
>
>
> Testing
> -------
>
> Manually tested.
>
>
> Thanks,
>
> Aravindan Vijayan
>
>
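
For illustration, the wait-and-retry approach described under FIX could be sketched roughly as below. This is not the code from the patch: the class name ZnodeRetrySketch, the method waitUntilInitialized, and the attempt count and sleep interval are all hypothetical, and the actual ZooKeeper check is abstracted behind a BooleanSupplier so the sketch runs on its own.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of "wait and retry" instead of "delete and recreate". */
final class ZnodeRetrySketch {

    /**
     * Polls the supplied check until it reports true or the attempts run out.
     * Nothing is ever deleted: a missing child is treated as "the other
     * collector is still creating it", not as a corrupt znode.
     */
    static boolean waitUntilInitialized(BooleanSupplier isInitialized,
                                        int maxAttempts,
                                        long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isInitialized.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated check (assumption): the child znodes become visible on the third poll.
        int[] polls = {0};
        boolean ready = waitUntilInitialized(() -> ++polls[0] >= 3, 10, 100L);
        System.out.println("initialized=" + ready + " after " + polls[0] + " polls");
    }
}
```

In a real controller the supplier would wrap whatever check currently throws ZkNoNodeException, returning false instead of triggering a delete. With this shape, collector B in the flow above would simply wait for A to finish creating C1, C2, C3 rather than removing A's work.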