ArafatKhan2198 commented on code in PR #10384:
URL: https://github.com/apache/ozone/pull/10384#discussion_r3386212205


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconStorageContainerManagerFacade.java:
##########
@@ -896,22 +897,35 @@ public boolean triggerTargetedSCMContainerSync() {
     }
   }
 
-  private boolean runTargetedSyncWithMetrics() {
+  public boolean syncWithSCMContainerInfo() {

Review Comment:
   The facade now has both:
   
   triggerSCMContainerSync()
   syncWithSCMContainerInfo()
   
   Both appear to do almost the same thing: check if sync is already running, 
then call runScmContainerSyncWithMetrics().



##########
hadoop-ozone/integration-test-recon/src/test/java/org/apache/hadoop/ozone/recon/TestNSSummaryMemoryLeak.java:
##########
@@ -240,7 +240,7 @@ public void testNSSummaryCleanupOnHardDelete() throws 
Exception {
     // This simulates the background process that hard deletes entries
     simulateHardDelete(omMetadataManager);
     
-    // Verify cleanup after hard delete simulation.
+    // Verify memory leak fix - NSSummary entries should be cleaned up

Review Comment:
   The `TestNSSummaryMemoryLeak` change looks unrelated to SCM container sync 
metrics. If this is required to stabilize the test, can we mention the reason 
in the PR description? Otherwise, it may be better to move it to a separate PR.



##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/scm/TestReconStorageContainerSyncHelper.java:
##########
@@ -47,16 +54,51 @@ class TestReconStorageContainerSyncHelper {
   private final ReconContainerManager mockContainerManager =
       mock(ReconContainerManager.class);
 
-  private final ReconStorageContainerSyncHelper syncHelper;
+  private ReconScmContainerSyncMetrics metrics;
+  private ReconStorageContainerSyncHelper syncHelper;
 
-  TestReconStorageContainerSyncHelper() {
+  @BeforeEach
+  void setUp() {
+    metrics = ReconScmContainerSyncMetrics.create();
     syncHelper = new ReconStorageContainerSyncHelper(
         mockScmServiceProvider,
         new OzoneConfiguration(),
-        mockContainerManager
+        mockContainerManager,
+        metrics
     );
   }
 
+  @AfterEach
+  void tearDown() {
+    metrics.unRegister();
+  }
+
+  @Test
+  void testContainerSyncMetricsTrackPreSyncDriftAndDuration() throws Exception 
{
+    when(mockScmServiceProvider.getContainerCount(OPEN)).thenReturn(5L);
+    
when(mockScmServiceProvider.getContainerCount(QUASI_CLOSED)).thenReturn(1L);
+    when(mockScmServiceProvider.getContainerCount(CLOSED)).thenReturn(8L);
+    when(mockScmServiceProvider.getContainerCount(DELETED)).thenReturn(9L);
+    when(mockContainerManager.getContainerStateCount(OPEN)).thenReturn(3);
+    
when(mockContainerManager.getContainerStateCount(QUASI_CLOSED)).thenReturn(1);
+    when(mockContainerManager.getContainerStateCount(CLOSED)).thenReturn(10);
+    when(mockContainerManager.getContainerStateCount(DELETED)).thenReturn(7);
+    when(mockScmServiceProvider.getListOfContainerIDs(
+        any(), any(Integer.class), any())).thenReturn(Collections.emptyList());
+
+    boolean result = syncHelper.syncWithSCMContainerInfo();
+
+    assertTrue(result);
+    assertEquals(2L, metrics.getContainerCountDrift(OPEN));
+    assertEquals(0L, metrics.getContainerCountDrift(QUASI_CLOSED));
+    assertEquals(-2L, metrics.getContainerCountDrift(CLOSED));
+    assertEquals(2L, metrics.getContainerCountDrift(DELETED));
+    assertTrue(metrics.getContainerSyncDurationMs(OPEN) >= 0);

Review Comment:
   The duration assertions only check `>= 0`, so they would pass even if the 
value stayed at the default `0`. This is okay as a lightweight check, but 
stronger facade-level tests around the `finally` duration update would give 
better coverage.



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/metrics/ReconScmContainerSyncMetrics.java:
##########
@@ -75,19 +103,115 @@ public void unRegister() {
     ms.unregisterSource(SOURCE_NAME);
   }
 
-  public void setTargetedSyncStatus(int status) {
-    targetedSyncStatus.set(status);
+  public void setScmContainerSyncStatus(int status) {
+    scmContainerSyncStatus.set(status);
+  }
+
+  public void setScmContainerSyncDurationMs(long durationMs) {
+    scmContainerSyncDurationMs.set(durationMs);
+  }
+
+  public void setContainerSyncDurationMs(
+      HddsProtos.LifeCycleState state, long durationMs) {
+    setStateGauge(containerSyncDurationMs, state, durationMs);
+  }
+
+  public void setContainerCountDrift(
+      HddsProtos.LifeCycleState state, long drift) {
+    setStateGauge(containerCountDrift, state, drift);
+  }
+
+  public int getScmContainerSyncStatus() {
+    return scmContainerSyncStatus.get();
+  }
+
+  public long getScmContainerSyncDurationMs() {
+    return scmContainerSyncDurationMs.get();
+  }
+
+  public long getContainerSyncDurationMs(
+      HddsProtos.LifeCycleState state) {
+    return getStateGauge(containerSyncDurationMs, state);
+  }
+
+  public long getContainerCountDrift(
+      HddsProtos.LifeCycleState state) {
+    return getStateGauge(containerCountDrift, state);
+  }
+
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector.addRecord(SOURCE_NAME);
+    builder.addGauge(SCM_CONTAINER_SYNC_STATUS, getScmContainerSyncStatus());
+    builder.addGauge(SCM_CONTAINER_SYNC_DURATION_MS,
+        getScmContainerSyncDurationMs());
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      builder.addGauge(containerSyncDurationMetricInfo.get(state),
+          getContainerSyncDurationMs(state));
+      builder.addGauge(containerCountDriftMetricInfo.get(state),
+          getContainerCountDrift(state));
+    }
+  }
+
+  private static Map<HddsProtos.LifeCycleState, AtomicLong>
+      initStateGaugeValues() {
+    Map<HddsProtos.LifeCycleState, AtomicLong> gauges =
+        new EnumMap<>(HddsProtos.LifeCycleState.class);
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      gauges.put(state, new AtomicLong());
+    }
+    return Collections.unmodifiableMap(gauges);
+  }
+
+  private static Map<HddsProtos.LifeCycleState, MetricsInfo>
+      initSyncDurationMetricInfo() {
+    Map<HddsProtos.LifeCycleState, MetricsInfo> metrics =
+        new EnumMap<>(HddsProtos.LifeCycleState.class);
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      String stateName = metricStateName(state);
+      metrics.put(state, Interns.info(
+          CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_CAMEL, stateName)
+              + "ContainerSyncDurationMs",
+          "Time taken by the " + stateName
+              + " container sync pass in milliseconds"));
+    }
+    return Collections.unmodifiableMap(metrics);
+  }
+
+  private static Map<HddsProtos.LifeCycleState, MetricsInfo>
+      initCountDriftMetricInfo() {
+    Map<HddsProtos.LifeCycleState, MetricsInfo> metrics =
+        new EnumMap<>(HddsProtos.LifeCycleState.class);
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      String stateName = metricStateName(state);
+      metrics.put(state, Interns.info(
+          CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_CAMEL, stateName)
+              + "ContainerCountDrift",
+          "Container count drift observed at start of sync pass "

Review Comment:
   Clarify that this is the last successfully observed drift.
   Change the metric description from:
   
   ```
   "Container count drift observed at start of sync pass "
   +"(SCM count minus Recon count for "+stateName+" state)."
   ```
   
   to something like:
   
   ```
   "Last successfully observed container count drift at start of sync pass "
   +"(SCM count minus Recon count for "+stateName+" state)."
   ```
   
   This makes it clear that the value may be from the last successful drift 
calculation, not necessarily the latest attempted sync.
   
   Reason - If updating the container count drift fails, the previous drift 
value remains visible. Is this intended to mean “last successfully observed 
drift”? If yes, can we make that behavior clear in the metric description or 
add a comment? 



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/metrics/ReconScmContainerSyncMetrics.java:
##########
@@ -75,19 +103,115 @@ public void unRegister() {
     ms.unregisterSource(SOURCE_NAME);
   }
 
-  public void setTargetedSyncStatus(int status) {
-    targetedSyncStatus.set(status);
+  public void setScmContainerSyncStatus(int status) {
+    scmContainerSyncStatus.set(status);
+  }
+
+  public void setScmContainerSyncDurationMs(long durationMs) {
+    scmContainerSyncDurationMs.set(durationMs);
+  }
+
+  public void setContainerSyncDurationMs(
+      HddsProtos.LifeCycleState state, long durationMs) {
+    setStateGauge(containerSyncDurationMs, state, durationMs);
+  }
+
+  public void setContainerCountDrift(
+      HddsProtos.LifeCycleState state, long drift) {
+    setStateGauge(containerCountDrift, state, drift);
+  }
+
+  public int getScmContainerSyncStatus() {
+    return scmContainerSyncStatus.get();
+  }
+
+  public long getScmContainerSyncDurationMs() {
+    return scmContainerSyncDurationMs.get();
+  }
+
+  public long getContainerSyncDurationMs(
+      HddsProtos.LifeCycleState state) {
+    return getStateGauge(containerSyncDurationMs, state);
+  }
+
+  public long getContainerCountDrift(
+      HddsProtos.LifeCycleState state) {
+    return getStateGauge(containerCountDrift, state);
+  }
+
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector.addRecord(SOURCE_NAME);
+    builder.addGauge(SCM_CONTAINER_SYNC_STATUS, getScmContainerSyncStatus());
+    builder.addGauge(SCM_CONTAINER_SYNC_DURATION_MS,
+        getScmContainerSyncDurationMs());
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      builder.addGauge(containerSyncDurationMetricInfo.get(state),
+          getContainerSyncDurationMs(state));
+      builder.addGauge(containerCountDriftMetricInfo.get(state),
+          getContainerCountDrift(state));
+    }
+  }
+
+  private static Map<HddsProtos.LifeCycleState, AtomicLong>
+      initStateGaugeValues() {
+    Map<HddsProtos.LifeCycleState, AtomicLong> gauges =
+        new EnumMap<>(HddsProtos.LifeCycleState.class);
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      gauges.put(state, new AtomicLong());
+    }
+    return Collections.unmodifiableMap(gauges);
+  }
+
+  private static Map<HddsProtos.LifeCycleState, MetricsInfo>
+      initSyncDurationMetricInfo() {
+    Map<HddsProtos.LifeCycleState, MetricsInfo> metrics =
+        new EnumMap<>(HddsProtos.LifeCycleState.class);
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      String stateName = metricStateName(state);
+      metrics.put(state, Interns.info(
+          CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_CAMEL, stateName)
+              + "ContainerSyncDurationMs",
+          "Time taken by the " + stateName
+              + " container sync pass in milliseconds"));
+    }
+    return Collections.unmodifiableMap(metrics);
+  }
+
+  private static Map<HddsProtos.LifeCycleState, MetricsInfo>
+      initCountDriftMetricInfo() {
+    Map<HddsProtos.LifeCycleState, MetricsInfo> metrics =
+        new EnumMap<>(HddsProtos.LifeCycleState.class);
+    for (HddsProtos.LifeCycleState state : SYNC_STATES) {
+      String stateName = metricStateName(state);
+      metrics.put(state, Interns.info(
+          CaseFormat.UPPER_CAMEL.to(CaseFormat.LOWER_CAMEL, stateName)
+              + "ContainerCountDrift",
+          "Container count drift observed at start of sync pass "
+              + "(SCM count minus Recon count for " + stateName + " state)."));
+    }
+    return Collections.unmodifiableMap(metrics);
   }
 
-  public void setLastTargetedSyncDurationMs(long durationMs) {
-    lastTargetedSyncDurationMs.set(durationMs);
+  private static String metricStateName(HddsProtos.LifeCycleState state) {
+    return CaseFormat.UPPER_UNDERSCORE.to(
+        CaseFormat.UPPER_CAMEL, state.name());
   }
 
-  public int getTargetedSyncStatus() {
-    return targetedSyncStatus.value();

Review Comment:
   Let me know if I am missing something.
   
   The PR description says the existing overall metrics remain unchanged, but 
the patch appears to rename `targetedSyncStatus` and 
`lastTargetedSyncDurationMs` to `scmContainerSyncStatus` and 
`scmContainerSyncDurationMs`.
   
   This can break existing Prometheus/Grafana dashboards or alerts. Can we 
either keep the old metric names as aliases or update the PR description to 
clearly call out this compatibility change?



##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/scm/TestReconStorageContainerSyncHelper.java:
##########
@@ -47,16 +54,51 @@ class TestReconStorageContainerSyncHelper {
   private final ReconContainerManager mockContainerManager =
       mock(ReconContainerManager.class);
 
-  private final ReconStorageContainerSyncHelper syncHelper;
+  private ReconScmContainerSyncMetrics metrics;
+  private ReconStorageContainerSyncHelper syncHelper;
 
-  TestReconStorageContainerSyncHelper() {
+  @BeforeEach
+  void setUp() {
+    metrics = ReconScmContainerSyncMetrics.create();
     syncHelper = new ReconStorageContainerSyncHelper(
         mockScmServiceProvider,
         new OzoneConfiguration(),
-        mockContainerManager
+        mockContainerManager,
+        metrics
     );
   }
 
+  @AfterEach
+  void tearDown() {
+    metrics.unRegister();
+  }
+
+  @Test
+  void testContainerSyncMetricsTrackPreSyncDriftAndDuration() throws Exception 
{
+    when(mockScmServiceProvider.getContainerCount(OPEN)).thenReturn(5L);
+    
when(mockScmServiceProvider.getContainerCount(QUASI_CLOSED)).thenReturn(1L);
+    when(mockScmServiceProvider.getContainerCount(CLOSED)).thenReturn(8L);
+    when(mockScmServiceProvider.getContainerCount(DELETED)).thenReturn(9L);
+    when(mockContainerManager.getContainerStateCount(OPEN)).thenReturn(3);
+    
when(mockContainerManager.getContainerStateCount(QUASI_CLOSED)).thenReturn(1);
+    when(mockContainerManager.getContainerStateCount(CLOSED)).thenReturn(10);
+    when(mockContainerManager.getContainerStateCount(DELETED)).thenReturn(7);
+    when(mockScmServiceProvider.getListOfContainerIDs(
+        any(), any(Integer.class), any())).thenReturn(Collections.emptyList());
+
+    boolean result = syncHelper.syncWithSCMContainerInfo();
+
+    assertTrue(result);
+    assertEquals(2L, metrics.getContainerCountDrift(OPEN));

Review Comment:
   Just a thought!
   Let me know if I missed to spot them or if they are really necessary or not.
   
   Can we add a few high-level tests for the overall SCM container sync metrics?
   
   The current tests cover the per-state metrics, like OPEN/CLOSED/DELETED 
drift and duration. But `scmContainerSyncStatus` is the main overall health 
signal for admins, so it should also be tested.
   
   Simple cases to cover:
   
   1. When sync succeeds, `scmContainerSyncStatus` should become SUCCESS.
   2. When sync returns false, `scmContainerSyncStatus` should become FAILURE.
   3. When sync throws an exception, `scmContainerSyncStatus` should still 
become FAILURE.
   4. `scmContainerSyncDurationMs` should be updated even when sync fails, 
because it is set in the `finally` block.
   
   This would verify that the high-level metric correctly tells 
Grafana/Prometheus whether the overall Recon-SCM container sync is healthy or 
failing.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to