[jira] [Commented] (HDFS-9129) Move the safemode block count into BlockManager

Haohui Mai (JIRA) Fri, 23 Oct 2015 11:08:48 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971490#comment-14971490
 ]


Haohui Mai commented on HDFS-9129:
----------------------------------

{code}
+   * The state machine is briefly elaborated in the following diagram. 
Specially
+   * the start status is always INITIALIZED and the end status is always OFF.
+   * There is no transition to INITIALIZED and no transition from OFF. Once
+   * entered, it will not leave THRESHOLD status until the block and datanode
+   * thresholds are met. Similarly, it will not leave EXTENSION status until 
the
+   * thresholds are met and extension period is reached.
+   *                                                             ________
+   *                                                            /        \
+   *                              thresholds not met           /         |
+   *         INITIALIZED ---------------------------------> THRESHOLD <--`
+   *            |                                          /    |
+   *            |                                         /     |
+   *            |                                        /      |
+   *            |             thresholds met            /       |
+   *            |                   &                  /        | thresholds 
met
+   * thresholds |            no need extension        /         |       &
+   *    met     |        .---------------------------`          | need 
extension
+   *            |       /                                       |
+   *            |      /                                        |
+   *            |     /                                         |
+   *            |    /                                          |
+   *            |   /                                           |
+   *            V |/_                                           V
+   *           OFF <--------------------------------------- EXTENSION <---.
+   *                           thresholds met                  \          |
+   *                                 &                          \________/
+   *                          extension reached
{code}

It does not give much information compared to figuring out the issues on the 
code directly. What does "thresholds met" / "extensions reached" mean? It 
causes more confusions than explanations.

{code}
LOG.error("Non-recognized block manager safe mode status: {}", status);
{code}

Should be an assert.

{code}
  /**
           * If the NN is in safemode, and not due to manual / low resources, we
           * assume it must be because of startup. If the NN had low resources 
during
           * startup, we assume it came out of startup safemode and it is now 
in low
           * resources safemode.
           */
private volatile boolean isInManualSafeMode = false;
private volatile boolean isInResourceLowSafeMode = false;

...
isInManualSafeMode = !resourcesLow;
isInResourceLowSafeMode = resourcesLow;
{code}

How do these two variables synchronize? Is the system in consistent state in 
the middle of the execution?

{code}
+    bmSafeMode = new BlockManagerSafeMode(bm, fsn, conf);
+    assertEquals(BMSafeModeStatus.INITIALIZED, getSafeModeStatus());
+    assertFalse(bmSafeMode.isInSafeMode());
+    // INITIALIZED -> THRESHOLD
+    bmSafeMode.setBlockTotal(BLOCK_TOTAL);
+    assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+    assertTrue(bmSafeMode.isInSafeMode());
{code}

It makes sense to put it in a test instead of in the {{@Before}} method.

{code}
+    // INITIALIZED -> OFF
+    Whitebox.setInternalState(bmSafeMode, "status",
+        BMSafeModeStatus.INITIALIZED);
+    reachBlockThreshold();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
+
+    // INITIALIZED -> THRESHOLD
+    Whitebox.setInternalState(bmSafeMode, "status",
+        BMSafeModeStatus.INITIALIZED);
+    Whitebox.setInternalState(bmSafeMode, "blockSafe", 0);
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+
+    // stays in THRESHOLD: pending block threshold
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.THRESHOLD);
+    for (long i = 0; i < BLOCK_THRESHOLD; i++) {
+      Whitebox.setInternalState(bmSafeMode, "blockSafe", i);
+      bmSafeMode.checkSafeMode();
+      assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+    }
+
+    // THRESHOLD -> EXTENSION
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.THRESHOLD);
+    reachBlockThreshold();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.EXTENSION, getSafeModeStatus());
+    Whitebox.setInternalState(bmSafeMode, "smmthread", null);
+
+    // THRESHOLD -> OFF
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.THRESHOLD);
+    reachBlockThreshold();
+    Whitebox.setInternalState(bmSafeMode, "needExtension", false);
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
+
+    // stays in EXTENSION: pending thresholds
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.EXTENSION);
+    Whitebox.setInternalState(bmSafeMode, "blockSafe", 0);
+    reachExtension();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.EXTENSION, getSafeModeStatus());
+
+    // stays in EXTENSION: pending extension period
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.EXTENSION);
+    reachBlockThreshold();
+    Whitebox.setInternalState(bmSafeMode, "extension", Integer.MAX_VALUE);
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.EXTENSION, getSafeModeStatus());
+
+    // EXTENSION -> OFF
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.EXTENSION);
+    reachBlockThreshold();
+    reachExtension();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
{code}

Please refactor the code -- you can reuse the {{getSafeModeStatus()}} that is 
defined by the class.

{code}
assertEquals(getBlockSafe(), i);
{code}

The expected value should go first according to the function signature.

A higher level question is that why it needs so many {{getInternalState()}} 
statements? It looks to me that many of these behaviors can be observed outside 
without whitebox testing.

> Move the safemode block count into BlockManager
> -----------------------------------------------
>
>                 Key: HDFS-9129
>                 URL: https://issues.apache.org/jira/browse/HDFS-9129
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Haohui Mai
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, 
> HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, 
> HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch, 
> HDFS-9129.008.patch, HDFS-9129.009.patch
>
>
> The {{SafeMode}} needs to track whether there are enough blocks so that the 
> NN can get out of the safemode. These fields can moved to the 
> {{BlockManager}} class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-9129) Move the safemode block count into BlockManager

Reply via email to