[
https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971490#comment-14971490
]
Haohui Mai commented on HDFS-9129:
----------------------------------
{code}
+ * The state machine is briefly elaborated in the following diagram.
Specially
+ * the start status is always INITIALIZED and the end status is always OFF.
+ * There is no transition to INITIALIZED and no transition from OFF. Once
+ * entered, it will not leave THRESHOLD status until the block and datanode
+ * thresholds are met. Similarly, it will not leave EXTENSION status until
the
+ * thresholds are met and extension period is reached.
+ * ________
+ * / \
+ * thresholds not met / |
+ * INITIALIZED ---------------------------------> THRESHOLD <--`
+ * | / |
+ * | / |
+ * | / |
+ * | thresholds met / |
+ * | & / | thresholds
met
+ * thresholds | no need extension / | &
+ * met | .---------------------------` | need
extension
+ * | / |
+ * | / |
+ * | / |
+ * | / |
+ * | / |
+ * V |/_ V
+ * OFF <--------------------------------------- EXTENSION <---.
+ * thresholds met \ |
+ * & \________/
+ * extension reached
{code}
It does not give much information compared to figuring out the issues on the
code directly. What does "thresholds met" / "extensions reached" mean? It
causes more confusions than explanations.
{code}
LOG.error("Non-recognized block manager safe mode status: {}", status);
{code}
Should be an assert.
{code}
/**
* If the NN is in safemode, and not due to manual / low resources, we
* assume it must be because of startup. If the NN had low resources
during
* startup, we assume it came out of startup safemode and it is now
in low
* resources safemode.
*/
private volatile boolean isInManualSafeMode = false;
private volatile boolean isInResourceLowSafeMode = false;
...
isInManualSafeMode = !resourcesLow;
isInResourceLowSafeMode = resourcesLow;
{code}
How do these two variables synchronize? Is the system in consistent state in
the middle of the execution?
{code}
+ bmSafeMode = new BlockManagerSafeMode(bm, fsn, conf);
+ assertEquals(BMSafeModeStatus.INITIALIZED, getSafeModeStatus());
+ assertFalse(bmSafeMode.isInSafeMode());
+ // INITIALIZED -> THRESHOLD
+ bmSafeMode.setBlockTotal(BLOCK_TOTAL);
+ assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+ assertTrue(bmSafeMode.isInSafeMode());
{code}
It makes sense to put it in a test instead of in the {{@Before}} method.
{code}
+ // INITIALIZED -> OFF
+ Whitebox.setInternalState(bmSafeMode, "status",
+ BMSafeModeStatus.INITIALIZED);
+ reachBlockThreshold();
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
+
+ // INITIALIZED -> THRESHOLD
+ Whitebox.setInternalState(bmSafeMode, "status",
+ BMSafeModeStatus.INITIALIZED);
+ Whitebox.setInternalState(bmSafeMode, "blockSafe", 0);
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+
+ // stays in THRESHOLD: pending block threshold
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.THRESHOLD);
+ for (long i = 0; i < BLOCK_THRESHOLD; i++) {
+ Whitebox.setInternalState(bmSafeMode, "blockSafe", i);
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+ }
+
+ // THRESHOLD -> EXTENSION
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.THRESHOLD);
+ reachBlockThreshold();
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.EXTENSION, getSafeModeStatus());
+ Whitebox.setInternalState(bmSafeMode, "smmthread", null);
+
+ // THRESHOLD -> OFF
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.THRESHOLD);
+ reachBlockThreshold();
+ Whitebox.setInternalState(bmSafeMode, "needExtension", false);
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
+
+ // stays in EXTENSION: pending thresholds
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.EXTENSION);
+ Whitebox.setInternalState(bmSafeMode, "blockSafe", 0);
+ reachExtension();
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.EXTENSION, getSafeModeStatus());
+
+ // stays in EXTENSION: pending extension period
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.EXTENSION);
+ reachBlockThreshold();
+ Whitebox.setInternalState(bmSafeMode, "extension", Integer.MAX_VALUE);
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.EXTENSION, getSafeModeStatus());
+
+ // EXTENSION -> OFF
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.EXTENSION);
+ reachBlockThreshold();
+ reachExtension();
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
{code}
Please refactor the code -- you can reuse the {{getSafeModeStatus()}} that is
defined by the class.
{code}
assertEquals(getBlockSafe(), i);
{code}
The expected value should go first according to the function signature.
A higher level question is that why it needs so many {{getInternalState()}}
statements? It looks to me that many of these behaviors can be observed outside
without whitebox testing.
> Move the safemode block count into BlockManager
> -----------------------------------------------
>
> Key: HDFS-9129
> URL: https://issues.apache.org/jira/browse/HDFS-9129
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Haohui Mai
> Assignee: Mingliang Liu
> Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch,
> HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch,
> HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch,
> HDFS-9129.008.patch, HDFS-9129.009.patch
>
>
> The {{SafeMode}} needs to track whether there are enough blocks so that the
> NN can get out of the safemode. These fields can moved to the
> {{BlockManager}} class.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)